What is calibration?
A plain-English explainer of the canonical accuracy metric for probability forecasts, the math behind the Brier score, and the empirical reason calibration alone barely predicts profit on Polymarket.
The two-paragraph version
Calibration is the property of a probability forecaster whose stated probabilities match the long-run frequency of the predicted outcomes. A forecaster who says 70% across a long series of events is well-calibrated if those events actually resolve yes about 70% of the time. The Brier score formalizes forecast accuracy as the mean squared error between the stated probability and the realized outcome (coded 0 or 1). Lower Brier means better overall accuracy, of which calibration is one component. The metric was introduced by Glenn Brier in 1950 for weather forecasting and became the standard scalar measure of forecasting accuracy after the Murphy (1973) decomposition split it into reliability, resolution, and uncertainty components. Tetlock's Good Judgment Project used it as the primary outcome measure in the IARPA forecasting tournament that identified the superforecaster cohort.
On prediction markets, calibration is a remarkably weak predictor of profit. Across the full 8,656-wallet Polymarket cohort, the Spearman rank correlation between calibration and realized PnL is only +0.148. Among the top 100 wallets by realized PnL the relationship flips sign: worse-calibrated wallets in that group earn more (Spearman +0.42 between raw Brier and PnL in the Convexly whale audit). The fundamental reason is that Polymarket PnL is fat-tailed (Hill tail index = 1.28, below the alpha = 2 threshold above which sample variance is well-behaved), so a few large concentrated positions dominate realized profit. A trader who is well-calibrated but spreads tiny bets across many markets does not capture much of the available edge. This empirical fact is the entire reason Edge Score has three pillars instead of one, and the reason Convexly renamed its calibration pillar to posture.
What a Brier score actually measures
For a sequence of N binary forecasts, the Brier score is the mean squared error between the stated probability and the binary outcome:
BS = (1/N) · ∑ᵢ₌₁ᴺ (pᵢ − oᵢ)²
For a single forecast of probability p on an outcome o coded 0 or 1, the contribution to the average is (p − o)². The range is [0, 1] for binary outcomes. A perfect forecast (p exactly equal to o) contributes 0; the worst possible forecast (p = 1 when o = 0, or p = 0 when o = 1) contributes 1. A naive forecaster who always predicts 0.5 scores 0.25 across any binary sequence. A forecaster who always predicts the empirical base rate p̄ of the sequence scores exactly the outcome variance, p̄(1 − p̄).
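In code, the score is just a mean of squared errors. A minimal sketch (the function name and sample data are illustrative, not Convexly's implementation):

```python
def brier_score(probs, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

outcomes = [1, 0, 1, 1, 0]

# The naive always-50% forecaster scores 0.25 on any binary sequence.
print(brier_score([0.5] * 5, outcomes))  # 0.25

# A sharper forecaster scores lower (here about 0.038).
print(brier_score([0.9, 0.2, 0.7, 0.8, 0.1], outcomes))
```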
Three reference numbers for context. On the Good Judgment Project's IARPA geopolitical tournament, the median forecaster scored around 0.20 and the top 2% (the superforecaster cohort) scored around 0.16. In weather forecasting, the US National Weather Service's published next-day precipitation probabilities score around 0.10. Prediction markets sit between these extremes, but the absolute Brier number depends heavily on the category mix (sports vs politics vs crypto vs weather), so cross-trader comparisons require a baseline adjustment.
Why calibration alone fails to predict profit on Polymarket
If calibration were the dominant skill on prediction markets, Brier score alone would predict profit rank cleanly. It does not. The Convexly V1 cohort study (8,656 wallets, sampled April 15–16, 2026 from the Polymarket profit leaderboard) found a Spearman rank correlation of +0.148 between calibration and realized signed log PnL. In Cohen's effect-size convention, that is a small effect. The relationship is real, but it is not a dominant signal.
Two empirical reasons explain the gap. First, Polymarket realized PnL is fat-tailed: the Hill tail index estimated on the cohort is 1.28, below the alpha = 2 threshold above which sample variance is well-behaved. Below alpha = 1 the population mean of per-user PnL is formally undefined. In a distribution this fat-tailed, realized rank is dominated by a small number of very large positions; the per-trade accuracy of the forecaster matters less than the accuracy, weighted by stake, of the small subset of trades where the forecaster concentrated. Second, in the top tail of the distribution the relationship between calibration and PnL flips sign: in the whale audit on the top-100-by-PnL cohort, the Spearman between raw Brier and PnL is +0.42, meaning that within the most-profitable cohort, worse-calibrated wallets earn more. The interpretation is that the top wallets take more positions on lower-probability events that pay off more when they hit.
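The Hill tail index quoted above can be estimated from the top order statistics of the positive PnL magnitudes. A minimal sketch of the standard Hill estimator; the choice of k (how deep into the tail to look) is an assumption here, and in practice it is usually picked from a Hill plot rather than fixed in advance:

```python
import math
import random

def hill_tail_index(values, k):
    """Hill estimator of the tail index alpha from the top k order statistics.
    Expects positive values, e.g. absolute PnL of the largest positions."""
    xs = sorted((v for v in values if v > 0), reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must be between 1 and the number of positive values minus 1")
    # alpha_hat = k / sum of log-ratios against the (k+1)-th order statistic
    return k / sum(math.log(xs[i] / xs[k]) for i in range(k))

# Sanity check on a synthetic Pareto(alpha = 1.3) sample drawn by inverse CDF:
# the estimate should land near 1.3 for a reasonable k.
random.seed(0)
sample = [random.random() ** (-1 / 1.3) for _ in range(100_000)]
print(hill_tail_index(sample, k=1_000))  # close to 1.3
```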
The implication is not that calibration is useless. It is that calibration is one input to profit on a fat-tailed market, not the dominant input. A complete skill measure has to include behavioral pillars that capture the concentration and discipline patterns associated with large positive realized PnL on the training cohort. That is the empirical motivation behind the three-pillar Edge Score composite.
Baseline-adjusted Brier (skill-Brier)
Raw Brier scores are not directly comparable across traders who saw different markets. A trader who only bets on 50/50 toss-up markets has a different baseline Brier than a trader who bets on extreme-base-rate markets. The fix is the same fix used in any forecasting domain with heterogeneous baselines: subtract the trivial baseline.
Convexly defines skill-Brier as observed Brier minus the wallet's own marginal-frequency Brier (the Brier a trivial always-predict-the-base-rate forecaster would have scored on the same set of events). Negative values mean the wallet beats the baseline; positive values mean the wallet is worse than the baseline. This is the standard way to make Brier comparable across forecasters, and it aligns with how skill scores are constructed in the weather forecasting literature (Brier Skill Score, BSS).
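Under the definition above, the baseline is the Brier score of a forecaster who always predicts the wallet's own empirical base rate, which works out to p̄(1 − p̄). A minimal sketch of the subtraction (names illustrative, not Convexly's implementation):

```python
def skill_brier(probs, outcomes):
    """Observed Brier minus the marginal-frequency baseline Brier.
    Negative = beats the baseline; positive = worse than the baseline."""
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    observed = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n
    # Always predicting the base rate scores exactly p_bar * (1 - p_bar).
    baseline = base_rate * (1 - base_rate)
    return observed - baseline

# A forecaster who beats the always-base-rate baseline gets a negative score
# (here about -0.202: observed 0.038 minus baseline 0.24).
print(skill_brier([0.9, 0.2, 0.7, 0.8, 0.1], [1, 0, 1, 1, 0]))
```

A wallet that literally predicts its own base rate on every market scores exactly 0 by construction.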
Skill-Brier is the input to the Convexly posture pillar. The pillar is the standardized negation of skill-Brier (z(-skill_brier)) with a frozen OLS coefficient of +0.79 in the composite.
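As described, the pillar is a cohort z-score of the negated skill-Brier, scaled by the frozen coefficient. A minimal sketch, assuming the z-score is taken over a cohort of wallet-level skill-Brier values (the actual standardization cohort and pipeline are defined in the methodology paper, not here):

```python
from statistics import mean, pstdev

def posture_pillar(cohort_skill_briers, coef=0.79):
    """z(-skill_brier) for each wallet, scaled by the frozen composite coefficient."""
    negs = [-s for s in cohort_skill_briers]
    mu, sigma = mean(negs), pstdev(negs)
    return [coef * (x - mu) / sigma for x in negs]

# Two wallets, one beating its baseline (-0.1) and one trailing it (+0.1):
print(posture_pillar([-0.1, 0.1]))  # approximately [0.79, -0.79]
```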
Why Convexly renamed the pillar from calibration to posture
The OLS coefficient on z(-skill_brier) in the V1 composite is +0.79. That sign rewards higher standardized values of negative-skill-Brier, which corresponds to worse calibration relative to the marginal-frequency baseline. Labeling that pillar "calibration" would have been misleading: the pillar does not reward forecasting accuracy in the traditional sense.
Renaming the pillar to posture in the V1 paper aligned the label with the direction of the effect. Posture captures whatever the sign-aligned component of skill-Brier tracks on this cohort, without overclaiming that the pillar measures calibration in the sense an external forecasting practitioner would expect the word to mean. The renaming is documented in detail in the methodology paper and in the upcoming /research/why-posture-not-calibration explainer.
Calibration vs resolution (the Murphy decomposition)
The Brier score decomposes into three components (Murphy 1973):
BS = Reliability − Resolution + Uncertainty
Reliability measures how close the forecaster's stated probabilities are to the realized base rates within each forecast bin. Lower reliability is better. It is what most non-technical readers mean by calibration. Resolution measures how varied the forecaster's predictions are across outcome categories: how well the forecaster discriminates events that did happen from events that did not. Higher resolution is better. Uncertainty is the irreducible variance of the outcome itself; it does not depend on the forecaster.
A forecaster who always predicts 50% on a 50/50 domain is perfectly calibrated (zero reliability error) but has zero resolution. A forecaster who always predicts the empirical base rate is perfectly calibrated on the marginal but adds no information. The forecasters who look good on a Brier ranking are those with good reliability AND high resolution. On Polymarket, the top-PnL wallets often score well on resolution while being mid-pack on reliability, because their edge comes from picking the small number of mispriced events correctly rather than from being perfectly calibrated across the whole catalog.
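The decomposition is easy to verify numerically. A minimal sketch that bins forecasts by their exact stated probability (real calibration audits usually bucket into, say, deciles; the identity BS = reliability − resolution + uncertainty holds exactly when bin means are used):

```python
from collections import defaultdict

def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty); BS = rel - res + unc."""
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        bins[p].append(o)
    rel = sum(len(os) * (p - sum(os) / len(os)) ** 2 for p, os in bins.items()) / n
    res = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in bins.values()) / n
    unc = base_rate * (1 - base_rate)
    return rel, res, unc

# Always predicting 50% on a 60% base-rate sequence: a small reliability
# penalty, zero resolution, so BS = 0.01 - 0.0 + 0.24 = 0.25.
rel, res, unc = murphy_decomposition([0.5] * 5, [1, 0, 1, 1, 0])
print(rel, res, unc, rel - res + unc)
```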
What calibration does NOT do
Calibration does not bound expected PnL. A perfectly calibrated forecaster who refuses to take any position makes zero profit. Calibration also does not separate skill from luck on a single forecaster's history; the per-wallet temporal holdout that addresses this is pre-registered as the Convexly V1.5 follow-up at AsPredicted #287368. And calibration on prediction markets is venue-specific: per the V1-M cross-venue paper, the same wallet's calibration relative to baseline differs materially between Polymarket and Manifold, partly because the category mixes differ and partly because the price discovery dynamics differ.
Where the methodology lives
The V1 methodology paper (full validation suite, Fama-French bootstrap null at 10,000 permutations, derivation of skill-Brier and the posture pillar coefficient) is at /research/edge-score-methodology-v1. The cross-venue extension is at /research/edge-score-methodology-v1m. The whale-audit study that documents the Brier-PnL sign flip in the top cohort is at /research/polymarket-whale-audit. Code and reproduction scripts are in the public Convexly repository.
Score a wallet
Paste any Polymarket wallet address at the analyzer to see the wallet's raw Brier, skill-Brier (baseline-adjusted), posture percentile, and the rest of the Edge Score pillars. Free, no signup, reads public on-chain data only.
Convexly publishes new methodology research roughly every 6-8 weeks plus the /learn series on a rolling cadence. Get the next paper in your inbox when it ships:
Frequently asked
What is calibration in forecasting?
How is the Brier score calculated?
What is a good Brier score?
Why is calibration a weak predictor of profit on Polymarket?
What is skill-Brier or baseline-adjusted Brier?
What is the difference between calibration and resolution?
Where can I check my own calibration?
Why does Convexly rename calibration to posture?
Related explainers
- /learn/edge-score: the composite skill measure that uses skill-Brier as one of its three pillars
- /learn/conviction: what concentration means, how to read a barbell PnL profile (coming soon)
- /learn/discipline: why position count predicts profit inversely on Polymarket and oppositely on Manifold (coming soon)
- /learn/kelly: fractional Kelly under fat tails, why full-Kelly is unsafe when alpha < 2 (coming soon)