Learn

What is calibration?

A plain-English explainer of the canonical accuracy metric for probability forecasts, the math behind the Brier score, and the empirical reason calibration alone barely predicts profit on Polymarket.

The two-paragraph version

Calibration is the property of a probability forecaster whose stated probabilities match the long-run frequency of the predicted outcome. A forecaster who says 70% on a long series of events is well-calibrated if the event actually resolves yes about 70% of the time. The Brier score formalizes this as the mean squared error between the stated probability and the realized outcome (coded 0 or 1). Lower Brier means better calibration. The metric was introduced by Glenn Brier in 1950 for weather forecasting and became the standard scalar measure of forecasting accuracy after the Murphy (1973) decomposition split it into reliability, resolution, and uncertainty components. Tetlock's Good Judgment Project used it as the primary outcome measure in the IARPA forecasting tournament that identified the superforecaster cohort.

On prediction markets, calibration is a remarkably weak predictor of profit. Across the full 8,656-wallet Polymarket cohort, the Spearman rank correlation between raw Brier score and realized PnL is only +0.148. Among the top 100 wallets by realized PnL the relationship actually inverts: worse-calibrated wallets in that group earn more (Spearman +0.42 in the Convexly whale audit). The fundamental reason is that Polymarket PnL is fat-tailed (Hill tail index = 1.28, below the alpha = 2 threshold at which the variance becomes finite), so a few large concentrated positions dominate realized profit. A trader who is well-calibrated but spreads tiny bets across many markets captures little of the available edge. This empirical fact is the entire reason Edge Score has three pillars instead of one, and the reason Convexly renamed its calibration pillar to posture.

What a Brier score actually measures

For a sequence of N binary forecasts, the Brier score is the mean squared error between the stated probability and the binary outcome:

BS = (1/N) · ∑ᵢ (pᵢ − oᵢ)²

For a single forecast of probability p on an outcome o coded 0 or 1, the contribution to the average is (p − o)². The range is [0, 1] for binary outcomes. A perfect forecast (p exactly equal to o) contributes 0; the worst possible forecast (p at the opposite extreme from o) contributes 1. A naive forecaster who always predicts 0.5 scores 0.25 on any binary sequence. A forecaster who always predicts the base rate of resolution scores approximately the variance of the outcome.
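The definition above is a few lines of code. A minimal Python sketch (generic, not Convexly's implementation):

```python
def brier_score(probs, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    assert len(probs) == len(outcomes) and len(probs) > 0
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

outcomes = [1, 0, 1, 1, 0]

# A naive always-0.5 forecaster scores exactly 0.25 on any binary sequence.
print(brier_score([0.5] * 5, outcomes))            # 0.25

# A sharper, mostly-correct forecaster scores much lower (≈ 0.038 here).
print(brier_score([0.9, 0.2, 0.8, 0.7, 0.1], outcomes))
```

Note that the score rewards probabilities near the realized outcome, not just picking the right side: forecasting 0.9 on a yes-resolution contributes 0.01, while an overconfident 0.99 on a no-resolution contributes 0.9801.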

Three reference numbers for context. On the Good Judgment Project's IARPA geopolitical tournament, the median forecaster scored around 0.20 and the top 2% (the superforecaster cohort) scored around 0.16. On weather forecasting, the US National Weather Service's published next-day precipitation probabilities run around 0.10. Prediction markets sit between these extremes, but the absolute Brier number depends heavily on the category mix (sports vs politics vs crypto vs weather) so cross-trader comparisons require a baseline adjustment.

Why calibration alone fails to predict profit on Polymarket

If calibration were the dominant skill on prediction markets, Brier score alone would predict profit rank cleanly. It does not. The Convexly V1 cohort study (8,656 wallets, sampled April 15–16, 2026 from the Polymarket profit leaderboard) found a Spearman rank correlation of +0.148 between raw Brier score and realized signed log PnL. In Cohen's effect-size convention, that is a small effect. The relationship is real, but it is not a dominant signal.

Two empirical reasons explain the gap. First, Polymarket realized PnL is fat-tailed: the Hill tail index estimated on the cohort is 1.28, below the alpha = 2 threshold at which the variance becomes finite. Below alpha = 1 even the population mean of per-user PnL is formally undefined. In a distribution this fat-tailed, realized rank is dominated by a small number of very large positions; the forecaster's per-trade accuracy matters less than the stake-weighted accuracy of the small subset of trades where the forecaster concentrated. Second, at the top of the distribution the relationship between Brier and PnL inverts: in the whale audit of the top-100-by-PnL cohort, the Spearman is +0.42, meaning that within the most-profitable cohort, worse-calibrated wallets earn more. The interpretation is that the top wallets take more positions on lower-probability events that pay off more when they hit.
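The tail-index claim rests on the Hill estimator, which fits alpha from the largest order statistics of the sample. A self-contained sketch on synthetic Pareto data (illustrative only: the cohort figure of 1.28 comes from the actual PnL data, and the k = 500 cutoff here is an arbitrary choice):

```python
import math
import random

def hill_alpha(values, k):
    """Hill estimator of the tail index alpha from the k largest observations."""
    xs = sorted((v for v in values if v > 0), reverse=True)
    top, threshold = xs[:k], xs[k]   # threshold = (k+1)-th largest value
    return k / sum(math.log(x / threshold) for x in top)

# Synthetic Pareto(alpha = 1.3) sample via inverse-CDF sampling: X = U^(-1/alpha).
random.seed(0)
alpha_true = 1.3
sample = [random.random() ** (-1.0 / alpha_true) for _ in range(20000)]

# Recovers a value near 1.3 — below 2, so the variance of this distribution is infinite.
print(hill_alpha(sample, k=500))
```

In practice the estimate is sensitive to the choice of k (how deep into the tail to look), which is why published tail indices should be read with a margin of error.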

The implication is not that calibration is useless. It is that calibration is one input to profit on a fat-tailed market, not the dominant input. A complete skill measure has to include behavioral pillars that capture the concentration and discipline patterns associated with large positive realized PnL on the training cohort. That is the empirical motivation behind the three-pillar Edge Score composite.

Baseline-adjusted Brier (skill-Brier)

Raw Brier scores are not directly comparable across traders who saw different markets. A trader who only bets on 50/50 toss-up markets has a different baseline Brier than a trader who bets on extreme-base-rate markets. The fix is the same fix used in any forecasting domain with heterogeneous baselines: subtract the trivial baseline.

Convexly defines skill-Brier as observed Brier minus the wallet's own marginal-frequency Brier (the Brier a trivial always-predict-the-base-rate forecaster would have scored on the same set of events). Negative values mean the wallet beats the baseline; positive values mean the wallet is worse than the baseline. This is the standard way to make Brier comparable across forecasters, and it aligns with how skill scores are constructed in the weather forecasting literature (Brier Skill Score, BSS).
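A minimal sketch of the baseline adjustment, assuming the baseline forecaster predicts the wallet's own empirical base rate on every event (the forecasts and outcomes below are made up):

```python
def brier(probs, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def skill_brier(probs, outcomes):
    """Observed Brier minus the Brier of an always-predict-the-base-rate forecaster.

    Negative = beats the baseline; positive = worse than the baseline.
    """
    base_rate = sum(outcomes) / len(outcomes)
    baseline = brier([base_rate] * len(outcomes), outcomes)
    return brier(probs, outcomes) - baseline

outcomes = [1, 1, 0, 1, 0, 1]            # base rate 4/6 ≈ 0.667
probs    = [0.8, 0.9, 0.2, 0.7, 0.3, 0.6]
print(skill_brier(probs, outcomes))      # negative: this wallet beats the baseline
```

The adjustment is what makes a wallet that only trades 50/50 toss-ups comparable with one that trades extreme-base-rate markets: both are measured against the trivial forecast on their own event set.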

Skill-Brier is the input to the Convexly posture pillar. The pillar is the standardized negation of skill-Brier (z(-skill_brier)) with a frozen OLS coefficient of +0.79 in the composite.
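The standardization step is an ordinary z-score over a cohort of skill-Brier values, negated so that beating the baseline pushes the pillar up. A sketch with hypothetical cohort numbers; only the +0.79 coefficient comes from the V1 paper:

```python
import statistics

def posture_pillar(cohort_skill_briers):
    """z-score of the negated skill-Brier across a cohort (illustrative only)."""
    negated = [-sb for sb in cohort_skill_briers]
    mu = statistics.mean(negated)
    sigma = statistics.stdev(negated)
    return [(x - mu) / sigma for x in negated]

# Hypothetical cohort skill-Brier values (negative = beats the baseline).
cohort = [-0.15, -0.02, 0.01, 0.08, -0.05]
pillar = posture_pillar(cohort)

# Each wallet's contribution to the composite under the frozen V1 coefficient.
print([round(0.79 * z, 3) for z in pillar])
```

The z-scores are mean-zero by construction, so the pillar ranks wallets within the cohort rather than assigning an absolute scale.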

Why Convexly renamed the pillar from calibration to posture

The OLS coefficient on z(-skill_brier) in the V1 composite is +0.79. That sign rewards higher standardized values of negative-skill-Brier, which corresponds to worse calibration relative to the marginal-frequency baseline. Labeling that pillar "calibration" would have been misleading: the pillar does not reward forecasting accuracy in the traditional sense.

Renaming the pillar to posture in the V1 paper aligned the label with the direction of the effect. Posture captures whatever the sign-aligned component of skill-Brier tracks on this cohort, without overclaiming that the pillar measures calibration in the sense a forecasting practitioner outside the project would expect. The renaming is documented in detail in the methodology paper and in the upcoming /research/why-posture-not-calibration explainer.

Calibration vs resolution (the Murphy decomposition)

The Brier score decomposes into three components (Murphy 1973):

BS = Reliability − Resolution + Uncertainty

Reliability measures how close the forecaster's stated probabilities are to the realized base rates within each forecast bin. Lower reliability is better. It is what most non-technical readers mean by calibration. Resolution measures how varied the forecaster's predictions are across outcome categories: how well the forecaster discriminates events that did happen from events that did not. Higher resolution is better. Uncertainty is the irreducible variance of the outcome itself; it does not depend on the forecaster.

A forecaster who always predicts 50% on a 50/50 domain is perfectly calibrated (zero reliability error) but has zero resolution. A forecaster who always predicts the empirical base rate is perfectly calibrated on the marginal but adds no information. The forecasters who look good on a Brier ranking are those with good reliability AND high resolution. On Polymarket, the top-PnL wallets often score well on resolution while being mid-pack on reliability, because their edge comes from picking the small number of mispriced events correctly rather than from being perfectly calibrated across the whole catalog.
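The decomposition can be verified numerically. A sketch that bins forecasts by their distinct stated values, which makes the identity exact (toy data, not cohort data):

```python
from collections import defaultdict

def murphy_decomposition(probs, outcomes):
    """Split the Brier score into reliability, resolution, and uncertainty.

    Binning by distinct forecast values makes BS = REL - RES + UNC exact.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)
    # Reliability: forecast vs within-bin hit rate (lower is better).
    rel = sum(len(os) * (p - sum(os) / len(os)) ** 2 for p, os in groups.items()) / n
    # Resolution: within-bin hit rate vs overall base rate (higher is better).
    res = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in groups.values()) / n
    # Uncertainty: variance of the outcome itself; forecaster-independent.
    unc = base_rate * (1 - base_rate)
    return rel, res, unc

probs    = [0.7, 0.7, 0.7, 0.3, 0.3, 0.5, 0.5, 0.5]
outcomes = [1,   1,   0,   0,   0,   1,   0,   1]
rel, res, unc = murphy_decomposition(probs, outcomes)
bs = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
print(rel - res + unc, bs)   # the two numbers agree
```

With real-valued market prices the bins are usually coarser (e.g. deciles), which makes the identity approximate; the exact-binning version above is the cleanest way to see the three terms.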

What calibration does NOT do

Calibration does not bound expected PnL. A perfectly calibrated forecaster who refuses to take any position makes zero profit. Calibration also does not separate skill from luck on a single forecaster's history; the per-wallet temporal holdout that addresses this is pre-registered as the Convexly V1.5 follow-up at AsPredicted #287368. And calibration on prediction markets is venue-specific: per the V1-M cross-venue paper, the same wallet's calibration relative to baseline differs materially between Polymarket and Manifold, partly because the category mixes differ and partly because the price discovery dynamics differ.

Where the methodology lives

The V1 methodology paper (full validation suite, Fama-French bootstrap null at 10,000 permutations, derivation of skill-Brier and the posture pillar coefficient) is at /research/edge-score-methodology-v1. The cross-venue extension is at /research/edge-score-methodology-v1m. The whale-audit study that documents the Brier-PnL sign flip in the top cohort is at /research/polymarket-whale-audit. Code and reproduction scripts are in the public Convexly repository.

Score a wallet

Paste any Polymarket wallet address at the analyzer to see the wallet's raw Brier, skill-Brier (baseline-adjusted), posture percentile, and the rest of the Edge Score pillars. Free, no signup, reads public on-chain data only.

Convexly publishes new methodology research roughly every 6-8 weeks plus the /learn series on a rolling cadence. Get the next paper in your inbox when it ships:

Frequently asked

What is calibration in forecasting?
Calibration is the property of a probability forecaster whose stated probabilities match the long-run frequency of the predicted outcome. A perfectly calibrated forecaster who says 70% across many forecasts is correct on 70% of those forecasts. The Brier score is the canonical scalar measure of how close a forecaster's stated probabilities come to the realized outcomes. Lower Brier score means better calibration.
How is the Brier score calculated?
The Brier score for a sequence of N binary forecasts is the mean squared error between the stated probability and the realized outcome (coded 0 or 1). For a single forecast of probability p on an outcome o, the contribution is (p − o)². A perfect forecast contributes 0; the worst possible forecast contributes 1. The range is [0, 1] for binary outcomes. A naive forecaster who always predicts 0.5 scores 0.25; a forecaster who always predicts the base rate of the event scores roughly the variance of the outcome. The Murphy (1973) decomposition splits the Brier score into reliability (calibration error), resolution (how informative the forecasts are), and uncertainty (irreducible variance of the outcome).
What is a good Brier score?
Context-dependent. On the Good Judgment Project's geopolitical forecasting tournament, the median forecaster scored around 0.20 and the top-2% superforecaster cohort scored around 0.16. On weather forecasting, the National Weather Service's published rain-probability forecasts run around 0.10. Prediction markets are a harder forecasting domain than weather but easier than geopolitics; absolute Brier numbers there depend heavily on the category mix (sports vs politics vs crypto). The more useful comparison for any individual wallet is the baseline-adjusted Brier (skill-Brier), which subtracts the trivial always-predict-the-base-rate baseline so the number is interpretable as 'better or worse than the population-frequency forecaster'.
Why is calibration a weak predictor of profit on Polymarket?
Across the 8,656-wallet Polymarket cohort, the Spearman rank correlation between raw Brier score and realized PnL is only +0.148. Two reasons. First, Polymarket PnL is fat-tailed (Hill tail index = 1.28, below the alpha = 2 threshold at which the variance becomes finite), so a few large concentrated positions dominate realized profit. A trader who is well-calibrated but spreads tiny bets across many markets captures little of the available edge. Second, at the very top of the leaderboard the relationship inverts: in the whale audit of the top-100-by-PnL wallets, the Spearman between Brier and PnL is +0.42, meaning worse-calibrated wallets in the top cohort earn more, because they take riskier positions on lower-probability events that pay off more when they hit.
What is skill-Brier or baseline-adjusted Brier?
Skill-Brier is observed Brier minus the wallet's own marginal-frequency Brier (the Brier a trivial 'always predict the base rate of resolution' forecaster would have scored on the same set of events). Negative means the wallet beats the baseline; positive means the wallet is worse than the baseline. The baseline adjustment makes Brier scores comparable across traders who saw different markets with different base rates. In the Convexly Edge Score composite, the posture pillar is the standardized negation of skill-Brier with a coefficient of +0.79, so a more-positive posture (worse skill-Brier) is associated with higher composite rank on the training cohort. That counterintuitive empirical finding motivated the rename from calibration to posture in the V1 paper.
What is the difference between calibration and resolution?
Calibration is the question: when the forecaster says 70%, does the event happen 70% of the time on average? Resolution is the question: does the forecaster's set of probabilities discriminate between events that did happen and events that did not? A forecaster who always says 50% is perfectly calibrated on a 50% base-rate domain but has zero resolution (no information). A forecaster who is well-calibrated AND has high resolution is what you want. Brier score combines both via the Murphy decomposition: BS = Reliability - Resolution + Uncertainty.
Where can I check my own calibration?
For Polymarket wallets specifically, paste the wallet address at /tools/polymarket-wallet-analyzer. The analyzer reports the wallet's raw Brier score, skill-Brier (baseline-adjusted), and posture percentile against the 8,656-wallet reference cohort. For self-forecasting practice on a fresh question set, the Convexly Forecaster's Test at /try is a 10-question calibration quiz that returns a Brier score on a held-out set of resolved binary events; useful for personal benchmarking but not a complete calibration profile.
Why does Convexly rename calibration to posture?
Because on the Polymarket training cohort, the OLS coefficient on z(-skill_brier) in the composite is +0.79, meaning the composite rewards higher (more-positive) standardized skill-Brier. Higher skill-Brier means worse calibration relative to the marginal-frequency baseline. Labeling a pillar 'calibration' while assigning it a coefficient that rewards worse calibration would have been misleading. Posture is the term Convexly adopted in the V1 paper to describe the directionally-aligned interpretation of the same statistic: posture captures whatever is being measured by the sign-aligned component of skill-Brier in the composite, without overclaiming that the pillar tracks forecasting accuracy in the traditional sense.

Related explainers

  • /learn/edge-score: the composite skill measure that uses skill-Brier as one of its three pillars
  • /learn/conviction: what concentration means, how to read a barbell PnL profile (coming soon)
  • /learn/discipline: why position count predicts profit inversely on Polymarket and oppositely on Manifold (coming soon)
  • /learn/kelly: fractional Kelly under fat tails, why full-Kelly is unsafe when alpha < 2 (coming soon)