How to Measure Forecasting Calibration

Forecasting calibration is measured with three standard tools: the Brier score, reliability diagrams binned by stated probability, and the Murphy (1973) decomposition into reliability, resolution, and uncertainty. A well-calibrated forecaster assigns probabilities that match outcome frequencies, and a skilled one states probabilities that discriminate between outcomes. The reliability term captures the first property, resolution captures the second, and the Brier score summarizes both in a single number.

1. Brier score

For N binary predictions with probabilities p_i and outcomes o_i (0 or 1):

BS = (1 / N) · Σ (p_i − o_i)²

The score ranges from 0 (perfect) to 1 (full confidence in the wrong outcome, every time). Always predicting 0.5 returns a Brier of exactly 0.25 whatever the outcomes, which makes 0.25 the no-information benchmark. Good Judgment Project superforecasters average near 0.10 on broad geopolitical questions. Convexly's 8,656-wallet Polymarket cohort has a median Brier of ~0.188, with the top quartile below 0.105.

Brier is a proper scoring rule: if your true subjective probability is p and you state q, your expected score p(1 − q)² + (1 − p)q² is minimized only at q = p. Any dishonest shading makes Brier worse in expectation.
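To make the arithmetic concrete, here is a minimal Python sketch of the formula above; brier_score is an illustrative name, not part of any Convexly tool:

import numpy as np

def brier_score(probs, outcomes):
    """Mean squared distance between stated probabilities and 0/1 outcomes."""
    p = np.asarray(probs, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - o) ** 2))

# Five forecasts: stated probability vs. what actually happened.
print(brier_score([0.9, 0.7, 0.5, 0.2, 0.8], [1, 1, 0, 0, 1]))  # 0.086

Hand-checking the first pair: (0.9 − 1)² = 0.01; averaging all five squared errors gives 0.086.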

2. Reliability diagram

Bin forecasts by stated probability (e.g., 10 equal-width bins from 0 to 1). For each bin, plot the observed win rate (y-axis) against the midpoint stated probability (x-axis). A perfectly calibrated forecaster produces points on the 45-degree identity line. Points above the line indicate underconfidence; points below indicate overconfidence.

Most retail traders are overconfident in their 80-90% bucket: they state 85% and win 70%. The reliability diagram shows you exactly where your calibration fails.
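If you want to build the diagram yourself, here is a minimal Python sketch of the binning step, using the equal-width-bin convention described above; the example data is made up for illustration:

import numpy as np

def reliability_bins(probs, outcomes, n_bins=10):
    """Return (bin midpoint, observed win rate, count) for each non-empty bin."""
    p = np.asarray(probs, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    # Equal-width bins; the minimum() clip keeps p == 1.0 in the top bin.
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append(((b + 0.5) / n_bins, float(o[mask].mean()), int(mask.sum())))
    return rows

# Four confident forecasts that win only half the time: classic overconfidence.
probs    = [0.85, 0.88, 0.82, 0.87, 0.15, 0.12, 0.55, 0.62]
outcomes = [1,    0,    1,    0,    0,    1,    1,    0]
for mid, win_rate, n in reliability_bins(probs, outcomes):
    gap = win_rate - mid  # > 0: above the identity line (underconfidence)
    print(f"bin {mid:.2f}: win rate {win_rate:.2f} over {n}, gap {gap:+.2f}")

Plot win_rate against mid and you have the diagram; the printed gap column already tells you which buckets are off.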

3. Murphy decomposition

Brier decomposes additively (Murphy 1973):

BS = reliability − resolution + uncertainty

  • Reliability (lower is better): the bin-count-weighted mean squared distance between each bin's mean stated probability and its observed frequency
  • Resolution (higher is better): the bin-count-weighted mean squared distance between each bin's observed frequency and the overall base rate; it rewards forecasts that discriminate
  • Uncertainty: the variance of the outcomes, base rate × (1 − base rate); fixed by the data, not the forecaster

Two forecasters with identical Brier can have very different decompositions. Forecaster A might have low reliability (well-calibrated on each bin) but low resolution (all forecasts cluster near the base rate). Forecaster B might have higher reliability error but more resolution. The decomposition distinguishes them.
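Here is a minimal Python sketch of the decomposition under the same equal-width binning as above. One caveat: with continuous forecasts the identity holds only up to a small within-bin variance term; it is exact when all forecasts in a bin are identical, as in this example:

import numpy as np

def murphy_decomposition(probs, outcomes, n_bins=10):
    """Split the Brier score into reliability, resolution, and uncertainty."""
    p = np.asarray(probs, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    n = len(p)
    base_rate = o.mean()
    uncertainty = base_rate * (1 - base_rate)  # fixed by the data
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            w = mask.sum() / n       # share of forecasts in this bin
            p_bar = p[mask].mean()   # mean stated probability in bin
            o_bar = o[mask].mean()   # observed frequency in bin
            reliability += w * (p_bar - o_bar) ** 2
            resolution  += w * (o_bar - base_rate) ** 2
    return reliability, resolution, uncertainty

rel, res, unc = murphy_decomposition([0.9, 0.7, 0.5, 0.2, 0.8], [1, 1, 0, 0, 1])
print(rel - res + unc)  # 0.086, matching the Brier score computed earlier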

Calibration alone is not the full skill signal

Convexly's 10,000-wallet Polymarket study showed that the Brier score alone correlates at only +0.148 (Spearman) with realized PnL across the leaderboard. A well-calibrated forecaster who never concentrates position size on high-edge bets does not capture fat-tail upside. Convexly's Edge Score composite adds two pillars, conviction and discipline, to calibration and reaches an out-of-fold (OOF) Spearman of +0.514, with a Fama-French bootstrap null rejected at p < 0.0001.

Calibration is necessary, not sufficient. Measuring it cleanly still matters.

Tools

  • Calibration Challenge: 10-question quiz with a reliability diagram of your results. Free, no signup. 2 minutes.
  • Brier Score Calculator: enter your own probability-outcome pairs, get Brier + percentile rank against 8,656 Polymarket wallets.
  • Polymarket Wallet Analyzer: full Brier + reliability diagram + Edge Score for any Polymarket wallet address.

Take the calibration quiz

Free, 10 questions, 2 minutes. Reliability diagram of your results.

Start the quiz

Related questions

What is a Brier score for prediction markets?

How do you distinguish skill from luck in trading PnL?

Is there a free Polymarket calibration tool?