How to Measure Forecasting Calibration

Forecasting calibration is measured with three standard tools: the Brier score, reliability diagrams binned by stated probability, and the Murphy (1973) decomposition into reliability, resolution, and uncertainty. A well-calibrated forecaster assigns probabilities that match outcome frequencies, and a skilled one states probabilities that discriminate between outcomes. The reliability term captures the first property, resolution captures the second, and the Brier score summarizes both in a single number.

1. Brier score

For N binary predictions with probabilities p_i and outcomes o_i (0 or 1):

BS = (1 / N) · Σ (p_i − o_i)²

The score ranges from 0 (perfect) to 1 (full confidence in the wrong outcome, every time). Always predicting 0.5 returns a Brier of exactly 0.25 whatever the outcomes, which makes 0.25 the no-information benchmark. Good Judgment Project superforecasters average near 0.10 on broad geopolitical questions. Convexly's 8,656-wallet Polymarket cohort has a median Brier of ~0.188, with the top quartile below 0.105.

Brier is a proper scoring rule: if your true subjective probability is p and you state q, your expected score p(1 − q)² + (1 − p)q² is minimized only at q = p. Any dishonest shading makes Brier worse in expectation.
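To make the arithmetic concrete, here is a minimal Python sketch of the formula above; brier_score is an illustrative name, not part of any Convexly tool:

import numpy as np

def brier_score(probs, outcomes):
    """Mean squared distance between stated probabilities and 0/1 outcomes."""
    p = np.asarray(probs, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - o) ** 2))

# Five forecasts: stated probability vs. what actually happened.
print(brier_score([0.9, 0.7, 0.5, 0.2, 0.8], [1, 1, 0, 0, 1]))  # 0.086

Hand-checking the first pair: (0.9 − 1)² = 0.01; averaging all five squared errors gives 0.086.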

2. Reliability diagram

Bin forecasts by stated probability (e.g., 10 equal-width bins from 0 to 1). For each bin, plot the observed win rate (y-axis) against the midpoint stated probability (x-axis). A perfectly calibrated forecaster produces points on the 45-degree identity line. Points above the line indicate underconfidence; points below indicate overconfidence.

Most retail traders are overconfident in their 80-90% bucket: they state 85% and win 70%. The reliability diagram shows you exactly where your calibration fails.
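If you want to build the diagram yourself, here is a minimal Python sketch of the binning step, using the equal-width-bin convention described above; the example data is made up for illustration:

import numpy as np

def reliability_bins(probs, outcomes, n_bins=10):
    """Return (bin midpoint, observed win rate, count) for each non-empty bin."""
    p = np.asarray(probs, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    # Equal-width bins; the minimum() clip keeps p == 1.0 in the top bin.
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append(((b + 0.5) / n_bins, float(o[mask].mean()), int(mask.sum())))
    return rows

# Four confident forecasts that win only half the time: classic overconfidence.
probs    = [0.85, 0.88, 0.82, 0.87, 0.15, 0.12, 0.55, 0.62]
outcomes = [1,    0,    1,    0,    0,    1,    1,    0]
for mid, win_rate, n in reliability_bins(probs, outcomes):
    gap = win_rate - mid  # > 0: above the identity line (underconfidence)
    print(f"bin {mid:.2f}: win rate {win_rate:.2f} over {n}, gap {gap:+.2f}")

Plot win_rate against mid and you have the diagram; the printed gap column already tells you which buckets are off.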

3. Murphy decomposition

Brier decomposes additively (Murphy 1973):

BS = reliability − resolution + uncertainty

  • Reliability (lower is better): the bin-count-weighted mean squared distance between each bin's mean stated probability and its observed frequency
  • Resolution (higher is better): the bin-count-weighted mean squared distance between each bin's observed frequency and the overall base rate; it rewards forecasts that discriminate
  • Uncertainty: the variance of the outcomes, base rate × (1 − base rate); fixed by the data, not the forecaster

Two forecasters with identical Brier can have very different decompositions. Forecaster A might have low reliability (well-calibrated on each bin) but low resolution (all forecasts cluster near the base rate). Forecaster B might have higher reliability error but more resolution. The decomposition distinguishes them.
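Here is a minimal Python sketch of the decomposition under the same equal-width binning as above. One caveat: with continuous forecasts the identity holds only up to a small within-bin variance term; it is exact when all forecasts in a bin are identical, as in this example:

import numpy as np

def murphy_decomposition(probs, outcomes, n_bins=10):
    """Split the Brier score into reliability, resolution, and uncertainty."""
    p = np.asarray(probs, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    n = len(p)
    base_rate = o.mean()
    uncertainty = base_rate * (1 - base_rate)  # fixed by the data
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            w = mask.sum() / n       # share of forecasts in this bin
            p_bar = p[mask].mean()   # mean stated probability in bin
            o_bar = o[mask].mean()   # observed frequency in bin
            reliability += w * (p_bar - o_bar) ** 2
            resolution  += w * (o_bar - base_rate) ** 2
    return reliability, resolution, uncertainty

rel, res, unc = murphy_decomposition([0.9, 0.7, 0.5, 0.2, 0.8], [1, 1, 0, 0, 1])
print(rel - res + unc)  # 0.086, matching the Brier score computed earlier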

Calibration alone is not the full skill signal

Convexly's 10,000-wallet Polymarket study showed that the Brier score alone correlates at only +0.148 (Spearman) with realized PnL across the leaderboard. A well-calibrated forecaster who never concentrates position size on high-edge bets does not capture fat-tail upside. Convexly's Edge Score composite adds two pillars, conviction and discipline, to calibration and reaches an out-of-fold (OOF) Spearman of +0.514, with a Fama-French bootstrap null rejected at p < 0.0001.

Calibration is necessary, not sufficient. Measuring it cleanly still matters.

Tools

  • Calibration Challenge: 10-question quiz with a reliability diagram of your results. Free, no signup. 2 minutes.
  • Brier Score Calculator: enter your own probability-outcome pairs, get Brier + percentile rank against 8,656 Polymarket wallets.
  • Polymarket Wallet Analyzer: full Brier + reliability diagram + Edge Score for any Polymarket wallet address.

Take the calibration quiz

Free, 10 questions, 2 minutes. Reliability diagram of your results.

Start the quiz

Related questions

What is a Brier score for prediction markets?

How do you distinguish skill from luck in trading PnL?

Is there a free Polymarket calibration tool?