March 7, 2026 · 6 min read

The Brier Score: Necessary but Not Sufficient on Prediction Markets

The Brier score is the right metric for measuring whether a trader's probability estimates are calibrated. It is the wrong metric for deciding whether they are winning. On Polymarket the distinction is not academic. It decides whether a self-assessment routine pays rent.

The definition

Glenn Brier introduced the score in 1950 (Brier, "Verification of forecasts expressed in terms of probability," Monthly Weather Review). It is the mean squared error between a probability forecast and the binary outcome that resolves it. Lower is better; 0 is a perfect forecast, 1 is the worst possible. Formally, for a single forecast:

BS = (prediction − outcome)²

For a set of N resolved forecasts, average the per-forecast scores:

BS = (1/N) × Σ (prediction_i − outcome_i)²

Worked example. Three Polymarket positions the trader took a directional view on:

1. Trader posts 90% on a favored incumbent. Incumbent wins. Score: (0.90 − 1)² = 0.01
2. Trader posts 60% on a policy bill passing. It fails. Score: (0.60 − 0)² = 0.36
3. Trader posts 30% on a longshot candidate winning a primary. They do not. Score: (0.30 − 0)² = 0.09

Mean Brier across the three = (0.01 + 0.36 + 0.09) / 3 ≈ 0.153. That number is roughly where Tetlock's superforecasters landed across the Good Judgment Project (Tetlock and Gardner, 2015, Superforecasting).
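The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the function name `brier_score` and the list-of-pairs input shape are illustrative choices, not part of any Convexly tooling.

```python
def brier_score(forecasts):
    """Mean squared error between probability forecasts and binary outcomes.

    `forecasts` is a list of (prediction, outcome) pairs, where prediction is
    a probability in [0, 1] and outcome is 1 if the event happened, else 0.
    """
    if not forecasts:
        raise ValueError("need at least one resolved forecast")
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)


# The three positions from the worked example.
positions = [
    (0.90, 1),  # incumbent wins
    (0.60, 0),  # policy bill fails
    (0.30, 0),  # longshot loses the primary
]

print(round(brier_score(positions), 3))  # 0.153
```

Note how the single miss at 60% contributes more than three quarters of the total: squared error punishes confident misses far more than it rewards confident hits.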

Benchmarks that matter

0.00: Perfect forecast. Never observed outside degenerate cases.

0.25: The 50/50 baseline. Equivalent to assigning 0.5 to every binary question.

1.00: Full confidence in the wrong outcome every time.

A Polymarket trader posting 0.23 on the Brier score is doing marginally better than the coin-flip baseline. That is enough to avoid the worst sizing errors. It is not a profit signal.
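One way to make "marginally better" concrete is the Brier skill score, which rescales a score against a reference forecast: BSS = 1 − BS / BS_ref. The sketch below assumes the always-50% baseline (0.25) as the reference; that choice, and the function name `brier_skill_score`, are illustrative assumptions rather than anything taken from the methodology paper.

```python
def brier_skill_score(bs, bs_reference=0.25):
    """Fractional improvement over a reference forecast.

    1.0 is perfect, 0.0 matches the reference, negative is worse than it.
    The 0.25 default is the always-say-50% baseline, an assumption made
    here for illustration; other reference forecasts are possible.
    """
    if bs_reference <= 0:
        raise ValueError("reference score must be positive")
    return 1.0 - bs / bs_reference


# Always forecasting 0.5 scores 0.25 whichever way each question resolves.
always_half = [(0.5 - outcome) ** 2 for outcome in (1, 0, 0, 1)]
print(sum(always_half) / len(always_half))   # 0.25

# A trader posting 0.23 improves on that baseline by about 8 percent.
print(round(brier_skill_score(0.23), 2))     # 0.08
```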

The uncomfortable finding

Convexly scored 8,656 ranked Polymarket wallets across 582,921 resolved positions and published the result in the V1 methodology paper. Spearman rank correlation between per-wallet Brier score and realized signed log PnL is +0.148 across the full sample (Edge Score Methodology V1). Squared, that is roughly 2 percent of profit-rank variance explained by calibration quality.

The direction of the effect holds. Better-calibrated wallets tend toward higher profit rank. The magnitude is what matters. An exercise that tells a trader roughly 2 percent of what they need to know is a diagnostic, not a strategy.
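For readers who want to run the same kind of check on their own data, here is a minimal pure-Python sketch of Spearman rank correlation (the Pearson correlation of the ranks). The helper names and the six-wallet toy data are hypothetical; this is not Convexly's pipeline or dataset.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy data: a calibration-quality measure (higher = better
# calibrated) and signed log PnL for six wallets.
calibration = [0.9, 0.4, 0.7, 0.2, 0.6, 0.8]
log_pnl = [1.2, 0.3, -0.5, 0.1, 2.0, 0.4]
rho = spearman(calibration, log_pnl)
print(round(rho, 3))

# Squaring the reported coefficient gives the share of rank variance explained.
print(round(0.148 ** 2, 4))  # 0.0219
```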

The finding is sharper at the top of the book. Among the top 100 wallets by absolute PnL, the correlation flips negative. The worst-calibrated quartile of whales earns roughly 2.02 times the median profit of the best-calibrated quartile (top-100 whale audit). The traders running the leaderboard are not the most careful forecasters. They are the most concentrated on a small number of asymmetric events.

What Brier is still good for

Brier is a survival metric. A trader in the middle of the distribution (around 0.20 to 0.22) is calibrated well enough that Kelly-family sizing (covered in the Kelly post) will not amplify a systematic overconfidence bias into a wipe-out. That is valuable. It is also something to measure once, not weekly.
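To see why calibration matters for sizing, consider the textbook Kelly fraction for buying a binary YES contract at price p with believed win probability q: f = (q − p) / (1 − p). The sketch below assumes a contract that pays 1 on YES and 0 on NO, with no fees; the function name and the example numbers are illustrative, and the Kelly post may use a different variant.

```python
def kelly_fraction(believed_prob, market_price):
    """Kelly stake, as a fraction of bankroll, for buying YES at market_price.

    Assumes a binary contract that pays 1 if YES and 0 if NO, no fees, and
    that believed_prob is the trader's true win probability.
    """
    if not 0 < market_price < 1:
        raise ValueError("market_price must be strictly between 0 and 1")
    edge = believed_prob - market_price
    # No edge (or negative edge) means no position on the YES side.
    return max(0.0, edge / (1 - market_price))


price = 0.60
print(round(kelly_fraction(0.70, price), 3))  # 0.25  (estimate of 70%)
print(round(kelly_fraction(0.85, price), 3))  # 0.625 (estimate of 85%)
```

If the true probability is 0.70 but the trader believes 0.85, the formula recommends staking 62.5 percent of bankroll instead of 25 percent. The sizing rule takes the probability estimate at face value, which is how an overconfidence bias turns into oversized positions.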

Check Brier to confirm the trader is not in the bottom quartile of the calibration distribution. Then stop tuning that dial. The Convexly operator takeaway post makes the case directly: weekly calibration drills do not move the profit needle, and the hours are better spent identifying the one or two market categories where the trader has structural edge (the operator's takeaway).

Where Brier does still dominate

Outside prediction markets, Brier remains the right primary metric. The National Weather Service still reports Brier skill scores on precipitation forecasts (NOAA Verification Program documentation). IARPA's Good Judgment Project used it as the headline metric because the question was forecast quality in isolation, not financial return.

The difference is the scoring function Polymarket uses to pay traders. Prediction markets do not pay for calibration; they pay for correctly concentrated sizing on resolved events. Brier answers a different question than the market does, so a good (low) Brier score alone does not produce profit.
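A toy illustration of the gap between the two scoring functions: two traders hold identical forecasts, and therefore identical Brier scores, but size their positions differently. The book, prices, and stakes below are hypothetical, and fees are ignored.

```python
def position_pnl(stake, price, outcome):
    """PnL from spending `stake` on YES shares at `price` (fees ignored).

    Each share costs `price` and pays 1 if the market resolves YES.
    """
    shares = stake / price
    return shares * outcome - stake


# Hypothetical book: (forecast, YES price paid, outcome). Both traders
# share these forecasts, so their Brier scores are the same.
book = [
    (0.70, 0.60, 1),
    (0.65, 0.55, 0),
    (0.40, 0.20, 1),
]
brier = sum((f - o) ** 2 for f, _, o in book) / len(book)

flat_stakes = [100, 100, 100]         # equal size on every market
concentrated_stakes = [50, 50, 200]   # same total, weighted to the longshot

flat = sum(position_pnl(s, p, o) for s, (_, p, o) in zip(flat_stakes, book))
conc = sum(position_pnl(s, p, o) for s, (_, p, o) in zip(concentrated_stakes, book))

print(round(brier, 3))  # 0.291
print(round(flat, 2))   # 366.67
print(round(conc, 2))   # 783.33
```

The Brier score never sees the stakes column. The PnL depends on it almost entirely.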

See how your wallet scores.

The wallet analyzer reports posture (calibration), conviction (concentration), and discipline percentiles against 8,656 benchmarked Polymarket wallets, then breaks down how each contributed to realized PnL. Free, no signup.

Sources. Brier (1950), "Verification of forecasts expressed in terms of probability," Monthly Weather Review. Tetlock and Gardner (2015), Superforecasting. Convexly (2026), Edge Score Methodology V1 and the top-100 whale audit.