April 16, 2026 · 12 min read · Original research
10,000 Polymarket Wallets Scored. Calibration Barely Predicts Profit.
A calibration audit of the full Polymarket profit leaderboard. 8,656 wallets with 5+ resolved positions. 582,921 total positions. A 10x larger sample than the original 100-wallet audit. Here is what changed, what didn't, and what it means.
Five days ago Convexly published a calibration audit of the top 100 Polymarket profit wallets. The finding: worse calibration was moderately correlated with higher profit (Spearman r = +0.42). The strongest criticism was survivorship bias: the study looked at the very top of a fat-tailed distribution and generalized from 100 data points.
The follow-up extends the analysis to the full leaderboard. In this larger sample, the top-100 subset replicates at r = +0.39. But the picture across the rest of the leaderboard is different.
10,000 wallets. 583,000 positions. The full profitable leaderboard.
Using the Polymarket Data API's leaderboard endpoint, the study pulled every wallet on the profit leaderboard from rank 1 ($22M, Theo4) to rank 9,997. The leaderboard reports positive profit at every listed rank (min $11,625). Negative PnL in this analysis comes from an independent position-level computation: fills aggregated into volume-weighted positions, resolved against /events, netted. Of the 9,997 wallets, 8,656 had 5 or more resolved positions with valid Brier scores.
Same methodology as the original audit: volume-weighted average entry prices per (market, side) position, resolved against gamma-api /events, Brier-scored per position, aggregated per wallet. Rank correlations (Spearman, Kendall). Bootstrap CIs. Fat-tail-appropriate throughout.
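The per-position pipeline described above can be sketched roughly as follows. The field names and record shapes here are hypothetical placeholders, not the actual Polymarket API schema, but the volume-weighting and Brier math follow the stated methodology:

```python
# Sketch of the per-wallet Brier pipeline: collapse fills into volume-weighted
# positions, score each position, average per wallet. Field names are invented.
from collections import defaultdict

def brier_per_wallet(fills):
    """fills: dicts with wallet, market, side, price, size, outcome (0 or 1).
    Returns {wallet: mean Brier score across its resolved positions}."""
    agg = defaultdict(lambda: [0.0, 0.0, None])  # key -> [cost, size, outcome]
    for f in fills:
        key = (f["wallet"], f["market"], f["side"])
        agg[key][0] += f["price"] * f["size"]
        agg[key][1] += f["size"]
        agg[key][2] = f["outcome"]
    briers = defaultdict(list)
    for (wallet, _, _), (cost, size, outcome) in agg.items():
        p = cost / size  # volume-weighted entry price as the implied forecast
        briers[wallet].append((p - outcome) ** 2)
    return {w: sum(b) / len(b) for w, b in briers.items()}
```

The key modeling choice, as in the study, is treating the volume-weighted entry price as the trader's revealed probability forecast.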
Calibration weakly predicts profit, except at the top 100.
Before the numbers, a scope note. This dataset is the Polymarket profitable leaderboard: wallets that survived long enough and earned enough to rank. Wallets that blew up are gone. Per Taleb and Peters 2019, ensemble averages on a survivor set do not transfer to the time-average experience of an individual trader. Everything below describes cross-sectional skill differences among survivors, not expected returns for any single trader.
Spearman r = +0.148 across all 8,656 wallets [95% CI: +0.128, +0.169]. Statistically significant (p < 1e-40) but rho-squared is 0.022 (~2.2% of rank variance). The positive direction means worse calibration weakly associates with higher profit. The tier analysis tells the real story:
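The correlation machinery behind these numbers is standard. A minimal sketch of a bootstrapped Spearman CI, assuming numpy and scipy, would look like this (in the study, `x` and `y` would be per-wallet Brier and PnL arrays):

```python
# Percentile-bootstrap CI on a Spearman rank correlation.
# Resample count and seed mirror the post; the data passed in is the caller's.
import numpy as np
from scipy.stats import spearmanr

def spearman_bootstrap_ci(x, y, n_boot=10_000, seed=42, alpha=0.05):
    x, y = np.asarray(x), np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)            # resample wallets with replacement
        stats[i] = spearmanr(x[idx], y[idx])[0]
    point = spearmanr(x, y)[0]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return point, (lo, hi)
```

Rank correlation plus percentile bootstrap avoids any moment assumptions, which matters once the PnL distribution has infinite variance (see the fat-tail diagnostics below).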
One figure worth holding in mind: the top 1% of wallets by |PnL| captures 36.2% of all signed profit across the leaderboard. Roughly 83 wallets out of 8,656 account for a third of everything. Whatever separates them from everyone else, it is not the accuracy of their probability estimates. 85% of ranked traders already beat their own base-rate Brier (a weak benchmark: any wallet that sizes away from coin-flip markets clears it without forecasting skill).
| Tier | n | Med. Brier | Skill+ (beat own base rate) | Spearman r | Med. PnL |
|---|---|---|---|---|---|
| Top 100 | 100 | 0.20 | 63% | +0.39 | $700K |
| 101-250 | 150 | 0.16 | 81% | -0.02 | $271K |
| 251-500 | 250 | 0.13 | 88% | +0.03 | $143K |
| 501-1000 | 500 | 0.13 | 88% | -0.00 | $80K |
| 1001-2500 | 1,500 | 0.14 | 81% | -0.07 | $33K |
| 2501-5000 | 2,500 | 0.16 | 77% | +0.12 | $13K |
| 5001+ | 3,656 | 0.11 | 94% | -0.07 | -$338 |
The only tier where calibration correlates with profit is the top 100 (r = +0.39). Everywhere else, the correlation is essentially zero. Tier cut-points (100, 250, 500, 1000, 2500, 5000) are arbitrary conventions; they were chosen before running the analysis and the qualitative finding is robust to moving them within ~20%.
Run this on a real wallet
Convexly built the tool that produced these numbers. It runs on any Polymarket address in 30 seconds, no signup. See it on Theo4 (rank 1, $22M PnL) first, then try your own.
Low Brier does not cluster with high profit.
A caveat first, because this is the single most attackable frame: Tier 7 (ranks 5001+) sits at the bottom of the profit leaderboard by construction, so negative median PnL is partly definitional. What is not definitional is their calibration quality, which is measured independently of PnL.
Tier 7 has 3,656 wallets. They have the best calibration of any tier: median Brier 0.11, 94% beat their own base rate. 62% have negative computed realized PnL.
The top 100 earn a median $700K with a median Brier of 0.20, the worst calibration of any tier on this leaderboard. The best-calibrated tier has median realized PnL of -$338. What separates the top 100 from the rest is not the accuracy of their probability estimates.
What actually predicts profit
Three factors dominate, all with caveats:
1. Sizing (read: barbell). The worst-calibrated quartile's median PnL ($14,437) is 8.2x the best-calibrated quartile's ($1,766). Driven by the top 100, who cluster in the worst-calibrated quartile because they make a small number of asymmetric bets. This is a Taleb barbell (Antifragile, 2012): most capital in small positions, rare large bets with highly asymmetric payoffs. Empirically, the top 1% of wallets by |PnL| capture 36.2% of total signed profit across the entire leaderboard. The 8.2x figure is best treated as an illustration of barbell concentration at the top, not as a universal sizing law. Outside the top 100, the ratio compresses substantially.
2. Concentration. Two distinct metrics, both large. The median wallet's biggest single event accounted for 66.2% of their PnL. Separately, 69.9% of wallets derived the majority (>50%) of their PnL from a single event. This is not diversification. This is one directional bet that paid off.
3. Position count. The top 100 have a median of 18 resolved positions. Tier 3 has 35. Tier 4 has 31. Tier 7 has 49. The most profitable tier makes fewer, larger bets.
Taleb predicted this. Here is the data.
In 2020, Nassim Taleb published "On the Statistical Differences between Binary Forecasts and Real World Payoffs" (International Journal of Forecasting 36(4)). The core claim: "Being a 'good forecaster' in binary space doesn't lead to having a good actual performance. A binary forecasting record is likely to be a reverse indicator under some classes of distributions." Taleb's argument: Brier and similar scoring rules are thin-tailed summaries of a fat-tailed payoff distribution. Being well-calibrated in probability space says essentially nothing about realized PnL.
This dataset is an empirical confirmation on 8,656 Polymarket wallets. 85.4% of leaderboard-ranked wallets beat their own base-rate Brier, though the own-base-rate benchmark is a weak one by construction: any wallet that concentrates on markets away from 0.5 implied probability clears it without any forecasting skill. The sharper picture comes from the calibration-vs-profit relationship itself, which is a weak Spearman r = +0.148 across the full sample, rho-squared about 0.022, with 62% of the best-calibrated tier holding negative realized PnL. Scope caveat: this applies to leaderboard-ranked wallets only, not to the long tail of Polymarket users who never reach the leaderboard.
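Taleb's point can be made concrete with a toy example: the Brier score never sees stake size, so two traders with identical forecasts score identically while their PnL diverges. All numbers below are invented for illustration:

```python
# Toy illustration: identical forecasts (identical Brier), different sizing,
# very different PnL. Each position is (forecast p, entry price q, outcome o)
# for the side held; payoff per $1 of stake is (o - q) / q-style linear in o - q.
positions = [
    (0.90, 0.70, 1),   # hit
    (0.85, 0.70, 1),   # hit
    (0.70, 0.60, 0),   # miss
]

def brier(ps):
    """Mean squared forecast error -- completely stake-independent."""
    return sum((p - o) ** 2 for p, _, o in ps) / len(ps)

def pnl(ps, stakes):
    """Per-contract payoff (o - q), scaled by each position's stake."""
    return sum(s * (o - q) for (_, q, o), s in zip(ps, stakes))

flat = pnl(positions, [100, 100, 100])    # equal sizing: the hits and the miss cancel
barbell = pnl(positions, [500, 500, 10])  # concentrated sizing on the same forecasts
# brier(positions) is identical under both sizings; only PnL differs.
```

Under equal sizing this trader breaks even; under concentrated sizing, the same forecasts and the same Brier score produce a large profit. That is the sense in which a binary forecasting record can be uninformative, or even a reverse indicator, about payoffs.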
Skill doesn't map to profit the way intuition suggests.
The prediction market profit function is not profit = f(calibration). At the full-sample level it is closer to profit = f(sizing, concentration, event_selection) + noise. Calibration matters, but among ranked wallets it is not what separates the profitable from the rest.
What this means for traders
If your wallet is on the leaderboard and you're not in the top 100, calibration is probably not your bottleneck. 85% of ranked wallets are already well-calibrated. Your bottleneck is more likely one of:
Sizing. The top earners in this dataset are distinguished by larger, more concentrated positions, not by better calibration. Whether that reflects Kelly-optimal sizing or simply more capital is a separate question the data here doesn't settle. Looking at your actual position sizing pattern against the implied-odds reference is a starting point, not a prescription.
Category selection. You might be well-calibrated on politics but badly calibrated on crypto. Without category-level Brier tracking, you don't know where your edge actually is.
Concentration. The top earners concentrate. You might be diversifying away your edge by spreading across too many markets.
So what
Three working hypotheses follow from this data. Each is a lever an individual trader can actually pull, and each is currently invisible in the default Polymarket / Kalshi interface.
1. Calibrated traders are polishing the wrong dial. 85% of ranked wallets are already well-calibrated, and 62% of the best-calibrated tier is losing money. Spending another month on calibration drills does not move PnL in this dataset. Knowing which category your calibration is real in, and which it is not, does. A wallet that is +0.08 on politics and -0.05 on crypto should not be sizing the two the same way, and most traders cannot see that split without a tool that surfaces it.
2. Concentration is a feature, not a bug, but nobody can see their own. The median wallet derives 66.2% of PnL from a single event; 69.9% of wallets are above 50% on one event. The top 100 are even more concentrated, and concentration loads on PnL more than any other single factor measured in this study. In practice this means two things: first, most traders are already taking barbell risk, whether they know it or not; second, none of them can watch their concentration drift in real time. A dashboard that shows "67% of your PnL is currently riding on one event" changes decisions.
3. The top 1% is the market-maker class. The other 99% need a different playbook. If 83 wallets capture 36% of signed profit, and the next 1,000 capture most of the remainder, that is not a distribution where copying whale trades gets retail users into the black. It is a distribution where retail users need to know where their own edge is and size to it. The whale-copying tools (Hashdive, Polysights, PolyWallet) optimize for the wrong layer of the market. What is missing is a calibration and sizing diagnostic for the 99%.
Run it on a real wallet
Convexly built the tool that produced every number in this post. It runs on any Polymarket address in 30 seconds and shows category-level Brier, position sizing diagnostics, and concentration share. Free. No signup.
Theo4 is the rank-1 all-time Polymarket profit wallet ($22M). Seeing a real profile first means you know what you're getting before pasting your own address.
Why this research is public
Prediction markets work better when their traders understand what they are doing. A venue where 62% of well-calibrated traders lose money and most of them quit within a year is not a mature market; it is a funnel. A venue where those same traders can see their per-category edge, their concentration drift, and their sizing pattern in real time retains them longer and gets more resolved volume. Calibration and sizing transparency is venue-positive.
Polymarket and Kalshi both run builder programs for third-party tools that increase trader sophistication. This analysis, the free wallet analyzer, and the roadmap above are Convexly's contribution. Data from the Polymarket Data API, methodology open, numbers dual-verified, all of it public. If you run a prediction-market venue and want to talk about integration, reach out via the footer link.
Methodology
Rank-based statistics and bootstrap throughout. Every number in this post traces to a canonical fact sheet (docs/verified/fact-sheet-10k-2026-04-16.md) that was computed twice independently (pandas and csv+manual) before being published.
On prior work
A reasonable reader may raise Walsh & Joshi (2023, arXiv:2303.06021), which argues calibration-based model selection delivers +34.69% ROI versus -35.17% for accuracy-based selection in sports betting. That result concerns model-level selection criteria, not trader-level realized PnL on a prediction market. The dependent variables and the decision objects are different. Their finding is compatible with this one: choosing a better-calibrated probability model is not the same claim as being a better-calibrated trader who then takes profitable positions.
Fat-tail diagnostics
Polymarket realized PnL is fat-tailed. The study measured it. Hill estimator on the positive tail gives α ≈ 1.26 [95% CI 1.17, 1.35]; negative tail α ≈ 1.43 [1.25, 1.61]. (Tail estimates use the 9,727-wallet PnL-nonnull sample, which does not require the 5-position Brier filter applied to the 8,656-wallet headline sample. The positive-tail α is robust across both.) Both lie in the (1, 2] regime: finite mean, infinite variance. Pearson correlation, OLS R², and CLT-based confidence intervals are formally invalid on this distribution (see Taleb, Statistical Consequences of Fat Tails (2020), Chapter 3). Spearman rank is the appropriate statistic and is what the study reports throughout.
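The Hill estimator itself is only a few lines. A minimal sketch of the standard estimator (the choice of k, the number of tail order statistics to use, is the analyst's judgment call and is what the CI above conditions on):

```python
# Standard Hill tail-index estimator on the top-k order statistics.
import math

def hill_alpha(values, k):
    """alpha_hat = k / sum_{i<=k} ln(x_(i) / x_(k+1)), where x_(1) >= x_(2) >= ...
    Only strictly positive values enter the positive-tail estimate."""
    xs = sorted((v for v in values if v > 0), reverse=True)
    if k + 1 > len(xs):
        raise ValueError("k too large for sample")
    tail, threshold = xs[:k], xs[k]
    return k / sum(math.log(x / threshold) for x in tail)
```

On an exact Pareto sample this is the maximum-likelihood estimator of α; on real PnL data the estimate is read off a stable region of the Hill plot across k, which is what the Zipf-plot diagnostic in the saved report supports.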
Under α ≈ 1.26, the effective tail sample size is approximately N^(α/2) ≈ 370, not 8,656. Bulk metrics (medians, percentages, Spearman) are tight; tail magnitudes are governed by ~80 wallets. The top 1% of |PnL| captures 36.2% of total signed PnL across the leaderboard. The Gabaix-Ibragimov (2011) cross-check gives α+ = 1.28, α- = 1.56, consistent with the Hill estimates. Zipf plot and full diagnostic report are saved at services/api/scripts/output/fat_tail_report.txt.
One consequence: Kelly sizing under α < 2 with uncertain edge parameters is a ruin machine. Fractional (half) Kelly with tail-shaving is the appropriate response. See MacLean, Thorp & Ziemba (2011), The Kelly Capital Growth Investment Criterion: Theory and Practice (World Scientific) for the practitioner case that full Kelly is too aggressive even with known edge, and Peters (2019) on the divergence between ensemble-average and time-average growth under fat tails.
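A minimal sketch of the half-Kelly-with-cap idea for a binary contract, where p is the trader's believed probability and q the market price. The Kelly formula is the standard one for binary payoffs; the 10% cap is an illustrative tail-shave, not a figure from the study:

```python
# Half-Kelly sizing sketch for buying a binary contract at price q with
# believed win probability p. Full Kelly for net odds b = (1-q)/q reduces
# algebraically to f* = (p - q) / (1 - q). The cap value is illustrative.
def half_kelly(p, q, cap=0.10):
    """Return the bankroll fraction to stake: half of full Kelly, capped,
    and floored at zero when there is no edge (p <= q)."""
    if not (0 < q < 1):
        raise ValueError("price must be in (0, 1)")
    f_full = (p - q) / (1 - q)
    return max(0.0, min(cap, f_full / 2))
```

The halving addresses the MacLean-Thorp-Ziemba point that full Kelly overbets even with a known edge; the cap is a crude guard against the uncertain-edge, infinite-variance regime the tail estimates above imply.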
Meta-bootstrap (40 outer × 500 inner resamples) on the Brier-PnL Spearman confirms the 95% CI is stable (width = 0.038, std of width = 0.0016). Bootstrap is well-behaved for Spearman on this sample.
Measurement choices
- Data source: Polymarket Data API leaderboard endpoint + trade history + gamma-api /events for resolution.
- Sample: 9,997 wallets from the profit leaderboard (full population of the leaderboard, not a sample). 8,656 with 5+ resolved positions. 582,921 total resolved positions.
- Position aggregation: Volume-weighted average price per (conditionId, outcome_index). Fills collapsed into positions. Closed-before-resolution positions excluded.
- Brier score: (p - o)^2 per position, averaged per wallet. Base-rate skill computed per wallet against the wallet's own marginal frequency, not a global prior.
- Category tagging: Keyword classifier on event titles and tags. Categories: politics, sports, crypto, economics, entertainment, science, other. Classifier is deterministic but not independently validated; treat category analysis as exploratory.
- Correlations: Spearman rank (distribution-free). Bootstrap 95% CI on 10,000 resamples, seed = 42. Dual-verified by two independent internal computations (pandas and csv+manual). This is internal dual-verification, not external peer review.
- Fat-tail verification: Hill alpha on PnL distribution (fat-tailed as expected for leaderboard data).
- Leaderboard scope: Reflects wallets with enough profitable activity to rank in Polymarket's public leaderboard. Wallets that stopped trading, were deactivated, or never reached the leaderboard threshold are not captured. Survivor effects are present.
- Reproducibility: Raw CSV outputs available on request (email at the footer link).
Published by Convexly. Analysis computed using the Convexly statistical engine. Convexly (convexly.app) is a calibration and sizing intelligence tool for prediction market traders. The methodology in this analysis is the same methodology used in the product. Every number here was independently verified against a canonical fact sheet before being quoted publicly. Data compiled with Anthropic's Claude. Statistical analysis and writing are Convexly Research's own.