Edge Score Methodology V1
A composite skill measure for prediction-market traders. Published 2026-04-18. Byline: Convexly Research.
Abstract
We report a composite scoring layer for prediction-market traders fit on a frozen cohort of 8,656 Polymarket wallets with at least five resolved positions. The score, Edge Score V3b, combines three standardized predictors: a posture term derived from baseline-adjusted Brier score, a conviction term derived from PnL concentration in the wallet's single largest event, and a discipline term derived from resolved position count. Under a 5-fold cross-validation with fold-local coefficient refit, the composite achieves an out-of-fold Spearman rank correlation of +0.514 with signed log PnL, against +0.147 for a Brier-only baseline.
A Fama-French 2010 bootstrap null with 10,000 PnL permutations places the observed Spearman outside every permuted sample, one-sided p < 0.0001. Subgroup stability holds on six cross-sections. Hill alpha on realized PnL is 1.28 (95% CI 1.20 to 1.36), so the composite ranks median outcomes rather than expected returns. Two pre-registered experiments requiring per-position outcome data are deferred to V1.5.
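The cross-validation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's validation script. It assumes an ordinary least-squares refit with an intercept, standardization constants recomputed on each training fold, and a single Spearman computed over the pooled out-of-fold predictions. The function names and the synthetic data are hypothetical.

```python
import random
from statistics import mean, pstdev

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            out[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    return pearson(ranks(a), ranks(b))

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_ols(X, y):
    """Least squares with intercept via the normal equations (assumed model)."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    XtX = [[sum(r[i] * r[j] for r in Z) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(Z, y)) for i in range(p)]
    return solve(XtX, Xty)

def oof_spearman(X, y, k=5, seed=42):
    """K-fold CV: refit scaling and coefficients on each training fold only."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    preds = [0.0] * len(y)
    for held in folds:
        held_set = set(held)
        train = [i for i in idx if i not in held_set]
        cols = list(zip(*(X[i] for i in train)))
        mu = [mean(c) for c in cols]
        sd = [pstdev(c) for c in cols]

        def scale(row):
            return [(v - m) / s for v, m, s in zip(row, mu, sd)]

        beta = fit_ols([scale(X[i]) for i in train], [y[i] for i in train])
        for i in held:
            preds[i] = beta[0] + sum(b * v for b, v in zip(beta[1:], scale(X[i])))
    return spearman(preds, y)

# Synthetic stand-in for three predictors and a signed-log-PnL target.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(300)]
y = [r[0] + 2 * r[1] - r[2] + rng.gauss(0, 1) for r in X]
rho = oof_spearman(X, y)
print(round(rho, 3))
```

Refitting inside each fold keeps the held-out wallets from influencing either the standardization constants or the coefficients, which is what makes the resulting Spearman out-of-fold.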
Validation summary (five of seven)
| # | Experiment | Result |
|---|---|---|
| E1 | 5-fold CV (V3b), fold-local refit | Spearman +0.514 |
| E2 | Per-wallet temporal holdout | Deferred to V1.5 (data) |
| E3 | Subgroup stability (6 cuts) | Range +0.468 to +0.726 |
| E4 | V0-V5 formula sensitivity | V3b kept (market-selection grounds) |
| E5 | Fat-tail Hill α | α = 1.28 (CI 1.20, 1.36) |
| E6 | Bootstrap null, 10,000 permutations | p < 0.0001 |
| E7 | IC temporal stability | Deferred to V1.5 (data) |
E2 and E7 require per-position outcome data that is not present in the current Polymarket positions extraction. A one-time data pipeline extension, or a cross-venue replication on a cohort where outcomes are accessible (e.g. Manifold), would unblock both experiments. Both are scheduled for V1.5 of the paper.
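For E5 in the table above, a Hill estimate of the tail index can be sketched as below. This is an illustration under stated assumptions: the estimator is applied to absolute values, the number of tail observations `k` is chosen by hand, and the standard error uses the usual asymptotic approximation of alpha divided by the square root of `k`. The paper's actual choice of tail, threshold, and interval method is not stated here, so `hill_alpha` and the synthetic Pareto sample are hypothetical.

```python
import math
import random

def hill_alpha(values, k):
    """Hill estimator of the tail index from the k largest absolute values.

    Returns (alpha_hat, asymptotic_standard_error).
    """
    xs = sorted((abs(v) for v in values if v != 0), reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < number of nonzero values")
    threshold = xs[k]  # the (k+1)-th largest magnitude
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    alpha_hat = 1.0 / mean_log_excess
    return alpha_hat, alpha_hat / math.sqrt(k)

# Synthetic check: draw from a Pareto law with a known shape parameter.
rng = random.Random(42)
sample = [rng.paretovariate(1.28) for _ in range(20_000)]
alpha_hat, std_err = hill_alpha(sample, k=1_000)
print(round(alpha_hat, 2), round(std_err, 3))
```

An alpha below 2 implies infinite theoretical variance, and an alpha close to 1 means the sample mean converges very slowly, which is why the composite is read as ranking median outcomes rather than expected returns.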
The three pillars
Posture: z(-skill_brier) · 0.7876. Rewards wallets whose profit is not driven by precise calibration. On the Polymarket leaderboard, the top-100 wallets by profit are in the worst Brier quartile and account for the majority of realized PnL. The pillar does not measure forecasting skill; it measures whether the trader makes money while calibration is imprecise.
Conviction: z(log concentration) · 2.7220. Concentration is the share of realized PnL attributable to the wallet's single largest event. A high Conviction score indicates barbell concentration.
Discipline: z(log n_positions) · -1.1508. Negative loading on position count: fewer, larger bets score higher. The training cohort's most profitable wallets hold fewer resolved positions than average.
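The three pillars combine into a single number roughly as sketched below. The coefficients are the ones listed above; everything else is an assumption for illustration: natural logarithms, no intercept term, and placeholder reference means and standard deviations in `REF` (the real standardization constants are available on request and are not reproduced here).

```python
import math

# Coefficients as listed for the three pillars.
COEF_POSTURE = 0.7876
COEF_CONVICTION = 2.7220
COEF_DISCIPLINE = -1.1508

# Hypothetical (mean, std) pairs standing in for the reference-cohort constants.
REF = {
    "neg_skill_brier": (0.0, 0.1),
    "log_concentration": (-1.0, 0.5),
    "log_n_positions": (3.0, 1.0),
}

def zscore(value, ref_mean, ref_std):
    return (value - ref_mean) / ref_std

def edge_score(skill_brier, concentration, n_positions, ref=REF):
    """Weighted sum of the three standardized pillars (assumed form, no intercept)."""
    posture = zscore(-skill_brier, *ref["neg_skill_brier"])
    conviction = zscore(math.log(concentration), *ref["log_concentration"])
    discipline = zscore(math.log(n_positions), *ref["log_n_positions"])
    return (COEF_POSTURE * posture
            + COEF_CONVICTION * conviction
            + COEF_DISCIPLINE * discipline)

# A wallet sitting exactly at the placeholder reference means scores about zero.
baseline = edge_score(0.0, math.exp(-1.0), math.exp(3.0))
wallet = edge_score(skill_brier=-0.05, concentration=0.6, n_positions=12)
print(round(baseline, 6), round(wallet, 3))
```

Because Conviction carries the largest coefficient, a one-standard-deviation move in log concentration shifts the score more than the same move in either other pillar.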
Related work
Wilson (2023), “Hedge Funds With(out) Edge” (SSRN 4513205). Defines a skill measure also called “Edge” for equity hedge funds based on VIX-short loading. Different problem, different features, different benchmark; name collision only. Cited and differentiated.
Forsberg, Gallagher, Warren (2021). Peer-cohort persistence framework for hedge fund skill. Our temporal-holdout experiment (deferred to V1.5) extends this to prediction-market wallets.
Fama and French (2010). Canonical bootstrap null for skill-versus-luck. Our E6 replicates the exact protocol with Edge Score in place of mutual fund alpha, yielding p < 0.0001 on 10,000 permutations.
López de Prado (2018). Purging and embargo for time-overlapping rank signals. To be applied to the deferred E2 once per-position outcome data is available.
Augenblick and Rabin (2021, QJE). Excess belief movement on prediction-market platforms. Cited to motivate why raw Brier is an insufficient skill statistic and why the composite adds conviction and discipline.
Responses to anticipated critiques
The six objections below are the ones a sophisticated reader will raise against V1. Each is stated in the sharpest form, followed by Convexly Research's response. Critiques #2 and #5 are addressed directly in the paper's Limitations section (§7); the others are covered here and partially in existing §7 bullets.
1. “n = 8,656 is a single draw from a fat-tailed population. Findings don't generalize under Hill α = 1.28.”
The paper does not claim universal generalization. Every statistical claim is scoped to the Polymarket profit leaderboard cohort at 2026-04-15 and its out-of-fold structure. V1.5 is pre-registered to replicate the composite on Kalshi and Manifold cohorts once per-position outcome data is available. Until V1.5 ships, the honest read of V1 is “this is what the leaderboard-filtered Polymarket population looks like,” not “this is what prediction-market skill looks like in general.”
2. “OOF Spearman +0.514 is a point estimate on an infinite-variance target. Confidence intervals are suspect.”
Spearman is a rank statistic. Its sampling distribution is computed over the empirical ranks of the joint distribution, which are bounded by construction. The sampling distribution of Spearman is therefore well-behaved even when the underlying variable (realized PnL) has infinite theoretical variance. Bootstrap confidence bands reported in §5.1 and §5.4 of the paper are on Spearman itself; the paper does not report Pearson correlation, OLS R², or any t-statistic on realized PnL. Parametric moment-based inference would fail under α = 1.28; rank-based inference does not. Spelled out in paper §7.
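A small sketch of why the rank statistic is insensitive to the tail: Spearman depends only on the ordering of PnL, so any strictly increasing transform of PnL leaves it unchanged. The `signed_log` definition below (sign times log1p of the magnitude) is an assumed form of "signed log PnL", and the heavy-tailed synthetic data is hypothetical.

```python
import math
import random
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            out[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

def signed_log(x):
    """Assumed transform: strictly increasing, symmetric around zero."""
    return math.copysign(math.log1p(abs(x)), x)

# Heavy-tailed synthetic PnL loosely tied to a score (tail index 1.28).
rng = random.Random(42)
score = [rng.gauss(0, 1) for _ in range(2_000)]
pnl = [(s + rng.gauss(0, 1)) * rng.paretovariate(1.28) for s in score]

rho_raw = spearman(score, pnl)
rho_log = spearman(score, [signed_log(p) for p in pnl])
print(round(rho_raw, 3), round(rho_log, 3))
```

Both calls see the same rank vector for PnL, so extreme realized values affect the statistic only through their position in the ordering, never through their magnitude.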
3. “Survivorship bias. The leaderboard is a positive-profit filter.”
Acknowledged directly in paper §7 as the first limitation. The cohort describes differences among survivors, not expected outcomes for a randomly drawn new trader. Peters (2019) on ergodicity is the relevant reference. V1.5 adds a supplementary cohort of active-but-unranked wallets to measure the selection shift explicitly.
4. “The V3b-over-V1 tiebreaker was tested on the same training cohort. Circular.”
Correct, and acknowledged in paper §5.4 and §7. V3b is the shipped composite because its first feature (baseline-adjusted Brier) is structurally harder to game than V1's raw Brier, not because it has out-beaten V1 in a blinded test. The definitive test is a held-out cohort with deliberately injected market-selection gaming, pre-registered as a V1.5 experiment. Today, the preference is a principled methodological choice, not a validated empirical one.
5. “The Fama-French bootstrap assumes permutability. Under fat tails, permutations don't preserve the tail structure.”
The bootstrap in §5.6 permutes wallet-PnL labels and recomputes Spearman against the held-out composite. Under the null hypothesis that Edge Score has no association with PnL, wallet-PnL pairings are exchangeable by construction. The permutation distribution preserves the tail structure of the marginal PnL distribution because the operation permutes labels, not values. The tail shape is determined by the PnL marginal and is fixed across every permutation; only the wallet-to-PnL mapping varies. This is the standard Fama-French (2010) protocol applied without modification. Spelled out in paper §7.
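The label-permutation procedure described above can be sketched as follows. It is a plain permutation test on Spearman, written as an illustration rather than as the paper's script. The add-one p-value convention and the function names are assumptions, and the demo uses fewer permutations than the 10,000 in the paper to keep it quick.

```python
import random
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            out[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def permutation_null(score, pnl, n_perm=10_000, seed=42):
    """One-sided permutation test: shuffle which wallet gets which PnL."""
    rs, rp = ranks(score), ranks(pnl)
    observed = pearson(rs, rp)  # Spearman = Pearson on ranks
    rng = random.Random(seed)
    shuffled = rp[:]            # same PnL values every time, new pairing
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(rs, shuffled) >= observed:
            hits += 1
    # Add-one convention (assumed): the p-value can never be exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic cohort with a real association between score and PnL.
rng = random.Random(0)
score = [rng.gauss(0, 1) for _ in range(200)]
pnl = [s + rng.gauss(0, 1) for s in score]
observed, p_value = permutation_null(score, pnl, n_perm=2_000)
print(round(observed, 3), p_value)
```

Each shuffle reorders the same PnL vector, so the marginal distribution, including its tail, is identical in every permuted sample. With 10,000 permutations and no permuted statistic reaching the observed one, this convention gives 1/10,001, which is below 0.0001.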
6. “Sizing and concentration are endogenous to conviction. You've measured chicken-and-egg.”
Concentration as defined in §3 is the share of realized PnL attributable to the wallet's single largest event. This is a sizing-independent ratio. A wallet that sized uniformly across 100 bets and got lucky on one large winner has high concentration despite flat sizing; a wallet that sized every bet at $10K has equally flat stakes yet can have low concentration. The composite does not use total_risked or any capital proxy, precisely to avoid conflating conviction with capital. Endogeneity between concentration and conviction is real but operates through outcome realization, not through model construction.
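The two wallets in this reply can be worked through numerically. The sketch below assumes concentration is computed on absolute per-event PnL (the treatment of losing events is not specified here), and the dollar figures are invented for illustration.

```python
def concentration(event_pnl):
    """Share of total absolute realized PnL from the single largest event.

    event_pnl maps an event id to realized PnL. Using absolute values is an
    assumption made so that the ratio stays within (0, 1].
    """
    magnitudes = [abs(p) for p in event_pnl.values()]
    total = sum(magnitudes)
    if total == 0:
        raise ValueError("wallet has no realized PnL")
    return max(magnitudes) / total

# Wallet A: 100 equal-sized bets, 99 small wins and one lucky large winner.
wallet_a = {f"event_{i}": 5.0 for i in range(99)}
wallet_a["event_99"] = 5_000.0

# Wallet B: 100 equal $10K bets, each realizing the same modest PnL.
wallet_b = {f"event_{i}": 500.0 for i in range(100)}

conc_a = concentration(wallet_a)
conc_b = concentration(wallet_b)
print(round(conc_a, 3), round(conc_b, 3))
```

Both wallets size every bet identically, yet their concentration differs by almost two orders of magnitude. Stake size never enters the function, which is the sense in which the ratio is sizing-independent.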
This section is treated as a living document. If a sharper version of any critique arrives by email, this addendum will be updated with the stronger form and the reply. Updates are dated; V1 itself is frozen.
Methodology review welcomed
If you work on prediction markets, forecasting, or fund-manager skill measurement and you want to review the methodology, we would value the feedback. V1 of the paper is the current artifact; V1.5 will fold in Kalshi or Manifold cross-venue replication once per-position outcome data is available.
Reach out: research@convexly.app.
Reproducibility
Code, raw validation outputs, anonymized cohort CSV (addresses hashed), frozen coefficients, and reference standardization constants are available on request. The pre-registration text and pass-fail thresholds were committed before the validation script ran; revisions to the pre-registration are timestamped and available for inspection. All random seeds are set to 42.
Run Edge Score on a real wallet
The scoring layer is live. Paste any Polymarket wallet address and see its Edge Score against the reference cohort in 30 seconds.