Published Version 1.0

Edge Score Methodology V1

A composite skill measure for prediction-market traders. Published 2026-04-18. Byline: Convexly.

Methodology update 2026-04-27

The deferred V1.5 experiments E2 (per-wallet temporal holdout) and E7 (per-quarter IC stability) have now been run on the V1-M position tape. Both pre-registered primary tests failed their ex-ante thresholds. The +0.514 OOF Spearman is an in-sample, cross-wallet figure. The forward result is the per-wallet temporal holdout: out-of-sample Spearman ρ = +0.111 (95% CI [+0.046, +0.175], N = 805 paired wallets), well below the pre-registered ρ ≥ +0.30 threshold. The rolling +0.391 mean across 26 windows is an in-sample contemporaneous diagnostic, not a forward test: score and PnL share the same in-window positions, and conviction is PnL-derived, so the figure overstates predictive skill. The accurate reading is that V3b is a cross-sectional ranker of wallet behavior, not a per-wallet temporal predictor. The full V1.5 result is at /research/edge-score-methodology-v1-5; the methodology disclosure with all caveats is at /methodology.

Download paper (PDF, 13 pages) | Request methodology review

Key findings (TL;DR)

1. Calibration alone is a weak skill ranker. The out-of-fold Spearman rank correlation between a Brier-only ranking and signed log PnL is +0.147. Edge Score V3b, the three-pillar composite, raises that to +0.514 on the same cohort.
2. Frozen coefficients, versioned validation, reproducible. Coefficients (Posture +0.79, Conviction +2.72, Discipline −1.15) were committed to git before validation. Fama-French 2010 bootstrap null at 10,000 permutations: p < 0.0001. Hill alpha on per-wallet PnL = 1.28 (CI 1.20-1.36): fat-tailed, so all inference is rank-based.
3. Public data bundle, single-script reproduction. The bundle contains 8,656 wallets with frozen per-wallet statistics, the validation report, and a stdlib-only Python script that re-pulls the cohort and recomputes the composite. No proprietary inputs are required.

What this means for traders: your Brier score alone tells you almost nothing about whether your edge is real. The Edge Score V3b composite combines three signals that empirically correlate with realized profit on the V1 training cohort. V1.5 explicitly tested per-wallet temporal predictive power and reported both pre-registered primary tests failed; treat the composite as a cross-sectional skill ranker, not a per-wallet forward forecast. Run yours via the free wallet analyzer.

Abstract

We report a composite scoring layer for prediction-market traders fit on a frozen cohort of 8,656 Polymarket wallets with at least five resolved positions. The score, Edge Score V3b, combines three standardized predictors: a posture term derived from baseline-adjusted Brier score, a conviction term derived from PnL concentration in the wallet's single largest event, and a discipline term derived from resolved position count. Under a 5-fold cross-validation with fold-local coefficient refit, the composite achieves an out-of-fold Spearman rank correlation of +0.514 with signed log PnL, against +0.147 for a Brier-only baseline.

A Fama-French 2010 bootstrap null with 10,000 PnL permutations places the observed Spearman outside every permuted sample, one-sided p < 0.0001. Subgroup stability holds on six cross-sections. Hill alpha on realized PnL is 1.28 (95% CI 1.20 to 1.36), so the composite ranks median outcomes rather than expected returns. Two ex-ante validation experiments requiring per-position outcome data are deferred to a follow-up paper (V1.5).

Validation summary (five of seven)

#	Experiment	Result
E1	5-fold CV (V3b), fold-local refit	Spearman +0.514
E2	Per-wallet temporal holdout	Deferred to V1.5 (data)
E3	Subgroup stability (6 cuts)	Range +0.468 to +0.726
E4	V0-V5 formula sensitivity	V3b kept (market-selection grounds)
E5	Fat-tail Hill α	α = 1.28 (CI 1.20, 1.36)
E6	Bootstrap null, 10,000 permutations	p < 0.0001
E7	IC temporal stability	Deferred to V1.5 (data)

E2 and E7 require per-position outcome data that is not present in the current Polymarket positions extraction. A one-time data pipeline extension, or a cross-venue replication on a cohort where outcomes are accessible (e.g. Manifold), unblocks both experiments. Both are deferred to V1.5 of the paper.
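
For readers who want to see what E5 in the table above measures, here is a minimal sketch of a Hill tail-index estimator. It is not the paper's script: the function name, the choice of k, the restriction to positive values, and the synthetic Pareto sample are all assumptions made for illustration.

```python
import math
import random


def hill_alpha(values, k):
    """Hill estimate of the tail index from the k largest positive values."""
    xs = sorted((v for v in values if v > 0), reverse=True)
    if not 1 <= k < len(xs):
        raise ValueError("k must satisfy 1 <= k < number of positive values")
    threshold = xs[k]  # the (k+1)-th largest value
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log_excess


# Synthetic stand-in for per-wallet PnL magnitudes (assumption: not real data).
rng = random.Random(42)
sample = [rng.paretovariate(1.28) for _ in range(20_000)]
alpha_hat = hill_alpha(sample, k=2_000)
print(f"Hill alpha on a synthetic Pareto(1.28) sample: {alpha_hat:.2f}")
```

A tail index below 2 implies infinite theoretical variance, which is the reason the page restricts itself to rank-based inference.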

The three pillars

Posture: z(-skill_brier) · 0.7876. Rewards wallets whose profit is not driven by precise calibration. On the Polymarket leaderboard, the top-100 wallets by profit are in the worst Brier quartile and account for the majority of realized PnL. The pillar does not measure forecasting skill; it measures whether the trader makes money while calibration is imprecise.

Conviction: z(log concentration) · 2.7220. Concentration is the share of realized PnL attributable to the wallet's single largest event. High Conviction = barbell concentration.

Discipline: z(log n_positions) · -1.1508. Negative loading on position count. Fewer, larger bets score higher. The training cohort's most profitable wallets hold fewer resolved positions than average.
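
The three pillars combine linearly. The sketch below shows one way to compute the composite from the coefficients stated above. The field name `concentration`, the toy wallets, and the within-sample standardization are assumptions: the published bundle ships frozen reference standardization constants, which this example does not reproduce.

```python
import math
from statistics import mean, pstdev

# Frozen coefficients as stated on this page.
COEF_POSTURE = 0.7876
COEF_CONVICTION = 2.7220
COEF_DISCIPLINE = -1.1508


def zscores(xs):
    """Standardize within the sample (assumption: stands in for the frozen constants)."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd for x in xs]


def edge_scores(wallets):
    """Return one composite score per wallet, in input order."""
    posture = zscores([-w["skill_brier"] for w in wallets])
    conviction = zscores([math.log(w["concentration"]) for w in wallets])
    discipline = zscores([math.log(w["n_positions"]) for w in wallets])
    return [
        COEF_POSTURE * p + COEF_CONVICTION * c + COEF_DISCIPLINE * d
        for p, c, d in zip(posture, conviction, discipline)
    ]


# Hypothetical wallets, for illustration only.
toy_wallets = [
    {"skill_brier": 0.10, "concentration": 0.9, "n_positions": 10},
    {"skill_brier": 0.20, "concentration": 0.5, "n_positions": 50},
    {"skill_brier": 0.30, "concentration": 0.2, "n_positions": 200},
]
scores = edge_scores(toy_wallets)
for wallet, score in zip(toy_wallets, scores):
    print(wallet, round(score, 3))
```

Because every term is a z-score, the composite is centered on zero within the cohort used for standardization; only the ordering of wallets carries meaning.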

Related work

Wilson (2023), “Hedge Funds With(out) Edge” (SSRN 4513205). Defines a skill measure also called “Edge” for equity hedge funds based on VIX-short loading. Different problem, different features, different benchmark; name collision only. Cited and differentiated.

Forsberg, Gallagher, Warren (2021). Peer-cohort persistence framework for hedge fund skill. Our temporal-holdout experiment (deferred to V1.5) extends this to prediction-market wallets.

Fama and French (2010). Canonical bootstrap null for skill-versus-luck. Our E6 replicates the exact protocol with Edge Score in place of mutual fund alpha. p < 0.0001 on 10,000 permutations.

López de Prado (2018). Purging and embargo for time-overlapping rank signals. To be applied to the deferred E2 once per-position outcome data is available.

Augenblick and Rabin (2021, QJE). Excess belief movement on prediction-market platforms. Cited to motivate why raw Brier is an insufficient skill statistic and why the composite adds conviction and discipline.

Responses to anticipated critiques

The six objections below are the ones a sophisticated reader will raise against V1. Each is stated in the sharpest form, followed by Convexly's response. Critiques #2 and #5 are addressed directly in the paper's Limitations section (§7); the others are covered here and partially in existing §7 bullets.

1. “n = 8,656 is a single draw from a fat-tailed population. Findings don't generalize under Hill α = 1.28.”

The paper does not claim universal generalization. Every statistical claim is scoped to the Polymarket profit leaderboard cohort at 2026-04-15 and its out-of-fold structure. V1.5 is planned to replicate the composite on the Manifold cohort once per-position outcome data is available. Until V1.5 ships, the accurate read of V1 is “this is what the leaderboard-filtered Polymarket population looks like,” not “this is what prediction-market skill looks like in general.”

2. “OOF Spearman +0.514 is a point estimate on an infinite-variance target. Confidence intervals are suspect.”

Spearman is a rank statistic. Its sampling distribution is computed over the empirical ranks of the joint distribution, which are bounded by construction. The sampling distribution of Spearman is therefore well-behaved even when the underlying variable (realized PnL) has infinite theoretical variance. Bootstrap confidence bands reported in §5.1 and §5.4 of the paper are on Spearman itself; the paper does not report Pearson correlation, OLS R², or any t-statistic on realized PnL. Parametric moment-based inference would fail under α = 1.28; rank-based inference does not. Spelled out in paper §7.
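
The rank argument can be checked numerically. The sketch below computes Spearman as Pearson correlation on average ranks and shows that an extreme PnL value cannot move it. The toy data and the signed-log definition, sign(x) · log(1 + |x|), are assumptions; the page does not state the paper's exact transform.

```python
import math


def average_ranks(xs):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    return pearson(average_ranks(a), average_ranks(b))


def signed_log(x):
    """Assumed transform: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


# Hypothetical scores and PnL, with one extreme winner.
score = [0.1, 0.4, 0.2, 0.9, 0.7]
pnl = [-50.0, 300.0, 10.0, 2_000_000.0, 120.0]

rho_raw = spearman(score, pnl)
rho_log = spearman(score, [signed_log(p) for p in pnl])
rho_scaled = spearman(score, [p * 1000 if p > 1e6 else p for p in pnl])
print(rho_raw, rho_log, rho_scaled)  # identical: only the ordering matters
print(pearson(score, pnl))           # moment-based, dominated by the outlier
```

Any strictly increasing transform of PnL leaves the ranks, and therefore Spearman, unchanged, so the statistic is the same whether it is computed on raw or signed log PnL.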

3. “Survivorship bias. The leaderboard is a positive-profit filter.”

Acknowledged directly in paper §7 as the first limitation. The cohort describes differences among survivors, not expected outcomes for a randomly drawn new trader. Peters (2019) on ergodicity is the relevant reference. V1.5 adds a supplementary cohort of active-but-unranked wallets to measure the selection shift explicitly.

4. “The V3b-over-V1 tiebreaker was tested on the same training cohort. Circular.”

Correct, and acknowledged in paper §5.4 and §7. V3b is the shipped composite because its first feature (baseline-adjusted Brier) is structurally harder to game than V1's raw Brier, not because it has outperformed V1 in a blinded test. The definitive test is a held-out cohort with deliberately injected market-selection gaming, planned as a V1.5 follow-up experiment. Today, the preference is a principled methodological choice, not a validated empirical one.

5. “The Fama-French bootstrap assumes permutability. Under fat tails, permutations don't preserve the tail structure.”

The bootstrap in §5.6 permutes wallet-PnL labels and recomputes Spearman against the held-out composite. Under the null hypothesis that Edge Score has no association with PnL, wallet-PnL pairings are exchangeable by construction. The permutation distribution preserves the tail structure of the marginal PnL distribution because the operation permutes labels, not values. The tail shape is determined by the PnL marginal and is fixed across every permutation; only the wallet-to-PnL mapping varies. This is the standard Fama-French (2010) protocol applied without modification. Spelled out in paper §7.
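
A label-permutation null of this kind can be sketched in a few lines. This is an illustration, not the paper's validation script: the synthetic fat-tailed data, the function names, and the (exceedances + 1) / (permutations + 1) p-value convention are assumptions.

```python
import math
import random


def average_ranks(xs):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def permutation_test(scores, pnl, n_perm=10_000, seed=42):
    """One-sided permutation p-value for Spearman(scores, pnl)."""
    score_ranks, pnl_ranks = average_ranks(scores), average_ranks(pnl)
    observed = pearson(score_ranks, pnl_ranks)
    rng = random.Random(seed)
    # Shuffling the ranks is equivalent to reassigning PnL values to wallets:
    # the set of PnL values, and hence the tail, is the same in every draw.
    shuffled = list(pnl_ranks)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(score_ranks, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Synthetic cohort (assumption): fat-tailed magnitudes, sign related to score.
rng = random.Random(42)
scores = [rng.gauss(0, 1) for _ in range(200)]
pnl = [math.copysign(rng.paretovariate(1.28), s + rng.gauss(0, 0.5)) for s in scores]
observed, p_value = permutation_test(scores, pnl, n_perm=2_000)
print(f"observed Spearman {observed:+.3f}, one-sided p = {p_value:.4f}")
```

Under the convention assumed here, an observed statistic that exceeds all 10,000 permuted values yields p = 1/10,001, which is just below 0.0001.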

6. “Sizing and concentration are endogenous to conviction. You've measured chicken-and-egg.”

Concentration as defined in §3 is the share of realized PnL attributable to the wallet's single largest event. This is a sizing-independent ratio. A wallet that sized uniformly across 100 bets and got lucky on one large winner has high concentration but low sizing; a wallet that sized every bet at $10K uniformly has roughly equal stakes but can have low concentration. The composite does not use total_risked or any capital proxy, precisely to avoid conflating conviction with capital. Endogeneity between concentration and conviction is real but operates through outcome realization, not through model construction.
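
The two wallets in the paragraph above can be made concrete. The sketch below uses a hypothetical definition, largest absolute event PnL divided by total absolute event PnL; how the paper treats losing events is not stated on this page, so that handling and the toy numbers are assumptions.

```python
def concentration(event_pnl):
    """Share of total absolute realized PnL that comes from the single largest event."""
    magnitudes = [abs(p) for p in event_pnl]
    total = sum(magnitudes)
    if total == 0:
        raise ValueError("wallet has no realized PnL")
    return max(magnitudes) / total


# Wallet A: 100 small, uniform bets; 99 resolve for +/- $20, one long shot pays $5,000.
wallet_a = [20.0 if i % 2 == 0 else -20.0 for i in range(99)] + [5_000.0]

# Wallet B: 20 bets at a uniform $10K stake, each resolving for about +/- $1,000.
wallet_b = [1_000.0 if i % 2 == 0 else -900.0 for i in range(20)]

conc_a = concentration(wallet_a)
conc_b = concentration(wallet_b)
print(f"wallet A concentration: {conc_a:.3f}")
print(f"wallet B concentration: {conc_b:.3f}")
```

Stake size never enters the function, which mirrors the statement that the composite uses no capital proxy: wallet A risks far less per bet than wallet B yet scores much higher on concentration.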

This section is treated as a living document. If a sharper version of any critique arrives by email, this addendum will be updated with the stronger form and the reply. Updates are dated; V1 itself is frozen.

Methodology review welcomed

If you work on prediction markets, forecasting, or fund-manager skill measurement and you want to review the methodology, we would value the feedback. V1 of the paper is the current artifact; V1.5 will fold in Manifold cross-venue replication once per-position outcome data is available.

Reach out: research@convexly.app.

Reproducibility

Code, raw validation outputs, anonymized cohort CSV (addresses hashed), frozen coefficients, and reference standardization constants are available as a direct download in the V1-M public data bundle. The ex-ante methodology document and pass-fail thresholds were committed to a version-controlled repository before the validation script ran; the commit history provides the timestamped audit trail. We did not file an external pre-registration with a third-party registry (OSF, AsPredicted, AEA) for V1; internal ex-ante commitment is the standard the V1 paper meets. The deferred V1.5 experiments (E2 and E7) are externally pre-registered at AsPredicted #287368 (filed 2026-04-25, anonymous). All random seeds are set to 42.

AI tooling disclosure

This work used AI tools (Claude, GPT-4) as research aids during methodology design and pre-publication review. All claims, statistical results, and figures are reproducible from the frozen-commit code and data in the linked V1-M public data bundle. No claim on this page is taken as true on the basis of an AI tool's output; every quantitative result is recomputable from the bundle with the documented seed.