Methodology · last updated 2026-04-27
What Convexly's methodology is, and what it isn't
This page summarizes the Edge Score methodology stack (V1, V1-M, V1.5, V2.8.2, forward-validation), the validation results against frozen coefficients, the explicit limitations, and what each pre-registered follow-up actually found.
Update 2026-04-27
V1.5 deferred experiments E2 (per-wallet temporal holdout) and E7 (per-quarter IC stability) ran on the V1-M position tape. Both pre-registered primary tests failed their ex-ante thresholds. V2.8.2 (24-aggregator sweep on V1-M) finds no aggregator beats the market-implied baseline. The honest reframing: V3b is a cross-sectional skill ranker that holds up across in-sample OOF, forward-validation, and partial-correlation control for capital. It is not a per-wallet temporal predictor and it is not a forecast-aggregation weight. Full results in the V1.5 and V2.8.2 sections below.
Edge Score V3b composite
The composite is a frozen-coefficient three-pillar weighted sum. Coefficients (Posture +0.7876, Conviction +2.7220, Discipline -1.1508) were committed to a version-controlled repository before the validation script ran. Validation produced an out-of-fold Spearman rank correlation of +0.514 between Edge Score and signed log realized PnL on the V1 cohort. A Brier-only baseline produces +0.147 on the same cohort.
Each pillar is z-score standardized against the V1 reference cohort. Posture rewards baseline-adjusted Brier (negated), Conviction rewards PnL concentration on the largest event, Discipline penalizes total resolved-position count (fewer larger bets score higher).
Full derivation, OLS fit, OOF folds, and every intermediate statistic is in the V1 paper.
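The construction above can be sketched in a few lines. This is an illustrative reimplementation, not the repository code: the pillar inputs and the `ref` dict of reference-cohort arrays are hypothetical shapes, while the three coefficients are the frozen values quoted above.

```python
import numpy as np

# Frozen V3b coefficients as committed before validation (from the V1 paper).
W_POSTURE, W_CONVICTION, W_DISCIPLINE = 0.7876, 2.7220, -1.1508

def zscore(x, ref):
    """Standardize x against the reference cohort's mean and std."""
    ref = np.asarray(ref, float)
    return (np.asarray(x, float) - ref.mean()) / ref.std()

def edge_score_v3b(posture, conviction, discipline, ref):
    """Three-pillar weighted sum on reference-cohort z-scores.

    `ref` is a dict of V1 reference-cohort arrays keyed by pillar name
    (a hypothetical structure for this sketch). Note the raw Discipline
    input is a position count, so its negative weight means fewer,
    larger bets score higher.
    """
    return (W_POSTURE    * zscore(posture,    ref["posture"])
          + W_CONVICTION * zscore(conviction, ref["conviction"])
          + W_DISCIPLINE * zscore(discipline, ref["discipline"]))
```

Because each pillar is standardized against the same frozen cohort, scores are comparable across wallets scored at different times.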
Frozen-coefficient vs externally pre-registered
V1 is internally pre-registered. The ex-ante methodology document and pass-fail thresholds were committed to git before the validation script ran. The commit history provides the timestamped audit trail. This is internal pre-registration, not external pre-registration with OSF, AsPredicted, or AEA.
V1.5 was externally pre-registered, ran on schedule, and is published. The two ex-ante validation experiments that require per-position outcome data (per-wallet temporal holdout E2, per-quarter Information Coefficient stability E7) were filed at AsPredicted #287368 on 2026-04-25 and ran on 2026-04-27. Both primary tests failed their ex-ante pass criteria. The full result is in the V1.5 section below and in the dedicated V1.5 paper.
V2.8.2 is externally pre-registered. AsPredicted #287436 (in-sample) and #287442 (forward-only), filed before any analysis code ran. Three frozen amendments to V2.8.2 are committed to the repository: one documenting the closure-caps choice, one substituting the V1-M reference cohort after the original two-hop closure plan exceeded available API capacity (filed at AsPredicted #287714), and the audit-trail amendment for the gate-script v1m_cohort_substitution code path. PRIMARY hypothesis test ran on the V1-M cohort 2026-04-27; result is in the V2.8.2 section below.
Validation results
- In-sample cross-wallet OOF Spearman: +0.514
- Calibration-only baseline Spearman: +0.147
- Fama-French (2010)-style bootstrap null with 10,000 permutations: the observed Spearman lies outside the entire permuted distribution, p < 0.0001
- Subgroup stability across six cuts: Spearman range +0.468 to +0.726
- Forward-validation rolling 30-day Spearman across 26 monthly windows on the V1-M cohort, 2025-10-01 → 2026-04-25: mean +0.391 (median +0.375). All 26 windows finished above the +0.147 Brier-only baseline. Worst window +0.299, best window +0.520. See the live forward-validation dashboard.
- V1.5 per-wallet temporal holdout (frozen V1 coefficients, N=805 paired wallets): Spearman = +0.111 (95% CI [+0.046, +0.175]). Pre-reg threshold ρ ≥ +0.30. Failed the pre-registered pass criterion. Positive and significant, but well below the threshold.
- V1.5 partial Spearman of V3b vs PnL controlling for log absolute capital (N=7,805): +0.494 (95% CI [+0.477, +0.512]). Stronger than the marginal +0.322; capital is a suppressor, not a confounder.
- Hill tail index alpha on realized PnL: 1.28 (95% CI 1.20-1.36; variance formally infinite, justifying rank-based rather than parametric inference)
- V1 cohort: 8,656 Polymarket wallets with at least 5 resolved positions
- V1-M Manifold cohort: 15,106 users with at least 25 resolved markets in the cross-venue extension
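The partial-correlation control reported above (V3b vs PnL holding capital fixed) can be sketched as a partial Spearman correlation: Pearson correlation of the rank residuals after regressing out the control's ranks. This is a generic implementation, not the V1.5 script; the function name and argument shapes are ours.

```python
import numpy as np
from scipy import stats

def partial_spearman(x, y, control):
    """Spearman correlation of x and y after partialling out `control`.

    Rank-transform all three series, OLS-residualize the x and y ranks
    on the control ranks, then take the Pearson correlation of the
    residuals -- the standard partial rank correlation.
    """
    rx, ry, rc = (stats.rankdata(v).astype(float) for v in (x, y, control))
    A = np.column_stack([np.ones_like(rc), rc])  # intercept + control ranks

    def residual(r):
        beta, *_ = np.linalg.lstsq(A, r, rcond=None)
        return r - A @ beta

    return float(np.corrcoef(residual(rx), residual(ry))[0, 1])
```

In the V1.5 setting the call would be roughly `partial_spearman(edge_score, signed_log_pnl, log_abs_capital)`; a partial correlation above the marginal one is the suppressor pattern described in the bullet.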
What V1.5 found (the deferred experiments)
V1.5 ran on 2026-04-27 per AsPredicted #287368. Both pre-registered primary tests failed their ex-ante thresholds. Five exploratory supplementary analyses produced a mixed picture of where V3b holds up.
- E2 per-wallet temporal holdout (frozen V1 coefficients, train 2024-01-01 → 2025-09-30, 14-day embargo, test 2025-10-15 → 2026-04-15, N=805 paired wallets): Spearman = +0.111 [+0.046, +0.175], p ≈ 0.001. Pre-reg pass criterion: ρ ≥ +0.30 AND CI lower > 0. Fail. Refitting V3b on the training fold produces ρ = -0.082, with the Posture coefficient flipping sign vs the frozen V1 fit.
- E7 per-quarter IC stability (pre-reg quarters Q2 2024 → Q2 2025): median per-quarter Spearman = +0.038. 3 of 5 quarters positive (need ≥5 of 6 per pre-reg). 2025Q1 was strongly negative (ρ = -0.164, p < 0.001, N=1,230). Fail. Per-wallet V3b ranking IS stable across quarters (lag-1 median +0.31, all 8 quarter pairs significant); the stable thing is behavior, not same-quarter alignment with PnL.
- S3 stratified by sample size: V3b predicts strongest at LOW volume and weakens at high volume. Bucket A (5-29 positions, N=3,195): ρ = +0.371. Bucket B (30-100, N=3,089): +0.150. Bucket C (101+, N=1,521): +0.013 (CI crosses zero). Counter to the naive expectation that more data produces a stronger signal. Mechanism: high-volume wallets are more likely market-makers / bots whose PnL is dominated by spread capture rather than directional skill.
- S6 persistent-wallet inversion: 229 wallets active in 4+ quarters show V3b inversely correlated with cumulative signed log PnL, ρ = -0.31 (95% CI [-0.42, -0.20]). Survivorship effect: long-running wallets that stick around include conservative-behavior patterns that protect against ruin without generating large PnL.
- S7 cross-category transfer: V3b refit on one Polymarket category transfers to others with ρ +0.17 to +0.33, comparable to within-category baselines (+0.23, +0.36, +0.27). V3b is more stable across categories than across time.
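E2's embargoed holdout design reduces to a pure date filter over the position tape. The dates below are the pre-registered ones; the `(wallet_id, resolved_date, pnl)` tuple shape is a hypothetical simplification of the actual tape.

```python
from datetime import date, timedelta

TRAIN_END  = date(2025, 9, 30)
EMBARGO    = timedelta(days=14)
TEST_START = date(2025, 10, 15)   # strictly after TRAIN_END + EMBARGO
TEST_END   = date(2026, 4, 15)

def split_positions(positions):
    """Split resolved positions into train/test folds with a 14-day embargo.

    `positions` is an iterable of (wallet_id, resolved_date, pnl) tuples
    (hypothetical shape). Positions resolving inside the embargo window,
    or after TEST_END, are dropped so no outcome leaks across the
    train/test boundary.
    """
    train, test = [], []
    for wallet, resolved, pnl in positions:
        if resolved <= TRAIN_END:
            train.append((wallet, resolved, pnl))
        elif TEST_START <= resolved <= TEST_END:
            test.append((wallet, resolved, pnl))
        # else: embargo gap or out of range -- excluded
    return train, test
```

Wallets are then "paired" by requiring enough resolved positions on both sides of the gap before computing the per-wallet holdout Spearman.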
Honest reframing: V3b's defensible claim moves from “forward predictor of PnL” to “cross-sectional ranker of wallet behavior whose temporal alignment with PnL is sample-size and time-window dependent.” Full V1.5 paper at /research/edge-score-methodology-v1-5.
What V2.8.2 found (the aggregator stack)
V2.8.2 PRIMARY hypothesis: does W-EXP β=4 + a=2.0 + S-RAW skill-weighted aggregation produce lower mean Brier loss than equal-weighted and market-implied baselines on the held-out window? Per AsPredicted #287436. Cohort substitution to V1-M per AsPredicted #287714. Ran 2026-04-27.
- PRIMARY result: on 6,256 held-out markets, the skill-weighted aggregator produced a Brier delta of +0.058 (95% CI [+0.054, +0.062]) vs equal-weighted baseline and +0.179 (95% CI [+0.164, +0.193]) vs market-implied baseline. Effect threshold per pre-reg was δ < -0.005. Direction reproduced in all 3 cross-fit folds and all 3 execution tiers. Does not reject H0. The aggregator is significantly worse than both baselines, not equivalent to them.
- 24-aggregator sweep: the V2.8.2 pre-registration listed 24 aggregator specifications (PRIMARY plus 23 others). NONE reject H0_market on the V1-M cohort. Best aggregator was W-MANSKI (Imbens-Manski midpoint), with δ_market = +0.057 [+0.049, +0.066]. W-MANSKI beats the equal-weighted baseline (δ_naive = -0.064, Sharpe +0.336) but does so by shrinking aggregates toward 0.5 (mechanically reducing Brier loss at the cost of being uninformative). PRIMARY reproducible bit-for-bit.
- Pre-data gates G2-G8 with real markets-universe data (re-run 2026-04-27 on 163,710 markets with 99.85% gamma metadata coverage): 5 of 9 gates pass. G1 closure 1.00 (satisfied by construction under the V1-M cohort substitution). G4 reachability 1.00. G5 staleness 0.9951. G8 powered categories satisfied. Failures: G2 regime stability 0.407 (vs 0.50 threshold), G3 endDate audit 0.9137 (vs 0.95), and a G2b per-quarter concentration scoping issue fixable in V2.9. G6 sigma 0.39 (vs 0.063 ceiling) reflects the PRIMARY aggregator's noise; the negative PRIMARY result is robust to that noise because the effect is large and wide-margin negative. G7 N_w power floor 1.98% reflects V1-M cohort thinness and is deferred to the V2.9 two-hop closure.
Substantive reading: PnL-skill on Polymarket is not a usable forecast-aggregation weight. The Edge Score V1 weighting (BSS_w as the per-wallet skill score), imported into a forecast-aggregation context, produces aggregates that are systematically over-extremized vs ground truth across the entire pre-registered aggregator family. This reproduces and reinforces the V1-M paper's null finding. Full V2.8.2 results at docs/research/marketalpha/v2.8.2_path_d_results_20260427_220721.md in the public repository; a publishable V2.8.2 paper page is in production.
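The δ comparisons above reduce to differences in mean Brier loss between an aggregated forecast and a baseline forecast on the same outcomes. The sketch below is illustrative, not the V2.8.2 harness: reading "W-EXP β=4" as weights proportional to exp(β·skill) is our assumption, and the array shapes are hypothetical.

```python
import numpy as np

def brier(p, y):
    """Mean Brier loss of probability forecasts p against outcomes y in {0, 1}."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.mean((p - y) ** 2))

def skill_weighted_aggregate(forecasts, skills, beta=4.0):
    """One reading of W-EXP beta=4: w_i proportional to exp(beta * skill_i).

    `forecasts` is (n_wallets, n_markets); returns the weighted mean
    forecast per market. Illustrative sketch, not the V2.8.2 code.
    """
    w = np.exp(beta * np.asarray(skills, float))
    w = w / w.sum()
    return w @ np.asarray(forecasts, float)

def brier_delta(agg_p, base_p, y):
    """delta < 0 means the aggregator beats the baseline (the pre-reg
    pass criterion was delta < -0.005 vs both baselines)."""
    return brier(agg_p, y) - brier(base_p, y)
```

With this framing, the V2.8.2 PRIMARY result is a positive δ against both the equal-weighted mean (`forecasts.mean(axis=0)`) and the market-implied price, i.e. the skill weighting made the aggregate worse.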
What the V1 paper does NOT claim
The +0.514 OOF Spearman is moderate, not deterministic. Reference points: IQ tests vs college GPA correlate roughly +0.45, SAT vs first-year GPA roughly +0.5, FICO vs default roughly +0.7-0.8. The Edge Score is a useful rank-ordering signal, not a guarantee of next-period PnL for any individual wallet.
Edge Score is venue-specific by design. The V1-M paper shows that pillar coefficients diverge categorically between Polymarket and Manifold, with the Discipline pillar flipping sign at permutation p = 0.0001. A “Polymarket Edge Score” of 87 is not directly comparable to a “Manifold Edge Score” of 87.
Edge Score does not separate skill from luck on a single wallet's realized PnL history. V1.5 E2 (per-wallet temporal holdout) tested this directly and produced Spearman = +0.111 on N=805 paired wallets, well below the pre-registered +0.30 threshold. Use Edge Score for cross-wallet ranking, not for single-wallet temporal prediction.
Acknowledged limitations
- Survivorship bias. The cohort describes differences among wallets that have ranked on the Polymarket profit leaderboard. Wallets that blew up are not in the cohort. The cohort therefore measures cross-sectional skill differences among survivors, not expected returns for a randomly drawn new trader. Per Peters (2019), ensemble averages on a survivor set do not transfer to the time-average experience of an individual trader. V1.5 adds a supplementary cohort of active-but-unranked wallets to measure the selection shift explicitly.
- V3b vs V1 selection on training cohort. V3b is the shipped composite because its first feature (baseline-adjusted Brier) is structurally harder to game than V1's raw Brier, not because V3b outperforms V1 in a held-out blinded test. The definitive test is a held-out cohort with deliberately injected market-selection gaming, planned as a V1.5 follow-up. The current preference is a principled methodological choice, not a validated empirical one.
- Concentration is size-independent by construction. The Conviction pillar uses share of realized PnL attributable to the wallet's largest single event, not absolute capital. A separate question (does Edge Score correlate with absolute wallet size?) is not formally tested in V1 and remains an open empirical item.
- Forward-validation rolling Spearman is now shipped. Mean +0.391 across 26 monthly windows on the V1-M cohort, 2025-10-01 → 2026-04-25; all 26 windows above the +0.147 baseline. The dashboard at /research/forward-validation is live and updates as new positions resolve.
- Half-life of predictive power tested in V1.5 S5. Median refit Spearman across 30/60/90/180-day cutoffs: -0.03, +0.01, +0.06, +0.11 (per-cutoff range [-0.148, +0.149]). The weakly positive trend toward longer horizons does not establish a conventional decay curve; the per-wallet temporal signal is small at every tested horizon and not a basis for forward prediction.
- Cross-category transfer tested in V1.5 S7. V3b refit on one Polymarket category transfers to others with Spearman +0.17 to +0.33, comparable to within-category baselines (+0.23, +0.36, +0.27). V3b is more stable across categories than across time.
- Persistent-wallet inversion is a new V1.5 finding. 229 wallets active in 4+ quarters show V3b inversely correlated with cumulative signed log PnL (ρ = -0.31, 95% CI [-0.42, -0.20]). Survivorship effect: long-running wallets that stick around include conservative-behavior patterns that protect against ruin without generating large PnL.
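The rolling-window forward-validation check described above can be sketched as a date-bucketed Spearman loop against the fixed baseline. The record shape and window stepping here are hypothetical simplifications; the live dashboard is the authoritative implementation.

```python
from datetime import date, timedelta
import numpy as np
from scipy import stats

BASELINE = 0.147  # Brier-only baseline Spearman from V1

def rolling_spearman(records, start, n_windows, window_days=30, step_days=30):
    """Per-window Spearman of Edge Score vs realized PnL.

    `records` is a list of (resolved_date, edge_score, pnl) tuples
    (hypothetical shape); each window is assumed to contain at least
    two resolved positions. Returns the per-window correlations and a
    flag for whether every window cleared the fixed baseline.
    """
    out = []
    for k in range(n_windows):
        lo = start + timedelta(days=k * step_days)
        hi = lo + timedelta(days=window_days)
        scores, pnls = zip(*[(s, p) for d, s, p in records if lo <= d < hi])
        out.append(float(stats.spearmanr(scores, pnls)[0]))
    return out, all(r > BASELINE for r in out)
```

The shipped dashboard reports exactly this kind of per-window statistic (mean +0.391, worst +0.299) rather than a single pooled correlation, which is what makes the "all 26 windows above baseline" claim checkable.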
Anticipated critiques and responses
The V1 paper includes a formal “Responses to anticipated critiques” section addressing six sophisticated objections an academic reviewer would raise: bootstrap-under-fat-tails, cross-venue invariance, V3b-vs-V1 circularity, concentration endogeneity, survivorship bias, and pre-registration semantics. Each response is reasoned and cites the relevant literature. Reviewers are welcome to engage with the responses directly.
Reproducibility
The full V1-M data bundle (542K position-aggregated trades across 8,998 Polymarket wallets, with anonymized addresses) is direct-download: v1m-data-bundle.tar.gz. The bundle is the canonical input for V1, V1-M, and V2.8.2 reproducibility. Anyone with the bundle and Python 3.11+ can rerun the methodology against the same data with no API budget.
Methodology code, validation scripts, and frozen amendment commits are public in the Convexly repository at github.com/RedGridTactical/Convexly.
Contact for methodology review
Academic reviewers, journalists with methodology questions, and institutional buyers running due diligence can email research@convexly.app. Methodology critique is welcome; a critique that surfaces an actual error is more valuable to Convexly than another endorsement.