Methodology · last updated 2026-04-27
What Convexly's methodology is, and what it isn't
This page summarizes the Edge Score methodology stack (V1, V1-M, V1.5, V2.8.2, forward-validation), the validation results against frozen coefficients, the explicit limitations, and what each pre-registered follow-up actually found.
Update 2026-04-27
V1.5 deferred experiments E2 (per-wallet temporal holdout) and E7 (per-quarter IC stability) ran on the V1-M position tape. Both pre-registered primary tests failed their ex-ante thresholds. V2.8.2 (24-aggregator sweep on V1-M) finds no aggregator beats the market-implied baseline. The honest reframing: V3b is a cross-sectional skill ranker that holds up across in-sample OOF, forward-validation, and partial-correlation control for capital. It is not a per-wallet temporal predictor and it is not a forecast-aggregation weight. Full results in the V1.5 and V2.8.2 sections below.
Edge Score V3b composite
The composite is a frozen-coefficient three-pillar weighted sum. Coefficients (Posture +0.7876, Conviction +2.7220, Discipline -1.1508) were committed to a version-controlled repository before the validation script ran. Validation produced an out-of-fold Spearman rank correlation of +0.514 between Edge Score and signed log realized PnL on the V1 cohort. A Brier-only baseline produces +0.147 on the same cohort.
Each pillar is z-score standardized against the V1 reference cohort. Posture rewards baseline-adjusted Brier (negated), Conviction rewards PnL concentration on the largest event, Discipline penalizes total resolved-position count (fewer larger bets score higher).
Full derivation, OLS fit, OOF folds, and every intermediate statistic is in the V1 paper.
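The construction above can be sketched in a few lines. This is an illustrative reimplementation, not the repository code: the pillar inputs and the `ref` dict of reference-cohort arrays are hypothetical shapes, while the three coefficients are the frozen values quoted above.

```python
import numpy as np

# Frozen V3b coefficients as committed before validation (from the V1 paper).
W_POSTURE, W_CONVICTION, W_DISCIPLINE = 0.7876, 2.7220, -1.1508

def zscore(x, ref):
    """Standardize x against the reference cohort's mean and std."""
    ref = np.asarray(ref, float)
    return (np.asarray(x, float) - ref.mean()) / ref.std()

def edge_score_v3b(posture, conviction, discipline, ref):
    """Three-pillar weighted sum on reference-cohort z-scores.

    `ref` is a dict of V1 reference-cohort arrays keyed by pillar name
    (a hypothetical structure for this sketch). Note the raw Discipline
    input is a position count, so its negative weight means fewer,
    larger bets score higher.
    """
    return (W_POSTURE    * zscore(posture,    ref["posture"])
          + W_CONVICTION * zscore(conviction, ref["conviction"])
          + W_DISCIPLINE * zscore(discipline, ref["discipline"]))
```

Because each pillar is standardized against the same frozen cohort, scores are comparable across wallets scored at different times.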
Frozen-coefficient vs externally pre-registered
V1 is internally pre-registered. The ex-ante methodology document and pass-fail thresholds were committed to git before the validation script ran. The commit history provides the timestamped audit trail. This is internal pre-registration, not external pre-registration with OSF, AsPredicted, or AEA.
V1.5 was externally pre-registered, ran on schedule, and is published. The two ex-ante validation experiments that require per-position outcome data (per-wallet temporal holdout E2, per-quarter Information Coefficient stability E7) were filed at AsPredicted #287368 on 2026-04-25 and ran on 2026-04-27. Both primary tests failed their ex-ante pass criteria. The full result is in the V1.5 section below and in the dedicated V1.5 paper.
V2.8.2 is externally pre-registered. AsPredicted #287436 (in-sample) and #287442 (forward-only), filed before any analysis code ran. Three frozen amendments to V2.8.2 are committed to the repository: one documenting the closure-caps choice, one substituting the V1-M reference cohort after the original two-hop closure plan exceeded available API capacity (filed at AsPredicted #287714), and the audit-trail amendment for the gate-script v1m_cohort_substitution code path. PRIMARY hypothesis test ran on the V1-M cohort 2026-04-27; result is in the V2.8.2 section below.
Validation results
- In-sample cross-wallet OOF Spearman: +0.514
- Calibration-only baseline Spearman: +0.147
- Fama-French (2010)-style bootstrap null with 10,000 permutations: the observed Spearman lies outside the entire permuted distribution, p < 0.0001
- Subgroup stability across six cuts: Spearman range +0.468 to +0.726
- Forward-validation rolling 30-day Spearman across 26 monthly windows on the V1-M cohort, 2025-10-01 → 2026-04-25: mean +0.391 (median +0.375). All 26 windows finished above the +0.147 Brier-only baseline. Worst window +0.299, best window +0.520. See the live forward-validation dashboard.
- V1.5 per-wallet temporal holdout (frozen V1 coefficients, N=805 paired wallets): Spearman = +0.111 (95% CI [+0.046, +0.175]). Pre-reg threshold ρ ≥ +0.30. Failed the pre-registered pass criterion. Positive and significant, but well below the threshold.
- V1.5 partial Spearman of V3b vs PnL controlling for log absolute capital (N=7,805): +0.494 (95% CI [+0.477, +0.512]). Stronger than the marginal +0.322; capital is a suppressor, not a confounder.
- Hill tail index alpha on realized PnL: 1.28 (95% CI 1.20-1.36; variance formally infinite, justifying rank-based rather than parametric inference)
- V1 cohort: 8,656 Polymarket wallets with at least 5 resolved positions
- V1-M Manifold cohort: 15,106 users with at least 25 resolved markets in the cross-venue extension
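The partial-correlation control reported above (V3b vs PnL holding capital fixed) can be sketched as a partial Spearman correlation: Pearson correlation of the rank residuals after regressing out the control's ranks. This is a generic implementation, not the V1.5 script; the function name and argument shapes are ours.

```python
import numpy as np
from scipy import stats

def partial_spearman(x, y, control):
    """Spearman correlation of x and y after partialling out `control`.

    Rank-transform all three series, OLS-residualize the x and y ranks
    on the control ranks, then take the Pearson correlation of the
    residuals -- the standard partial rank correlation.
    """
    rx, ry, rc = (stats.rankdata(v).astype(float) for v in (x, y, control))
    A = np.column_stack([np.ones_like(rc), rc])  # intercept + control ranks

    def residual(r):
        beta, *_ = np.linalg.lstsq(A, r, rcond=None)
        return r - A @ beta

    return float(np.corrcoef(residual(rx), residual(ry))[0, 1])
```

In the V1.5 setting the call would be roughly `partial_spearman(edge_score, signed_log_pnl, log_abs_capital)`; a partial correlation above the marginal one is the suppressor pattern described in the bullet.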
What V1.5 found (the deferred experiments)
V1.5 ran on 2026-04-27 per AsPredicted #287368. Both pre-registered primary tests failed their ex-ante thresholds. Five exploratory supplementary analyses produced a mixed picture of where V3b holds up.
- E2 per-wallet temporal holdout (frozen V1 coefficients, train 2024-01-01 → 2025-09-30, 14-day embargo, test 2025-10-15 → 2026-04-15, N=805 paired wallets): Spearman = +0.111 [+0.046, +0.175], p ≈ 0.001. Pre-reg pass criterion: ρ ≥ +0.30 AND CI lower > 0. Fail. Refitting V3b on the training fold produces ρ = -0.082, with the Posture coefficient flipping sign vs the frozen V1 fit.
- E7 per-quarter IC stability (pre-reg quarters Q2 2024 → Q2 2025): median per-quarter Spearman = +0.038. 3 of 5 quarters positive (need ≥5 of 6 per pre-reg). 2025Q1 was strongly negative (ρ = -0.164, p < 0.001, N=1,230). Fail. Per-wallet V3b ranking IS stable across quarters (lag-1 median +0.31, all 8 quarter pairs significant); the stable thing is behavior, not same-quarter alignment with PnL.
- S3 stratified by sample size: V3b predicts strongest at LOW volume and weakens at high volume. Bucket A (5-29 positions, N=3,195): ρ = +0.371. Bucket B (30-100, N=3,089): +0.150. Bucket C (101+, N=1,521): +0.013 (CI crosses zero). Counter to the naive expectation that more data produces a stronger signal. Mechanism: high-volume wallets are more likely market-makers / bots whose PnL is dominated by spread capture rather than directional skill.
- S6 persistent-wallet inversion: 229 wallets active in 4+ quarters show V3b inversely correlated with cumulative signed log PnL, ρ = -0.31 (95% CI [-0.42, -0.20]). Survivorship effect: long-running wallets that stick around include conservative-behavior patterns that protect against ruin without generating large PnL.
- S7 cross-category transfer: V3b refit on one Polymarket category transfers to others with ρ +0.17 to +0.33, comparable to within-category baselines (+0.23, +0.36, +0.27). V3b is more stable across categories than across time.
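E2's embargoed holdout design reduces to a pure date filter over the position tape. The dates below are the pre-registered ones; the `(wallet_id, resolved_date, pnl)` tuple shape is a hypothetical simplification of the actual tape.

```python
from datetime import date, timedelta

TRAIN_END  = date(2025, 9, 30)
EMBARGO    = timedelta(days=14)
TEST_START = date(2025, 10, 15)   # strictly after TRAIN_END + EMBARGO
TEST_END   = date(2026, 4, 15)

def split_positions(positions):
    """Split resolved positions into train/test folds with a 14-day embargo.

    `positions` is an iterable of (wallet_id, resolved_date, pnl) tuples
    (hypothetical shape). Positions resolving inside the embargo window,
    or after TEST_END, are dropped so no outcome leaks across the
    train/test boundary.
    """
    train, test = [], []
    for wallet, resolved, pnl in positions:
        if resolved <= TRAIN_END:
            train.append((wallet, resolved, pnl))
        elif TEST_START <= resolved <= TEST_END:
            test.append((wallet, resolved, pnl))
        # else: embargo gap or out of range -- excluded
    return train, test
```

Wallets are then "paired" by requiring enough resolved positions on both sides of the gap before computing the per-wallet holdout Spearman.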
Honest reframing: V3b's defensible claim moves from “forward predictor of PnL” to “cross-sectional ranker of wallet behavior whose temporal alignment with PnL is sample-size and time-window dependent.” Full V1.5 paper at /research/edge-score-methodology-v1-5.
What V2.8.2 found (the aggregator stack)
V2.8.2 PRIMARY hypothesis: does W-EXP β=4 + a=2.0 + S-RAW skill-weighted aggregation produce lower mean Brier loss than equal-weighted and market-implied baselines on the held-out window? Per AsPredicted #287436. Cohort substitution to V1-M per AsPredicted #287714. Ran 2026-04-27.
- PRIMARY result: on 6,256 held-out markets, the skill-weighted aggregator produced a Brier delta of +0.058 (95% CI [+0.054, +0.062]) vs equal-weighted baseline and +0.179 (95% CI [+0.164, +0.193]) vs market-implied baseline. Effect threshold per pre-reg was δ < -0.005. Direction reproduced in all 3 cross-fit folds and all 3 execution tiers. Does not reject H0. The aggregator is significantly worse than both baselines, not equivalent to them.
- 24-aggregator sweep: the V2.8.2 pre-registration listed 24 aggregator specifications (PRIMARY plus 23 others). NONE reject H0_market on the V1-M cohort. Best aggregator was W-MANSKI (Imbens-Manski midpoint), with δ_market = +0.057 [+0.049, +0.066]. W-MANSKI beats the equal-weighted baseline (δ_naive = -0.064, Sharpe +0.336) but does so by shrinking aggregates toward 0.5 (mechanically reducing Brier loss at the cost of being uninformative). PRIMARY reproducible bit-for-bit.
- Pre-data gates G2-G8 with real markets-universe data (re-run 2026-04-27 on 163,710 markets with 99.85% gamma metadata coverage): 5 of 9 gates pass. G1 closure 1.00 (satisfied by construction under the V1-M cohort substitution). G4 reachability 1.00. G5 staleness 0.9951. G8 powered categories satisfied. Failures: G2 regime stability 0.407 (vs 0.50 threshold), G3 endDate audit 0.9137 (vs 0.95), and a G2b per-quarter concentration scoping issue fixable in V2.9. G6 sigma 0.39 (vs 0.063 ceiling) reflects the PRIMARY aggregator's noise; the negative PRIMARY result is robust to that noise because the effect is large and wide-margin negative. G7 N_w power floor 1.98% reflects V1-M cohort thinness and is deferred to the V2.9 two-hop closure.
Substantive reading: PnL-skill on Polymarket is not a usable forecast-aggregation weight. The Edge Score V1 weighting (BSS_w as the per-wallet skill score), imported into a forecast-aggregation context, produces aggregates that are systematically over-extremized vs ground truth across the entire pre-registered aggregator family. This reproduces and reinforces the V1-M paper's null finding. Full V2.8.2 results at docs/research/marketalpha/v2.8.2_path_d_results_20260427_220721.md in the public repository; a publishable V2.8.2 paper page is in production.
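The δ comparisons above reduce to differences in mean Brier loss between an aggregated forecast and a baseline forecast on the same outcomes. The sketch below is illustrative, not the V2.8.2 harness: reading "W-EXP β=4" as weights proportional to exp(β·skill) is our assumption, and the array shapes are hypothetical.

```python
import numpy as np

def brier(p, y):
    """Mean Brier loss of probability forecasts p against outcomes y in {0, 1}."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.mean((p - y) ** 2))

def skill_weighted_aggregate(forecasts, skills, beta=4.0):
    """One reading of W-EXP beta=4: w_i proportional to exp(beta * skill_i).

    `forecasts` is (n_wallets, n_markets); returns the weighted mean
    forecast per market. Illustrative sketch, not the V2.8.2 code.
    """
    w = np.exp(beta * np.asarray(skills, float))
    w = w / w.sum()
    return w @ np.asarray(forecasts, float)

def brier_delta(agg_p, base_p, y):
    """delta < 0 means the aggregator beats the baseline (the pre-reg
    pass criterion was delta < -0.005 vs both baselines)."""
    return brier(agg_p, y) - brier(base_p, y)
```

With this framing, the V2.8.2 PRIMARY result is a positive δ against both the equal-weighted mean (`forecasts.mean(axis=0)`) and the market-implied price, i.e. the skill weighting made the aggregate worse.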
What the V1 paper does NOT claim
The +0.514 OOF Spearman is moderate, not deterministic. Reference points: IQ tests vs college GPA correlate roughly +0.45, SAT vs first-year GPA roughly +0.5, FICO vs default roughly +0.7-0.8. The Edge Score is a useful rank-ordering signal, not a guarantee of next-period PnL for any individual wallet.
Edge Score is venue-specific by design. The V1-M paper shows that pillar coefficients diverge categorically between Polymarket and Manifold, with the Discipline pillar flipping sign at permutation p = 0.0001. A “Polymarket Edge Score” of 87 is not directly comparable to a “Manifold Edge Score” of 87.
Edge Score does not separate skill from luck on a single wallet's realized PnL history. V1.5 E2 (per-wallet temporal holdout) tested this directly and produced Spearman = +0.111 on N=805 paired wallets, well below the pre-registered +0.30 threshold. Use Edge Score for cross-wallet ranking, not for single-wallet temporal prediction.
Acknowledged limitations
- Survivorship bias. The cohort describes differences among wallets that have ranked on the Polymarket profit leaderboard. Wallets that blew up are not in the cohort. The cohort therefore measures cross-sectional skill differences among survivors, not expected returns for a randomly drawn new trader. Per Peters (2019), ensemble averages on a survivor set do not transfer to the time-average experience of an individual trader. V1.5 adds a supplementary cohort of active-but-unranked wallets to measure the selection shift explicitly.
- V3b vs V1 selection on training cohort. V3b is the shipped composite because its first feature (baseline-adjusted Brier) is structurally harder to game than V1's raw Brier, not because V3b outperforms V1 in a held-out blinded test. The definitive test is a held-out cohort with deliberately injected market-selection gaming, planned as a V1.5 follow-up. The current preference is a principled methodological choice, not a validated empirical one.
- Concentration is size-independent by construction. The Conviction pillar uses share of realized PnL attributable to the wallet's largest single event, not absolute capital. A separate question (does Edge Score correlate with absolute wallet size?) is not formally tested in V1 and remains an open empirical item.
- Forward-validation rolling Spearman is now shipped. Mean +0.391 across 26 monthly windows on the V1-M cohort, 2025-10-01 → 2026-04-25; all 26 windows above the +0.147 baseline. The dashboard at /research/forward-validation is live and updates as new positions resolve.
- Half-life of predictive power tested in V1.5 S5. Median refit Spearman across 30/60/90/180-day cutoffs: -0.03, +0.01, +0.06, +0.11 (per-cutoff range [-0.148, +0.149]). The weakly positive trend toward longer horizons does not establish a conventional decay curve; the per-wallet temporal signal is small at every tested horizon and not a basis for forward prediction.
- Cross-category transfer tested in V1.5 S7. V3b refit on one Polymarket category transfers to others with Spearman +0.17 to +0.33, comparable to within-category baselines (+0.23, +0.36, +0.27). V3b is more stable across categories than across time.
- Persistent-wallet inversion is a new V1.5 finding. 229 wallets active in 4+ quarters show V3b inversely correlated with cumulative signed log PnL (ρ = -0.31, 95% CI [-0.42, -0.20]). Survivorship effect: long-running wallets that stick around include conservative-behavior patterns that protect against ruin without generating large PnL.
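The rolling-window forward-validation check described above can be sketched as a date-bucketed Spearman loop against the fixed baseline. The record shape and window stepping here are hypothetical simplifications; the live dashboard is the authoritative implementation.

```python
from datetime import date, timedelta
import numpy as np
from scipy import stats

BASELINE = 0.147  # Brier-only baseline Spearman from V1

def rolling_spearman(records, start, n_windows, window_days=30, step_days=30):
    """Per-window Spearman of Edge Score vs realized PnL.

    `records` is a list of (resolved_date, edge_score, pnl) tuples
    (hypothetical shape); each window is assumed to contain at least
    two resolved positions. Returns the per-window correlations and a
    flag for whether every window cleared the fixed baseline.
    """
    out = []
    for k in range(n_windows):
        lo = start + timedelta(days=k * step_days)
        hi = lo + timedelta(days=window_days)
        scores, pnls = zip(*[(s, p) for d, s, p in records if lo <= d < hi])
        out.append(float(stats.spearmanr(scores, pnls)[0]))
    return out, all(r > BASELINE for r in out)
```

The shipped dashboard reports exactly this kind of per-window statistic (mean +0.391, worst +0.299) rather than a single pooled correlation, which is what makes the "all 26 windows above baseline" claim checkable.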
Anticipated critiques and responses
The V1 paper includes a formal “Responses to anticipated critiques” section addressing six sophisticated objections an academic reviewer would raise: bootstrap-under-fat-tails, cross-venue invariance, V3b-vs-V1 circularity, concentration endogeneity, survivorship bias, and pre-registration semantics. Each response is reasoned and cites the relevant literature. Reviewers are welcome to engage with the responses directly.
Reproducibility
The full V1-M data bundle (542K position-aggregated trades across 8,998 Polymarket wallets, with anonymized addresses) is direct-download: v1m-data-bundle.tar.gz. The bundle is the canonical input for V1, V1-M, and V2.8.2 reproducibility. Anyone with the bundle and Python 3.11+ can rerun the methodology against the same data with no API budget.
Methodology code, validation scripts, and frozen amendment commits are public in the Convexly repository at github.com/RedGridTactical/Convexly.
Contact for methodology review
Academic reviewers, journalists with methodology questions, and institutional buyers running due diligence can email research@convexly.app. Methodology critique is welcome; a critique that surfaces an actual error is more valuable to Convexly than another endorsement.