Methodology · last updated 2026-05-08
What Convexly's methodology is, and what it isn't
This page summarizes the Edge Score methodology stack (V1, V1-M, V1.5, V2.8.2, forward-validation), the validation results against frozen coefficients, the explicit limitations, and what each follow-up actually found. The full ex-ante filing track lives at /research/preregistrations.
Methodology status 2026-05-08
V1.5 deferred experiments E2 (per-wallet temporal holdout) and E7 (per-quarter IC stability) ran on the V1-M position tape. Both primary tests failed their ex-ante thresholds. V2.8.2 (24-aggregator sweep on V1-M) finds no aggregator beats the market-implied baseline. The honest reframing: V3b is a cross-sectional skill ranker that holds up across in-sample OOF, forward-validation, and partial-correlation control for capital. It is not a per-wallet temporal predictor and it is not a forecast-aggregation weight. Full results in the V1.5 and V2.8.2 sections below.
Methodology track
Convexly publishes a full evidence track: methodology committed to the repository before any analysis runs; follow-up tests filed with their pass / fail thresholds before the data is touched; results published whether they pass or fail. The complete filing index, with verdict and audit-chain anchor for each, lives at /research/preregistrations. Failed tests appear in the negative-result registry; everything is auditable client-side at /research/verify.
Edge Score V3b composite
The composite is a frozen-coefficient three-pillar weighted sum. Coefficients (Posture +0.7876, Conviction +2.7220, Discipline -1.1508) were committed to a version-controlled repository before the validation script ran. Validation produced an out-of-fold Spearman rank correlation of +0.514 between Edge Score and signed log realized PnL on the V1 cohort. A Brier-only baseline produces +0.147 on the same cohort.
Each pillar is z-score standardized against the V1 reference cohort. Posture rewards baseline-adjusted Brier (negated), Conviction rewards PnL concentration on the largest event, Discipline penalizes total resolved-position count (fewer larger bets score higher).
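As a minimal sketch of the weighted sum described above: the coefficients are the frozen values quoted in the text, while the field names (`adj_brier`, `top_event_share`, `n_positions`), the use of a population standard deviation, and the untransformed position count are assumptions made for illustration. The sketch returns the raw linear composite and does not attempt any rescaling to a published score range.

```python
from statistics import mean, pstdev

# Frozen V3b coefficients as quoted in the text.
COEF = {"posture": 0.7876, "conviction": 2.7220, "discipline": -1.1508}

def zscore(value, reference):
    """Standardize one value against a reference cohort (population std assumed)."""
    return (value - mean(reference)) / pstdev(reference)

def edge_composite(wallet, cohort):
    """Weighted sum of z-scored pillars.

    wallet: dict of raw pillar inputs for one wallet (hypothetical field names).
    cohort: dict mapping the same field names to reference-cohort value lists.
    """
    z = {
        # Posture: baseline-adjusted Brier, negated so a lower Brier scores higher.
        "posture": zscore(-wallet["adj_brier"], [-b for b in cohort["adj_brier"]]),
        # Conviction: share of realized PnL from the largest single event.
        "conviction": zscore(wallet["top_event_share"], cohort["top_event_share"]),
        # Discipline: resolved-position count; the negative coefficient penalizes it.
        "discipline": zscore(wallet["n_positions"], cohort["n_positions"]),
    }
    return sum(COEF[name] * z[name] for name in COEF)

# Toy reference cohort (illustrative numbers only, not real data).
cohort = {
    "adj_brier": [0.1, 0.2, 0.3],
    "top_event_share": [0.2, 0.4, 0.6],
    "n_positions": [10, 20, 30],
}
average_wallet = {"adj_brier": 0.2, "top_event_share": 0.4, "n_positions": 20}
concentrated_wallet = {"adj_brier": 0.1, "top_event_share": 0.6, "n_positions": 10}
```

In this toy cohort, a wallet sitting at the cohort mean on every pillar gets a composite of zero, and a wallet with a lower adjusted Brier, a higher top-event share, and fewer positions gets a positive composite.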
Full derivation, OLS fit, OOF folds, and every intermediate statistic are in the V1 paper.
Frozen-coefficient vs ex-ante external filings
V1 is internally ex-ante. The methodology document and pass-fail thresholds were committed to git before the validation script ran. The commit history provides the timestamped audit trail.
V1.5 was filed externally, ran on schedule, and is published. The two ex-ante validation experiments that require per-position outcome data (per-wallet temporal holdout E2, per-quarter Information Coefficient stability E7) were filed externally on 2026-04-25 and ran on 2026-04-27. Both primary tests failed their ex-ante pass criteria. The full result is in the V1.5 section below and in the dedicated V1.5 paper.
V2.8.2 was filed externally. Both in-sample and forward-only filings landed before any analysis code ran. Three frozen amendments are committed to the repository: one documenting the closure-caps choice, one substituting the V1-M reference cohort after the original two-hop closure plan exceeded available API capacity, and the audit-trail amendment for the gate-script v1m_cohort_substitution code path. The primary hypothesis test ran on the V1-M cohort 2026-04-27; result is in the V2.8.2 section below.
Per-filing IDs, receipt status, and full ex-ante text live at /research/preregistrations.
Validation results
- In-sample cross-wallet OOF Spearman: +0.514
- Calibration-only baseline Spearman: +0.147
- Fama-French (2010) bootstrap null at 10,000 permutations: observed Spearman outside every permuted sample, p < 0.0001
- Subgroup stability across six cuts: Spearman range +0.468 to +0.726
- Forward-validation rolling 30-day Spearman across 26 monthly windows on the V1-M cohort, 2025-10-01 → 2026-04-25: mean +0.391 (median +0.375). All 26 windows finished above the +0.147 Brier-only baseline. Worst window +0.299, best window +0.520. See the live forward-validation dashboard.
- V1.5 per-wallet temporal holdout (frozen V1 coefficients, N=805 paired wallets): Spearman = +0.111 (95% CI [+0.046, +0.175]). Ex-ante threshold ρ ≥ +0.30. Failed the ex-ante pass criterion. Positive and significant, but well below the threshold.
- V1.5 partial Spearman of V3b vs PnL controlling for log absolute capital (N=7,805): +0.494 (95% CI [+0.477, +0.512]). Stronger than the marginal +0.322; capital is a suppressor, not a confounder.
- Hill tail index alpha on realized PnL: 1.28 (95% CI 1.20-1.36; variance formally infinite, justifying rank-based rather than parametric inference)
- V1 cohort: 8,656 Polymarket wallets with at least 5 resolved positions (frozen reference)
- V1.5 position tape: 8,778 wallets (V1 cohort plus a small set of wallets that entered the position tape after the V1 freeze; the V1.5 analyses use the larger superset). The full pre-bundle position tape contains 8,998 wallet rows; the additional 220 are filtered out by the V1.5 skill-window inclusion criterion (≥1 resolved position with `last_fill_ts` in the in-sample window). All three numbers ship in the public Polymarket data bundle.
- V1-M Manifold cohort: 15,106 users with at least 25 resolved markets in the cross-venue extension
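Two of the statistics in the list above, the permutation null and the partial Spearman controlling for capital, can be illustrated with a short stdlib-only sketch. This is a generic label-permutation test and the standard first-order partial-correlation formula applied to ranks; it is not the Fama-French bootstrap procedure itself, and the add-one p-value correction and all function names are choices made here rather than details taken from the V1 paper.

```python
import random
from statistics import mean

def ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def permutation_pvalue(score, pnl, n_perm=10_000, seed=0):
    """One-sided p-value: share of shuffles with Spearman >= the observed value."""
    rng = random.Random(seed)
    rs, rp = ranks(score), ranks(pnl)
    observed = pearson(rs, rp)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(rp)  # break the pairing between score and PnL
        if pearson(rs, rp) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction (an assumption here)

def partial_spearman(x, y, z):
    """Spearman of x and y after removing the rank-linear effect of z."""
    rxy, rxz, ryz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

# Toy suppressor example (illustrative ranks, not real data):
# z correlates positively with x and negatively with y.
x = [2, 3, 1, 5, 4]
y = [3, 4, 1, 5, 2]
z = [1, 2, 3, 4, 5]
marginal = spearman(x, y)            # about 0.70
partial = partial_spearman(x, y, z)  # about 0.95
```

The toy data shows the suppressor pattern in miniature: because the control variable pulls the two series in opposite directions, removing it raises the correlation instead of shrinking it, which is the direction of the effect reported for log absolute capital.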
What V1.5 found (the deferred experiments)
V1.5 ran on 2026-04-27 per its ex-ante external filing. Both primary tests failed their pre-committed thresholds. Five exploratory supplementary analyses produced a mixed picture of where V3b holds up.
- E2 per-wallet temporal holdout (frozen V1 coefficients, train 2024-01-01 → 2025-09-30, 14-day embargo, test 2025-10-15 → 2026-04-15, N=805 paired wallets): Spearman = +0.111 [+0.046, +0.175], p ≈ 0.001. Pre-reg pass criterion: ρ ≥ +0.30 AND CI lower > 0. Fail. Refitting V3b on the training fold produces ρ = -0.082, with the posture coefficient flipping sign vs the frozen V1 fit.
- E7 per-quarter IC stability (pre-reg quarters Q2 2024 → Q2 2025): median per-quarter Spearman = +0.038. 3 of 5 quarters positive (need ≥5 of 6 per pre-reg). 2025Q1 was strongly negative (ρ = -0.164, p < 0.001, N=1,230). Fail. Per-wallet V3b ranking IS stable across quarters (lag-1 median +0.31, all 8 quarter pairs significant); the stable thing is behavior, not same-quarter alignment with PnL.
- S3 stratified by sample size: V3b predicts most strongly at LOW volume and weakens at high volume. Bucket A (5-29 positions, N=3,195): ρ = +0.371. Bucket B (30-100, N=3,089): +0.150. Bucket C (101+, N=1,521): +0.013 (CI crosses zero). This runs counter to the naive expectation that more data produces a stronger signal. Mechanism: high-volume wallets are more likely market-makers / bots whose PnL is dominated by spread capture rather than directional skill.
- S6 persistent-wallet inversion: 229 wallets active in 4+ quarters show V3b inversely correlated with cumulative signed log PnL, ρ = -0.31 (95% CI [-0.42, -0.20]). Survivorship effect: long-running wallets that stick around include conservative-behavior patterns that protect against ruin without generating large PnL.
- S7 cross-category transfer: V3b refit on one Polymarket category transfers to others with ρ +0.17 to +0.33, comparable to within-category baselines (+0.23, +0.36, +0.27). V3b is more stable across categories than across time.
Honest reframing: V3b's defensible claim moves from “forward predictor of PnL” to cross-sectional ranker of wallet behavior whose temporal alignment with PnL is sample-size and time-window dependent. Full V1.5 paper at /research/edge-score-methodology-v1-5.
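The E2 split described above (training window, 14-day embargo, test window, paired wallets) can be sketched as follows. The window dates come from the text; the tuple layout, the inclusive boundaries, and the use of the last fill date as the splitting key are assumptions for illustration, not a description of the actual V1.5 script.

```python
from collections import defaultdict
from datetime import date, timedelta

TRAIN_START = date(2024, 1, 1)
TRAIN_END = date(2025, 9, 30)
EMBARGO = timedelta(days=14)
# First day after the embargo gap; works out to 2025-10-15 as in the text.
TEST_START = TRAIN_END + EMBARGO + timedelta(days=1)
TEST_END = date(2026, 4, 15)

def split_positions(positions):
    """Split resolved positions into per-wallet train and test lists.

    positions: iterable of (wallet, last_fill_date, pnl) tuples
               (hypothetical layout; boundaries treated as inclusive).
    Returns {wallet: (train_pnls, test_pnls)} for wallets present in both windows.
    """
    train, test = defaultdict(list), defaultdict(list)
    for wallet, fill_date, pnl in positions:
        if TRAIN_START <= fill_date <= TRAIN_END:
            train[wallet].append(pnl)
        elif TEST_START <= fill_date <= TEST_END:
            test[wallet].append(pnl)
        # Positions inside the embargo gap or outside both windows are dropped.
    paired = sorted(set(train) & set(test))
    return {w: (train[w], test[w]) for w in paired}

# Toy tape (illustrative rows only).
sample = [
    ("a", date(2025, 1, 5), 10.0),    # train
    ("a", date(2025, 10, 7), 99.0),   # inside the embargo, dropped
    ("a", date(2025, 11, 1), -2.0),   # test
    ("b", date(2024, 6, 1), 3.0),     # train only, so wallet b is not paired
]
paired = split_positions(sample)
```

Only wallets with at least one position on each side of the embargo survive the pairing step, which is why the paired sample (N=805) is much smaller than the full position tape.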
What V2.8.2 found (the aggregator stack)
V2.8.2 PRIMARY hypothesis: does W-EXP β=4 + a=2.0 + S-RAW skill-weighted aggregation produce lower mean Brier loss than equal-weighted and market-implied baselines on the held-out window? The in-sample filing and V1-M cohort-substitution amendment are tracked in the public receipt manifest. Ran 2026-04-27.
- PRIMARY result: on 6,256 held-out markets, the skill-weighted aggregator produced a Brier delta of +0.058 (95% CI [+0.054, +0.062]) vs equal-weighted baseline and +0.179 (95% CI [+0.164, +0.193]) vs market-implied baseline. Effect threshold per pre-reg was δ < -0.005. Direction reproduced in all 3 cross-fit folds and all 3 execution tiers. Does not reject H0. The aggregator is significantly worse than both baselines, not equivalent to them.
- 24-aggregator sweep: the V2.8.2 pre-registration listed 24 aggregator specifications (PRIMARY plus 23 others). NONE reject H0_market on the V1-M cohort. Best aggregator was W-MANSKI (Imbens-Manski midpoint), with δ_market = +0.057 [+0.049, +0.066]. W-MANSKI beats the equal-weighted baseline (δ_naive = -0.064, Sharpe +0.336) but does so by shrinking aggregates toward 0.5 (mechanically reducing Brier loss at the cost of being uninformative). The PRIMARY result is reproducible bit-for-bit.
- Pre-data gates G2-G8 with real markets-universe data (re-run 2026-04-27 on 163,710 markets with 99.85% gamma metadata coverage): 5 of 9 gates pass. G1 closure 1.00 (V1-M cohort substitution by construction). G4 reachability 1.00. G5 staleness 0.9951. G8 powered categories satisfied. G2 regime stability 0.407 (vs 0.50 threshold), G3 endDate audit 0.9137 (vs 0.95), G2b per-quarter concentration scoping fixable in V2.9. G6 sigma 0.39 (vs 0.063 ceiling) reflects the PRIMARY aggregator's noise; the negative PRIMARY result remains negative across the noise band (point estimate +0.179, 95% CI [+0.164, +0.193]; the lower bound stays well above the −0.005 effect threshold). G7 N_w power floor 1.98% reflects V1-M cohort thinness; deferred to V2.9 two-hop closure.
Substantive reading: PnL-skill on Polymarket is not a usable forecast-aggregation weight. The Edge Score V1 weighting (BSS_w as the per-wallet skill score) imported into a forecast-aggregation context produces aggregates that are systematically over-extremized vs ground truth, across the entire ex-ante aggregator family. This reproduces and reinforces the V1-M paper's null finding. Full V2.8.2 results at docs/research/marketalpha/v2.8.2_path_d_results_20260427_220721.md in the public repository; a publishable V2.8.2 paper page is in production.
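The Brier delta used as the PRIMARY test statistic can be illustrated with a small sketch. The exponential weight form (weight proportional to exp(β × skill)) is an assumption about what W-EXP denotes, the a=2.0 and S-RAW components are not modeled, and the function names and input layout are hypothetical. A positive delta means the skill-weighted aggregate has higher Brier loss than the equal-weighted baseline.

```python
from math import exp

def brier(p, outcome):
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (p - outcome) ** 2

def aggregate(forecasts, weights):
    """Weighted mean of probability forecasts."""
    return sum(w * p for w, p in zip(weights, forecasts)) / sum(weights)

def exp_weights(skills, beta=4.0):
    """Assumed W-EXP form: weight proportional to exp(beta * skill)."""
    top = max(skills)  # shift by the max for numerical stability
    return [exp(beta * (s - top)) for s in skills]

def brier_delta(markets, beta=4.0):
    """Mean Brier(skill-weighted) minus mean Brier(equal-weighted).

    markets: list of (forecasts, skills, outcome) triples (hypothetical layout).
    Negative values favor the skill-weighted aggregator.
    """
    deltas = []
    for forecasts, skills, outcome in markets:
        weighted = aggregate(forecasts, exp_weights(skills, beta))
        equal = aggregate(forecasts, [1.0] * len(forecasts))
        deltas.append(brier(weighted, outcome) - brier(equal, outcome))
    return sum(deltas) / len(deltas)

# Toy markets (illustrative only): the high-skill wallet forecasts 0.9.
confident_and_wrong = [([0.9, 0.5], [1.0, 0.0], 0)]
confident_and_right = [([0.9, 0.5], [1.0, 0.0], 1)]
```

In the toy data the same weighting helps when the heavily weighted wallet is right and hurts when it is wrong; a positive mean delta over many markets, as reported above, indicates that the second case dominates.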
What the V1 paper does NOT claim
The +0.514 OOF Spearman is moderate, not deterministic. Reference points: IQ tests vs college GPA correlate roughly +0.45, SAT vs first-year GPA roughly +0.5, FICO vs default roughly +0.7-0.8. The Edge Score is a useful rank-ordering signal, not a guarantee of next-period PnL for any individual wallet.
Edge Score is venue-specific by design. The V1-M paper shows that pillar coefficients diverge categorically between Polymarket and Manifold, with the Discipline pillar flipping sign at permutation p = 0.0001. A “Polymarket Edge Score” of 87 is not directly comparable to a “Manifold Edge Score” of 87.
Edge Score does not separate skill from luck on a single wallet's realized PnL history. V1.5 E2 (per-wallet temporal holdout) tested this directly and produced Spearman = +0.111 on N=805 paired wallets, well below the ex-ante +0.30 threshold. Use Edge Score for cross-wallet ranking, not for single-wallet temporal prediction.
Acknowledged limitations
- Survivorship bias. The cohort describes differences among wallets that have ranked on the Polymarket profit leaderboard. Wallets that blew up are not in the cohort. The cohort therefore measures cross-sectional skill differences among survivors, not expected returns for a randomly drawn new trader. Per Peters (2019), ensemble averages on a survivor set do not transfer to the time-average experience of an individual trader. V1.5 adds a supplementary cohort of active-but-unranked wallets to measure the selection shift explicitly.
- V3b vs V1 selection on training cohort. V3b is the shipped composite because its first feature (baseline-adjusted Brier) is structurally harder to game than V1's raw Brier, not because V3b out-beats V1 in a held-out blinded test. The definitive test is a held-out cohort with deliberately injected market-selection gaming, planned as a V1.5 follow-up. The current preference is a principled methodological choice, not a validated empirical one.
- Concentration is size-independent by construction. The Conviction pillar uses share of realized PnL attributable to the wallet's largest single event, not absolute capital. A separate question (does Edge Score correlate with absolute wallet size?) is not formally tested in V1 and remains an open empirical item.
- Forward-validation rolling Spearman is now shipped. Mean +0.391 across 26 monthly windows on the V1-M cohort 2025-10-01 → 2026-04-25; all 26 windows above the +0.147 baseline. Dashboard at /research/forward-validation. Live and updates as new positions resolve.
- Half-life of predictive power tested in V1.5 S5. Median refit Spearman across 30/60/90/180-day cutoffs: -0.03, +0.01, +0.06, +0.11 (per-cutoff range [-0.148, +0.149]). The weakly positive trend toward longer horizons does not establish a conventional decay curve; the per-wallet temporal signal is small at every tested horizon and not a basis for forward prediction.
- Cross-category transfer tested in V1.5 S7. V3b refit on one Polymarket category transfers to others with Spearman +0.17 to +0.33, comparable to within-category baselines (+0.23, +0.36, +0.27). V3b is more stable across categories than across time.
- Persistent-wallet inversion is a new V1.5 finding. 229 wallets active in 4+ quarters show V3b inversely correlated with cumulative signed log PnL (ρ = -0.31, 95% CI [-0.42, -0.20]). Survivorship effect: long-running wallets that stick around include conservative-behavior patterns that protect against ruin without generating large PnL.
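The size-independence point in the list above follows from the Conviction input being a ratio. A minimal sketch, assuming the share is computed on absolute per-event PnL (the exact treatment of losing events is not specified here, and the function name and input layout are hypothetical):

```python
from collections import defaultdict

def top_event_share(positions):
    """Share of a wallet's realized PnL magnitude from its largest single event.

    positions: list of (event_id, realized_pnl) pairs for one wallet.
    Uses absolute per-event PnL, which is an assumption of this sketch.
    """
    by_event = defaultdict(float)
    for event_id, pnl in positions:
        by_event[event_id] += pnl
    magnitudes = [abs(v) for v in by_event.values()]
    total = sum(magnitudes)
    return max(magnitudes) / total if total else 0.0

# Same betting pattern at two capital scales (illustrative numbers).
small_wallet = [("e1", 60.0), ("e2", 30.0), ("e3", -10.0)]
large_wallet = [(event, pnl * 1000) for event, pnl in small_wallet]
```

Both wallets get the same share because scaling every position by a constant cancels in the ratio. That is what "size-independent by construction" means here; it does not answer the separate empirical question of whether the composite correlates with wallet size.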
Anticipated critiques and responses
The V1 paper includes a formal Responses to anticipated critiques section addressing six sophisticated objections that an academic reviewer would raise: bootstrap-under-fat-tails, cross-venue invariance, V3b-vs-V1 circularity, concentration endogeneity, survivorship bias, and pre-registration semantics. Each response is reasoned and cites the relevant literature. Reviewers are welcome to engage with the responses directly.
Reproducibility
The full V1-M data bundle (542K position-aggregated trades across 8,998 Polymarket wallets, with anonymized addresses) is available as a direct download: v1m-data-bundle.tar.gz. The bundle is the canonical input for V1, V1-M, and V2.8.2 reproducibility. Anyone with the bundle and Python 3.11+ can rerun the methodology against the same data with no API budget.
Methodology code, validation scripts, and frozen amendment commits are public in the Convexly repository at github.com/RedGridTactical/Convexly.
Contact for methodology review
Academic reviewers, journalists with methodology questions, and institutional buyers running due diligence can email research@convexly.app. Methodology critique is welcome; a critique that surfaces an actual error is more valuable to Convexly than another endorsement.