MarketAlpha V2.8.2 / Pre-Registered Audit
Cohort PnL skill is real on Polymarket. Forecast aggregation skill is not.
Pre-registered at AsPredicted #287436 + #287442 + V1-M cohort substitution amendment #287714. PRIMARY hypothesis fails to reject H0 vs market-implied baseline; 24-aggregator sweep extends the result across the entire pre-registered family. Methods: Sample-CSCV PBO and Cameron-Gelbach-Miller 2011 cluster bootstrap; results are deterministic, and the direction holds across cross-fit folds and execution tiers.
The question
Does skill-weighted aggregation of Polymarket cohort beliefs beat market-implied probability and equal-weighted aggregation on resolved binary markets? Pre-registered hypothesis: the Atanasov 2017 / Mellers 2014 canonical aggregator W-EXP with β=4 + extremization a=2.0 + S-RAW skill metric will produce lower mean Brier loss than both naive baselines on the held-out window.
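As a rough illustration of what the PRIMARY specification computes, here is a minimal sketch of exponential skill weighting followed by extremization. The functional forms (weights proportional to exp(β · skill), extremization as p^a / (p^a + (1-p)^a)) are the common textbook versions and are an assumption here, not taken from the pre-registration; the beliefs, skill scores, and function names are hypothetical.

```python
import math

def w_exp_weights(skills, beta=4.0):
    """Exponential skill weights: w_i proportional to exp(beta * skill_i)."""
    top = max(skills)  # subtract the max before exponentiating, for stability
    raw = [math.exp(beta * (s - top)) for s in skills]
    total = sum(raw)
    return [r / total for r in raw]

def extremize(p, a=2.0):
    """Push p away from 0.5 using p^a / (p^a + (1 - p)^a)."""
    num = p ** a
    return num / (num + (1.0 - p) ** a)

def aggregate(beliefs, skills, beta=4.0, a=2.0):
    """Skill-weighted linear pool of beliefs, then extremization."""
    weights = w_exp_weights(skills, beta)
    pooled = sum(w * p for w, p in zip(weights, beliefs))
    return extremize(pooled, a)

def brier(p, outcome):
    """Brier loss of a probability p against a binary outcome (0 or 1)."""
    return (p - outcome) ** 2

# Hypothetical single market: three wallets, their beliefs and skill scores.
beliefs = [0.60, 0.70, 0.55]
skills = [0.10, 0.30, -0.05]
outcome = 1

p_skill = aggregate(beliefs, skills)
p_equal = sum(beliefs) / len(beliefs)
# Sign convention assumed throughout: delta = aggregator loss - baseline loss,
# so a negative delta means the aggregator is better.
delta = brier(p_skill, outcome) - brier(p_equal, outcome)
```

In this toy market the extremized aggregate happens to land closer to the outcome, so delta is negative; the audit's finding is that, averaged over the held-out markets, the sign goes the other way.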
Headline finding
On 6,256 held-out markets in the V1-M cohort, the PRIMARY skill-weighted aggregator produced a Brier delta of +0.058 (95% CI [+0.054, +0.062]) vs the equal-weighted baseline and +0.179 (95% CI [+0.164, +0.193]) vs the market-implied baseline. The pre-registered threshold for rejecting H0 was δ < -0.005. Both CIs lie entirely above zero, so the aggregator is significantly worse than both baselines, not merely equivalent to them. The direction reproduced in all 3 cross-fit folds and all 3 execution tiers (OPT / MOD / CONS). Sharpe = -0.247 across tiers. The PRIMARY hypothesis does not reject H0.
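The reading of those intervals can be written down as a small decision rule. This is a sketch under one assumption: that "reject H0" means the upper CI bound falls below the pre-registered threshold of -0.005. The function name and verdict strings are illustrative, not part of the pre-registration.

```python
def classify(ci_low, ci_high, threshold=-0.005):
    """Classify a Brier-delta CI (aggregator minus baseline; negative is better)."""
    if ci_high < threshold:
        # Assumed rule: the whole interval must clear the effect threshold.
        return "reject H0: aggregator better by at least the threshold"
    if ci_low > 0.0:
        # Whole interval above zero: the aggregator loses to the baseline.
        return "fail to reject H0: aggregator significantly worse"
    return "fail to reject H0: inconclusive"

# The two intervals reported above.
reported = {
    "vs equal-weighted": (0.054, 0.062),
    "vs market-implied": (0.164, 0.193),
}
verdicts = {name: classify(lo, hi) for name, (lo, hi) in reported.items()}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

Both reported intervals fall in the "significantly worse" branch, which is a stronger statement than an inconclusive null.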
The 24-aggregator sweep across the entire pre-registered family extends the result: zero of 24 pre-registered aggregator specifications reject H0 vs the market-implied baseline. The best aggregator was W-MANSKI (Imbens-Manski midpoint), which beats the equal-weighted baseline (δ_naive = -0.064, Sharpe +0.336) but does so by shrinking aggregates toward 0.5: a mechanical Brier-loss reduction bought at the cost of being uninformative. The PRIMARY result is reproducible bit-for-bit across runs (deterministic seeds at every random step).
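Why shrinking toward 0.5 lowers Brier loss mechanically can be seen in a small worked example. The numbers below are hypothetical (a true event probability of 0.75 and an over-extremized aggregate of 0.95), and the linear shrinkage is a generic stand-in, not the W-MANSKI construction itself.

```python
def expected_brier(p, q):
    """Expected Brier loss of forecast p when the event occurs with probability q."""
    return q * (1.0 - p) ** 2 + (1.0 - q) * p ** 2

def shrink(p, lam):
    """Move p toward 0.5; lam=1 leaves p unchanged, lam=0 returns exactly 0.5."""
    return 0.5 + lam * (p - 0.5)

q = 0.75       # hypothetical true event probability
p_over = 0.95  # hypothetical over-extremized aggregate

# Expected loss with no shrinkage, half shrinkage, and full shrinkage.
losses = {lam: expected_brier(shrink(p_over, lam), q) for lam in (1.0, 0.5, 0.0)}
for lam, loss in losses.items():
    print(f"lam={lam}: forecast={shrink(p_over, lam):.3f}, expected Brier={loss:.6f}")
```

Half shrinkage cuts the expected loss from 0.2275 to about 0.1881 without using any information about the market, while shrinking all the way to 0.5 gives 0.25 on every market regardless of its content. A Brier gain obtained this way says little about forecasting skill.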
Substantive reading: PnL-skill on Polymarket is not a usable forecast-aggregation weight. The Edge Score V1 weighting (BSS_w as the per-wallet skill score) imported into a forecast-aggregation context produces aggregates that are systematically over-extremized vs ground truth, across the entire pre-registered aggregator family. This reproduces and reinforces the V1-M paper's null finding that PnL-skill does not transfer cleanly into forecast accuracy.
Pre-data gates G1-G8 (real-data re-run)
The pre-data gates were re-run 2026-04-27 against a real markets-universe CSV (163,710 markets, 99.85% gamma metadata coverage), replacing the synthetic smoke-input CSVs from the build_smoke_inputs.py stand-in. Results: 4 of 9 gates pass.
- PASS (4): G1 closure 1.00 (V1-M cohort substitution by construction), G4 reachability 1.00, G5 staleness 0.9951, G8 powered categories satisfied (politics 5,938 + sports 50,319 zero-cohort markets).
- FAIL just-below-threshold (3): G2 regime stability 0.407 vs 0.50 threshold (81% of threshold), G3 endDate audit 0.9137 vs 0.95 (96% of threshold; ~14k late-settling sub-second crypto-tick markets drag it down), G2b per-quarter regime concentration (a scoping issue, fixable in V2.9 by trimming to the 2025-Q3+ window).
- FAIL architectural (2): G6 sigma sensitivity 0.39 vs 0.063 ceiling (the PRIMARY aggregator's empirical noise; the negative PRIMARY result is wide-margin and survives the high noise), G7 N_w power floor 1.98% reflects V1-M cohort thinness per market (~1-2 cohort wallets per market) and is the V2.9 deferred two-hop closure target.
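The "percent of threshold" figures in the list above can be recomputed with a short sketch. Only gates whose observed value and threshold both appear in this page are included; the "min" / "max" direction labels and the function name are illustrative assumptions.

```python
# name: (observed, threshold, direction)
# "min" means the observed value must be at least the threshold,
# "max" means it must not exceed the ceiling.
gates = {
    "G1 closure": (1.00, 0.90, "min"),
    "G2 regime stability": (0.407, 0.50, "min"),
    "G3 endDate audit": (0.9137, 0.95, "min"),
    "G6 sigma sensitivity": (0.39, 0.063, "max"),
}

def evaluate(observed, threshold, direction):
    """Return (passed, observed / threshold) for one gate."""
    if direction == "min":
        passed = observed >= threshold
    else:
        passed = observed <= threshold
    return passed, observed / threshold

report = {name: evaluate(*spec) for name, spec in gates.items()}
for name, (passed, ratio) in report.items():
    status = "PASS" if passed else "FAIL"
    print(f"{name}: {status} ({ratio:.0%} of threshold)")
```

G2 comes out at about 81% of its threshold and G3 at about 96%, matching the figures quoted, while G6 sits at roughly six times its ceiling, which is why it is listed as architectural and not as a near miss.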
V1-M cohort substitution amendment
The original V2.8.2 pre-registration (#287436) specified a cohort built via PnL-decile seeds plus depth-2 graph closure from Polymarket activity. The FULL data engineering run on 2026-04-26 hit the 18h wall-clock cap before two-hop closure completed; pre-data gate G1 (closure ≥90%) correctly failed binding evaluation. Per the closure-caps amendment §3 remediation paths, the cohort substitution amendment (AsPredicted #287714, filed 2026-04-27) replaced the cohort with the V1-M reference cohort: 8,656 Polymarket wallets from the position tape snapshot 2026-04-25 20:18 UTC. The V1-M cohort satisfies G1 closure by construction (the cohort IS the boundary). The two-hop closure to a broader cohort is deferred to V2.9 pending multi-IP API capacity. All other pre-reg structure (the 11 aggregators / 3 cohort tracks / 3 execution tiers / 9 pre-data gates / 10 negative controls A-J / Cameron-Gelbach-Miller cluster bootstrap / Sample-CSCV PBO / Galwey M_e) is preserved unchanged.
Caveats (re-run before any new external claim)
- Bootstrap N = 1,000 (pre-reg setting is 10,000). The 2026-04-27 run used 1,000 for fast turnaround. The direction is not expected to change at 10,000 (the CIs already sit an order of magnitude away from the equivalence band), but the pre-reg-compliant version requires the full bootstrap.
- W-MARKET baseline uses climatology stand-in. The full W-MARKET fallback ladder per pre-reg §1.5 is not wired; the climatology stand-in approximates it for the V1-M cohort substitution. Documented in the orchestrator docstring step 6.
- Test 4 (per-category) returned empty results. V1-M position tape lacks Polymarket category labels; category-stratified analysis would require enriching the tape with gamma category metadata. Coverage limitation, not methodology failure.
- The gate failures (G2, G2b, G3, G6, G7) are documented above; the V2.8.2 PRIMARY conclusion stands because the PRIMARY aggregator's negative result is wide-margin and survives the noise (G6 specifically).
Methodology
- 11 aggregator families per pre-reg §0.2 binding list (W-EQ, W-VOL, W-LIN, W-EXP, W-TOPK, W-SHRINK, W-TRIM, W-LOGPOOL, W-MANSKI, W-IPW, W-RANDPOP), tested at 24 specifications total including PRIMARY plus the variant beta / extremization / K parameter values listed in §1.3 + Tier 2 §0.2. Tier 1 PRIMARY is W-EXP β=4 + a=2.0 + S-RAW (Atanasov 2017 canonical).
- 3 cohort tracks: PnL-decile primary, random N=10,000 baseline, cross-fit 2-fold × 3 fixed seeds.
- 3 execution tiers: OPT (zero slippage), MOD (Polymarket Kyle-lambda PRIMARY with Almgren-Chriss tau=0.5 fallback per pre-reg §1.5(e)), CONS (MOD × 2.0).
- 9 pre-data gates G1-G8 + G2b: cohort closure ≥90%, regime stability (plus per-quarter regime concentration, G2b), gamma.endDate audit, subgraph reachability, resolution staleness, sigma sensitivity, N_w power floor, zero-cohort markets per category.
- 10 negative controls A-J with TOST equivalence margins per Lakens 2017 pre-frozen in pre-reg §0.6.
- Cameron-Gelbach-Miller 2011 three-loop bootstrap V_2way = V_market + V_wallet - V_intersection (PRIMARY). Davezies-D'Haultfœuille-Guyonvarch 2021 single-multinomial with cluster-jackknife BCa (SECONDARY validation).
- Sample-CSCV PBO (Bailey-Borwein-Lopez de Prado-Zhu 2014) at N=14,400 partitions × S=20 chronological submatrices with ≥7-day walk-forward gap. PBO ≤ 0.30 BINDING. Wilson 95% CI ±0.0075.
- Galwey 2009 spectral M_e on the (S, K) score matrix with deflation factor max(K_total, M_e).
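Two of the quantities in the list above can be sketched in a few lines: the two-way variance combination V_2way = V_market + V_wallet - V_intersection, and the Wilson half-width quoted for PBO. This is a simplified illustration, not the audit's implementation: the row layout, function names, and synthetic losses are hypothetical, and each component is estimated with a plain cluster bootstrap of the mean loss.

```python
import math
import random
from statistics import mean, variance

def cluster_bootstrap_var(rows, key, n_boot=200, seed=0):
    """Variance of the mean loss when whole clusters are resampled with replacement."""
    rng = random.Random(seed)
    clusters = {}
    for row in rows:
        clusters.setdefault(key(row), []).append(row["loss"])
    ids = sorted(clusters)
    boot_means = []
    for _ in range(n_boot):
        drawn = rng.choices(ids, k=len(ids))  # resample cluster ids
        boot_means.append(mean([x for cid in drawn for x in clusters[cid]]))
    return variance(boot_means)

def two_way_var(rows, n_boot=200, seed=0):
    """Combine three one-way cluster variances into the two-way estimate."""
    v_market = cluster_bootstrap_var(rows, lambda r: r["market"], n_boot, seed)
    v_wallet = cluster_bootstrap_var(rows, lambda r: r["wallet"], n_boot, seed + 1)
    v_inter = cluster_bootstrap_var(
        rows, lambda r: (r["market"], r["wallet"]), n_boot, seed + 2
    )
    # The combination can come out negative in small samples; not handled here.
    return v_market + v_wallet - v_inter, (v_market, v_wallet, v_inter)

def wilson_half_width(p_hat, n, z=1.96):
    """Half-width of the Wilson score interval for a binomial proportion."""
    inner = p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    return z * math.sqrt(inner) / (1.0 + z * z / n)

# Synthetic per-(market, wallet) losses, for illustration only.
data_rng = random.Random(42)
rows = [
    {"market": f"m{i}", "wallet": f"w{j}", "loss": data_rng.random()}
    for i in range(6)
    for j in range(4)
]
v_2way, (v_m, v_w, v_i) = two_way_var(rows)
half_width = wilson_half_width(0.30, 14_400)
```

The ±0.0075 figure is reproduced when the Wilson interval is evaluated at the binding value PBO = 0.30 with n = 14,400; that this is how the quoted figure was derived is an assumption.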
Cross-examination protocol
Three-phase pre-publication audit. Phase 1: automated checks of pre-registration adherence, the CI-with-every-point-estimate rule, sub-sample disclosure, and credibility-loaded-term grep. Phase 2: five independent statistical reviews spawned in parallel (quant-style, Lopez de Prado AFML, Tetlock and Atanasov forecasting standard, internal consistency, hostile peer reviewer). Phase 3: synthesis aggregates the findings; the critical count must be zero before publication.
The pre-publication cross-exam has already been executed. Phase 1 PASSES on the current paper draft (sensitivity intact: it still catches injected violations). Phase 2 surfaced 10 CRITICAL methodology issues, all fixed in-session before FULL data engineering kicked off (including PnL temporal-leakage 27% propensity-gap inflation, Kyle-lambda R² < 0.005 collapsing 3 execution tiers to 2, Galwey M_e undercount of effective-K, BSS_w upstream validation gap, walk-forward gap not implemented).
Available resources
- Pre-registration: AsPredicted #287436 (in-sample) + #287442 (forward-only). Both frozen at commit a09533c on 2026-04-26.
- V1-M public data bundle: /research/v1m/v1m-data-bundle.tar.gz (542K position-aggregated trades across 8,998 wallets).
- Cross-exam reports + synthesis: 5 lens reports + Phase 3 synthesis available on request to research@convexly.app.
- Wallet-label pilot feed: daily versioned cross-venue dataset for institutions. Sales-led; email research@convexly.app for inquiry.
Citation
Convexly Research. (2026). MarketAlpha V2.8.2: Pre-Registered Audit of Polymarket Cohort Skill. AsPredicted #287436. https://www.convexly.app/research/marketalpha-v2
Contact + collaboration
Schedule a 30-minute technical call with the research desk to walk through methodology, discuss bespoke cohort construction, or request a pilot of the wallet-label feed. Email research@convexly.app. The PRIMARY verdict + 24-aggregator sweep + G2-G8 results are fully public on this page; methodology and infrastructure are fully public in the repository.