Cohort PnL skill is real on Polymarket. Forecast aggregation skill is not.

Methodology IDs #287436 and #287442, plus the V1-M cohort substitution amendment #287714, are tracked in the public preregistration manifest; their external AsPredicted public URLs are currently marked receipt pending. The PRIMARY hypothesis fails to reject H0 vs the market-implied baseline, and the 24-aggregator sweep extends the result across the frozen family. Inference uses Sample-CSCV PBO and the Cameron-Gelbach-Miller 2011 cluster bootstrap, and the pipeline is deterministic across cross-fit folds and execution tiers.

Receipt status caveat

The MarketAlpha V2.8.2 methodology rows are preserved in Convexly's preregistration manifest, but the corresponding public AsPredicted URLs are not yet verified. Until those external receipt pages resolve, cite this as a frozen-methodology result rather than an externally verified preregistration.

The question

Does skill-weighted aggregation of Polymarket cohort beliefs beat market-implied probability and equal-weighted aggregation on resolved binary markets? Frozen hypothesis: the Atanasov 2017 / Mellers 2014 canonical aggregator W-EXP with β=4 + extremization a=2.0 + S-RAW skill metric will produce lower mean Brier loss than both naive baselines on the held-out window.
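As a rough illustration of the frozen hypothesis, the sketch below pools wallet beliefs with exponential skill weights and then extremizes the pooled probability. It assumes W-EXP means weights proportional to exp(β · skill) and that extremization is the transform p^a / (p^a + (1-p)^a); the function names, the toy beliefs, and the toy skill scores are hypothetical and are not taken from the frozen-commit code.

```python
import math

def exp_weights(skills, beta=4.0):
    """Weights proportional to exp(beta * skill), normalized to sum to 1.
    ASSUMPTION: this is one plausible reading of W-EXP, not the frozen spec."""
    m = max(skills)  # subtract the max before exponentiating for stability
    raw = [math.exp(beta * (s - m)) for s in skills]
    total = sum(raw)
    return [r / total for r in raw]

def extremize(p, a=2.0):
    """Push a probability away from 0.5: p^a / (p^a + (1-p)^a)."""
    num = p ** a
    return num / (num + (1.0 - p) ** a)

def w_exp_aggregate(beliefs, skills, beta=4.0, a=2.0):
    """Skill-weighted linear pool of beliefs, followed by extremization."""
    w = exp_weights(skills, beta)
    pooled = sum(wi * bi for wi, bi in zip(w, beliefs))
    return extremize(pooled, a)

# Hypothetical inputs: three wallets' implied beliefs and raw skill scores.
beliefs = [0.60, 0.70, 0.55]
skills = [0.10, 0.30, -0.05]
print(round(w_exp_aggregate(beliefs, skills), 4))
```

With a=2.0 any pooled probability above 0.5 is pushed further toward 1, which is why a skill weight that does not track accuracy can leave the aggregate over-extremized.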

Headline finding

On 6,256 held-out markets in the V1-M cohort, the PRIMARY skill-weighted aggregator produced a Brier delta of +0.058 (95% CI [+0.054, +0.062]) vs the equal-weighted baseline and +0.179 (95% CI [+0.164, +0.193]) vs the market-implied baseline. The frozen effect threshold to reject H0 was δ < -0.005. Both CIs lie entirely above zero, meaning the aggregator is significantly worse than the baselines, not equivalent to them. The direction reproduced in all 3 cross-fit folds and all 3 execution tiers (OPT / MOD / CONS). Sharpe = -0.247 across tiers. The PRIMARY hypothesis therefore fails to reject H0.
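To make the sign convention concrete, here is a minimal sketch of a Brier delta (aggregator minus baseline, so positive means worse) and of how a confidence interval could be read against the δ < -0.005 threshold. The decision rule that the whole CI must sit below the threshold is an assumption for illustration; the toy probabilities are invented, and only the two CIs passed to `classify` at the end come from the text above.

```python
def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def brier_delta(agg_probs, base_probs, outcomes):
    """Aggregator Brier minus baseline Brier; negative means the aggregator wins."""
    return brier(agg_probs, outcomes) - brier(base_probs, outcomes)

def classify(ci_low, ci_high, threshold=-0.005):
    """ASSUMED reading of the frozen rule: reject H0 only if the full CI is
    below the threshold; flag 'worse' if the full CI is above zero."""
    if ci_high < threshold:
        return "rejects H0"
    if ci_low > 0.0:
        return "worse than baseline"
    return "inconclusive"

# Toy data (hypothetical): four resolved markets.
outcomes = [1, 0, 1, 0]
market = [0.70, 0.30, 0.60, 0.40]
aggregate = [0.95, 0.60, 0.30, 0.05]
delta = brier_delta(aggregate, market, outcomes)
print(round(delta, 5))

# The two reported CIs from the headline finding.
print(classify(0.054, 0.062))
print(classify(0.164, 0.193))
```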

The 24-aggregator sweep across the entire frozen family extends the result: zero of 24 aggregator specifications reject H0 vs the market-implied baseline. The best aggregator was W-MANSKI (Imbens-Manski midpoint), which beats the equal-weighted baseline (δ_naive = -0.064, Sharpe +0.336) but does so by shrinking aggregates toward 0.5: a mechanical Brier-loss reduction at the cost of being uninformative. The PRIMARY result is reproducible bit-for-bit across runs (deterministic seeds at every random step).
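The "mechanical" Brier reduction from shrinking toward 0.5 can be seen in a toy case. The sketch below uses a generic linear shrinkage, not the Imbens-Manski midpoint itself, and the forecasts and outcomes are invented for illustration.

```python
def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def shrink(p, lam):
    """Linear shrinkage toward 0.5: lam=0 leaves p unchanged, lam=1 returns 0.5.
    ASSUMPTION: a stand-in for any shrinking aggregator, not W-MANSKI."""
    return 0.5 + (1.0 - lam) * (p - 0.5)

# Hypothetical over-extremized forecaster: says 0.95 on events that
# resolve YES only 70% of the time.
outcomes = [1] * 7 + [0] * 3
raw = [0.95] * 10
half = [shrink(p, 0.5) for p in raw]   # 0.725 everywhere
flat = [shrink(p, 1.0) for p in raw]   # 0.5 everywhere, zero information

print(round(brier(raw, outcomes), 6))
print(round(brier(half, outcomes), 6))
print(round(brier(flat, outcomes), 6))
```

In this toy case even the constant 0.5 forecast scores better than the raw one, so a lower Brier loss from shrinkage alone says nothing about whether the underlying skill weights carry information.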

Substantive reading: PnL-skill on Polymarket is not a usable forecast-aggregation weight. The Edge Score V1 weighting (BSS_w as the per-wallet skill score) imported into a forecast-aggregation context produces aggregates that are systematically over-extremized vs ground truth, across the entire frozen aggregator family. This reproduces and reinforces the V1-M paper's null finding that PnL-skill does not transfer cleanly into forecast accuracy.

Pre-data gates G1-G8 (real-data re-run)

The pre-data gates were re-run on 2026-04-27 against a real markets-universe CSV (163,710 markets, 99.85% gamma metadata coverage), replacing the synthetic smoke-input CSVs produced by the build_smoke_inputs.py stand-in. Results: 4 of 9 gates pass.

PASS (4): G1 closure 1.00 (V1-M cohort substitution by construction), G4 reachability 1.00, G5 staleness 0.9951, G8 powered categories satisfied (politics 5,938 + sports 50,319 zero-cohort markets).
FAIL just-below-threshold (3): G2 regime stability 0.407 vs 0.50 threshold (81% of threshold); G3 endDate audit 0.9137 vs 0.95 (96% of threshold; ~14k late-settling sub-second crypto-tick markets drag it down); G2b per-quarter regime concentration, a scoping issue fixable in V2.9 by trimming to the 2025-Q3+ window.
FAIL architectural (2): G6 sigma sensitivity 0.39 vs 0.063 ceiling (this reflects the PRIMARY aggregator's empirical noise; the negative PRIMARY result is wide-margin and survives the high noise); G7 N_w power floor 1.98%, which reflects V1-M cohort thinness per market (~1-2 cohort wallets per market) and is the V2.9 deferred two-hop closure target.

V1-M cohort substitution amendment

The original V2.8.2 manifest entry (#287436) specified a cohort built via PnL-decile seeds plus depth-2 graph closure from Polymarket activity. The FULL data engineering run on 2026-04-26 hit the 18h wall-clock cap before two-hop closure completed; pre-data gate G1 (closure ≥90%) correctly failed binding evaluation. Per the closure-caps amendment §3 remediation paths, the cohort substitution amendment (AsPredicted #287714, filed 2026-04-27) replaced the cohort with the V1-M reference cohort: 8,656 Polymarket wallets from the position tape snapshot 2026-04-25 20:18 UTC. The V1-M cohort satisfies G1 closure by construction (the cohort IS the boundary). The two-hop closure to a broader cohort is deferred to V2.9 pending multi-IP API capacity. All other pre-reg structure (the 11 aggregators / 3 cohort tracks / 3 execution tiers / 9 pre-data gates / 10 negative controls A-J / Cameron-Gelbach-Miller cluster bootstrap / Sample-CSCV PBO / Galwey M_e) is preserved unchanged.

Caveats (re-run before any new external claim)

Bootstrap N = 1,000 (pre-reg setting is 10,000). The 2026-04-27 run used 1,000 for fast turnaround. The direction is not expected to change at 10,000 (the CIs already exclude the equivalence band by an order of magnitude), but the pre-reg-compliant version requires the full bootstrap.
W-MARKET baseline uses climatology stand-in. The full W-MARKET fallback ladder per pre-reg §1.5 is not wired; the climatology stand-in approximates it for the V1-M cohort substitution. Documented in the orchestrator docstring step 6.
Test 4 (per-category) returned empty results. V1-M position tape lacks Polymarket category labels; category-stratified analysis would require enriching the tape with gamma category metadata. Coverage limitation, not methodology failure.
The G2, G2b, G3, G6, and G7 failures are documented above; the V2.8.2 PRIMARY conclusion stands because the PRIMARY aggregator's negative result is wide-margin and survives the noise (G6 specifically).

Methodology

11 aggregator families per pre-reg §0.2 binding list (W-EQ, W-VOL, W-LIN, W-EXP, W-TOPK, W-SHRINK, W-TRIM, W-LOGPOOL, W-MANSKI, W-IPW, W-RANDPOP), tested at 24 specifications total including PRIMARY plus the variant beta / extremization / K parameter values listed in §1.3 + Tier 2 §0.2. Tier 1 PRIMARY is W-EXP β=4 + a=2.0 + S-RAW (Atanasov 2017 canonical).
3 cohort tracks: PnL-decile primary, random N=10,000 baseline, cross-fit 2-fold × 3 fixed seeds.
3 execution tiers: OPT (zero slippage), MOD (Polymarket Kyle-lambda PRIMARY with Almgren-Chriss tau=0.5 fallback per pre-reg §1.5(e)), CONS (MOD × 2.0).
9 pre-data gates G1-G8 + G2b: G1 cohort closure ≥90%, G2 regime stability, G2b per-quarter regime concentration, G3 gamma.endDate audit, G4 subgraph reachability, G5 resolution staleness, G6 sigma sensitivity, G7 N_w power floor, G8 zero-cohort markets per category.
10 negative controls A-J with TOST equivalence margins per Lakens 2017 pre-frozen in pre-reg §0.6.
Cameron-Gelbach-Miller 2011 three-loop bootstrap V_2way = V_market + V_wallet - V_intersection (PRIMARY). Davezies-D'Haultfœuille-Guyonvarch 2021 single-multinomial with cluster-jackknife BCa (SECONDARY validation).
Sample-CSCV PBO (Bailey-Borwein-Lopez de Prado-Zhu 2014) at N=14,400 partitions × S=20 chronological submatrices with ≥7-day walk-forward gap. PBO ≤ 0.30 BINDING. Wilson 95% CI ±0.0075.
Galwey 2009 spectral M_e on the (S, K) score matrix with deflation factor max(K_total, M_e).
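Two of the quantities in the list above can be checked by hand. The sketch below computes a Wilson 95% half-width and the Galwey effective number of tests. It assumes the ±0.0075 figure is evaluated at the binding PBO of 0.30 with the N=14,400 partitions treated as independent trials, and it takes eigenvalues as given rather than deriving them from the (S, K) score matrix; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def wilson_half_width(p_hat, n, conf=0.95):
    """Half-width of the Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + conf / 2.0)
    spread = math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return z * spread / (1.0 + z * z / n)

def galwey_me(eigenvalues):
    """Galwey (2009) effective number of tests:
    (sum of sqrt(lambda))^2 / sum of lambda, with negative eigenvalues
    clipped to zero."""
    lam = [max(ev, 0.0) for ev in eigenvalues]
    return sum(math.sqrt(ev) for ev in lam) ** 2 / sum(lam)

# ASSUMPTION: evaluate at the binding threshold PBO = 0.30, N = 14,400.
print(round(wilson_half_width(0.30, 14_400), 4))

# Sanity checks on M_e with hand-picked eigenvalue spectra.
print(galwey_me([1.0, 1.0, 1.0, 1.0]))  # four uncorrelated scores
print(galwey_me([4.0, 0.0, 0.0, 0.0]))  # four perfectly correlated scores
```

Under these assumptions the half-width rounds to the ±0.0075 quoted for the PBO interval, and M_e moves between 1 (fully redundant specifications) and the number of specifications (fully independent ones).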

Available resources

Methodology receipt manifest: AsPredicted status page tracks #287436 (in-sample), #287442 (forward-only), and #287714 (cohort amendment). Their external public URLs are currently marked receipt pending; the code was frozen at commit a09533c on 2026-04-26.
V1-M public data bundle: /research/v1m/v1m-data-bundle.tar.gz (542K position-aggregated trades across 8,998 wallets).
Wallet-label pilot feed: daily versioned cross-venue dataset for institutions. Sales-led; email research@convexly.app for inquiry.

AI tooling disclosure

This work used AI tools (Claude, GPT-4) as research aids during methodology design and pre-publication review. All claims, statistical results, and figures are reproducible from the public data bundle and the frozen-commit code at commit a09533c. No claim on this page is taken as true on the basis of an AI tool's output; every quantitative result is recomputable from the bundle with the documented seed. External preregistration receipt status is tracked separately in the public manifest.

Citation

Convexly Research. (2026). MarketAlpha V2.8.2: Frozen-Methodology Audit of Polymarket Cohort Skill. https://www.convexly.app/research/marketalpha-v2

Contact + collaboration

Schedule a 30-minute technical call with the research desk to walk through methodology, discuss bespoke cohort construction, or request a pilot of the wallet-label feed. Email research@convexly.app. The PRIMARY verdict, the 24-aggregator sweep, and the G1-G8 gate results are fully public on this page; methodology and infrastructure are fully public in the repository.