Methods & Evidence

What Convexly can safely claim today.

This page separates usable evidence from preview work. Start here when you want to know what is live, what is still experimental, what failed, and which receipt backs a product claim.

Receipt status: the V1.5 AsPredicted receipt is externally verified. Several newer methodology entries are filed in Convexly's manifest but remain marked 'public receipt not yet verified' until the verifier can match the AsPredicted ID, title, and filing date. V1 and V1-M are not retroactively pre-registered; treat them as frozen-coefficient, version-controlled methodology papers with reproducible data bundles, not as external preregistration receipts. View the live receipt manifest.

Live

Wallet diagnostics

Frozen-coefficient wallet diagnostics and public analyzer outputs are mature enough to use as behavior evidence, not as profit forecasts.

Preview

Market Trust packets

Market Trust is useful for review today, but v1 promotion waits on resolved-outcome history, source freshness, and the frozen validation contract.

Experimental

Related-market checks

Related-market checks can flag prices that deserve review. Performance claims wait on settled outcomes and matched controls.

Not yet claimable

Alpha and enterprise-SLA claims

Convexly should not claim alpha, full coverage, or enterprise-grade operations until external review and outcome evidence support it.

July 9, 2026
Profit Is Not Proof: we ran the luck filter on our own Polymarket leaderboard during the World Cup
A false-discovery skill-vs-luck separation on resolved Polymarket positions, run during the highest-volume month prediction markets have had. Platform-wide, 178 of 3,871 wallets with at least 30 resolved discretionary positions clear Benjamini-Hochberg control at q = 0.10 for positive realized edge over entry prices, about one in twenty-two, in sample. Pointed at Convexly's own published top-50 leaderboard, the same test clears zero of the 35 readable wallets. Skill is findable in the settled record; profit leaderboards are the wrong instrument for finding it.
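For readers who want to see what clearing Benjamini-Hochberg control at q = 0.10 means mechanically, here is a minimal sketch of the standard step-up procedure. The function name and the five p-values are hypothetical; how the per-wallet p-values are actually computed is not specified on this page.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return one boolean per hypothesis: True if rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical one-sided p-values for five wallets (illustration only).
example_p = [0.001, 0.30, 0.04, 0.012, 0.75]
flags = benjamini_hochberg(example_p, q=0.10)
print(flags)  # [True, False, True, True, False]
```

The step-up rule rejects everything up to the largest qualifying rank, so a wallet can clear the bar even when its own p-value exceeds a naive Bonferroni threshold.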
June 26, 2026
The Luck Share: how much of our top-50 cohort cannot be separated from chance
A recurring, citable index: the share of our own published top-50 Polymarket cohort whose realized record cannot be separated from chance at the frozen, FDR-corrected bar. Latest reading 100 percent, with 0 of the 35 testable wallets clearing the bar (15 of 50 had too few resolved positions to test); the 95 percent upper bound on the clear-rate is about 9 percent. Sourced verbatim from the top-50 skill scan. Describes behavior, never an endorsement.
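An upper bound on a rate after observing zero successes can be sketched in two common ways. This page does not say which method produced the 'about 9 percent' figure, so both functions below are assumptions offered for orientation: the rule-of-three approximation and the exact one-sided binomial bound.

```python
def rule_of_three_upper(n):
    """Approximate 95% upper bound on a rate after 0 successes in n trials."""
    return 3.0 / n


def exact_upper_zero_successes(n, confidence=0.95):
    """Exact one-sided upper bound: solve (1 - p)**n = 1 - confidence for p."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


n_testable = 35  # testable wallets, from the entry above
approx_bound = rule_of_three_upper(n_testable)
exact_bound = exact_upper_zero_successes(n_testable)
print(f"rule of three: {approx_bound:.3f}, exact one-sided: {exact_bound:.3f}")
```

With 35 testable wallets the rule of three gives 3/35, roughly 8.6 percent, and the exact one-sided bound lands slightly lower.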
June 26, 2026
The prediction-market skill lexicon
A fixed public vocabulary for reading prediction-market wallet quality, each term defined once as a statistical property with its threshold: realized entry edge, FDR-cleared, the skill-vs-luck verdict states (skilled, not separable from chance, insufficient, plus a reserved integrity flag), the concentrated style label, and Edge Score. Statistical properties, never endorsements.
May 19, 2026
Market Trust v1 promotion contract: evidence gates from v0.1 and v0.2 lessons
Defines the evidence contract that blocks Market Trust v1 language until v0.1 static-card lessons and v0.2 canary-packet lessons are backed by coverage breadth, clean source-health windows, measured liquidity/depth, measured participant-quality evidence, resolved or human-reviewed outcomes, and an explicit formula freeze.
May 13, 2026
Hantavirus cluster: cross-market mispricing scan on five linked Polymarket events
Ad-hoc cross-market belief-decomposition scan of five linked Polymarket hantavirus events (US case by May 15, PHEIC by June 30, lab leak by June 30, pandemic in 2026, FDA vaccine in 2026), all outside the daily CME pipeline universe. One MEDIUM flag: implied P(pandemic | PHEIC by year-end) ~38-55% vs historical 2 of 8 PHEICs (point 25%, 95% Clopper-Pearson 4-65%). The short-horizon 'US case by May 15' at 27.5% sits ~5-15 pp above a 48-hour Poisson baseline of ~15-20% (P(≥1) = 1 − e^(−λ·t), λ = 0.082-0.110/day, t = 2 days); the gap is within thin-liquidity friction at $14.7K. An earlier draft used the annual base rate and overclaimed a ~68-72 pp gap; corrected inline.
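The 48-hour Poisson baseline quoted above follows directly from P(≥1) = 1 − e^(−λ·t). A short sketch of the arithmetic, using the λ range and horizon given in the entry (the function and variable names are illustrative):

```python
import math


def prob_at_least_one(rate_per_day, days):
    """P(at least one event) for a Poisson process with a constant daily rate."""
    return 1.0 - math.exp(-rate_per_day * days)


horizon_days = 2  # the 48-hour window
baseline_low = prob_at_least_one(0.082, horizon_days)   # roughly 0.15
baseline_high = prob_at_least_one(0.110, horizon_days)  # roughly 0.20

market_price = 0.275  # 'US case by May 15' price quoted in the entry
gap_low_pp = (market_price - baseline_high) * 100
gap_high_pp = (market_price - baseline_low) * 100
print(f"baseline {baseline_low:.3f}-{baseline_high:.3f}, "
      f"gap {gap_low_pp:.1f}-{gap_high_pp:.1f} pp")
```

The key design point is matching the horizon: plugging in an annual window instead of t = 2 days inflates the baseline-to-price gap, which is the overclaim the entry says was corrected.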
May 9, 2026
Macro Coverage + Coherence v0.2: source registry, mapping queue, and daily bundle outputs
A method freeze and live substrate for macro prediction-market coverage. Adds official benchmark-source registry rows, reviewed macro equivalence keys, cross-venue mapping debt, daily Macro CME bundle-readiness rows, and macro-specific Market Trust caveats without claiming complete coverage, venue-approved mappings, tradable expected value, or macro forecasting skill.
May 9, 2026
Macro Market Trust v0.1: prediction markets as macro infrastructure
A focused taxonomy and live ledger-mapping surface for macro prediction markets across inflation, labor, rates, growth, volatility, policy/fiscal, and crypto-macro. Defines macro-specific Market Trust fields, candidate CME bundles, data-rights caveats, and low-hype public framing without claiming complete coverage or tradeable alpha.
May 8, 2026
Cross-venue consensus probability PoC: co-listed markets with visible tolerances
An inspectable proof page for venue-neutral consensus probability. Reads venue_markets and venue_market_price_snapshots, requires either an explicit cross_venue_key or conservative normalized question match, then publishes spread-weighted probability, venue dispersion, match basis, caveats, and row hash. This is a diagnostic substrate for Market Trust and CME, not a skill-weighted alpha claim.
May 4, 2026
V3b over V1: model selection under market-selection bias
Why the V3b Edge Score variant is preferred over V1 when the training cohort is selected on the outcome under study. Fold-local refit and a Fama-French bootstrap null make V3b and V1 empirically distinguishable.
April 29, 2026
Coherence Signals: daily three-gate filter on Polymarket cross-market clusters
Live daily feed of cross-market structural-mispricing candidates surfaced by the Coherent Markets Engine. Each Polymarket event group is screened against probability axioms (binary additivity, mutually-exclusive sums, conditional Bayes), then through three statistical gates: BH-FDR-corrected significance, cost-adjusted expected value, and UMA-oracle resolution-risk filter. Public surface shows aggregate counts + per-signal characterizations with event identities redacted. Researcher tier ($49/mo) unlocks the full feed at /research/cme/feed with event slugs, trade construction, day-over-day signal delta, and a daily email digest. Each archive declares its pipeline_version; the V0.2 methodology is tracked in manifest #288046, with public receipt not yet verified and the backtest scheduled around 2026-07-29.
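The three probability-axiom screens named above can be illustrated with a few lines of arithmetic. This is a sketch of the checks in their textbook form, not the engine's code; the function names and example prices are hypothetical, and the statistical gates that follow the screen are not shown.

```python
def binary_additivity_gap(p_yes, p_no):
    """How far YES + NO prices on one binary market are from summing to 1."""
    return p_yes + p_no - 1.0


def exclusive_sum_excess(prices):
    """Amount by which mutually exclusive outcome prices sum above 1 (0 if none)."""
    return max(0.0, sum(prices) - 1.0)


def implied_conditional(p_joint, p_condition):
    """P(A | B) implied by prices for 'A and B' and for 'B' (Bayes' rule)."""
    if p_condition <= 0.0:
        raise ValueError("conditioning event must have positive price")
    return p_joint / p_condition


# Hypothetical prices, for illustration only.
print(binary_additivity_gap(0.62, 0.41))        # about +0.03
print(exclusive_sum_excess([0.5, 0.3, 0.3]))    # about 0.10
print(implied_conditional(0.10, 0.25))          # 0.4
```

A nonzero gap is only a candidate for review; the entry above describes significance, cost, and resolution-risk gates that a candidate must still pass.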
April 28, 2026
Coherent Markets Engine V0.1: constraint-projection structural inference
An observation-side port of the Pennock-Lahaie-Kroer LCMM coherence projection, applied to Polymarket. Reads gamma metadata, builds an event-relationship graph (binary additivity, mutually-exclusive sums, conditional Bayes), projects observed prices onto the constraint-feasible region, and emits Coherence Signals through a three-gate filter: BH-FDR-corrected statistical significance, cost-adjusted expected value (Polymarket fee + slippage + capital lockup), and UMA-oracle resolution risk score. Each signal carries a SHA-256 hash-chain audit-log row. Per the V2.8.2 result tracked in manifest #287983, the Coherent Markets Engine uses market prices as the inference input, not skill-weighted aggregation; the public receipt for that manifest entry is not yet verified. Closest published peer: Saguillo et al. 2025 (IMDEA Networks, AFT 2025, arXiv 2508.03474), retrospective per-pair analysis of $39.6M Polymarket arbitrage; Convexly extends that framing to scheduled live-market n-market polytope projection.
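A SHA-256 hash-chain audit log can be sketched as follows: each row stores the previous row's hash, so editing any earlier row invalidates every later one. The row layout, field names, and genesis value here are assumptions for illustration and are not taken from the engine.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the first row


def row_hash(prev_hash, payload):
    """Hash the previous row's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_row(log, payload):
    """Append a payload to the log, linking it to the previous row."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev_hash": prev,
                "hash": row_hash(prev, payload)})


def verify_chain(log):
    """Recompute every hash; return False if any row or link was altered."""
    prev = GENESIS
    for row in log:
        if row["prev_hash"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return False
        prev = row["hash"]
    return True


audit_log = []
append_row(audit_log, {"signal_id": 1, "gate": "fdr", "passed": True})
append_row(audit_log, {"signal_id": 2, "gate": "fdr", "passed": False})
print(verify_chain(audit_log))  # True
```

Serializing with sorted keys keeps the hash stable regardless of dictionary ordering, which matters when rows are re-verified later.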
April 27, 2026
MarketAlpha V2.8.2: 24 frozen aggregators, none beat the market on V1-M
Frozen V2.8.2 methodology test tracked in the preregistration manifest (#287436 + #287442 + V1-M cohort substitution amendment #287714) on the V1-M Polymarket cohort; public receipt URLs for those entries are not yet verified. The test of the PRIMARY hypothesis (W-EXP beta=4 + a=2.0 + S-RAW skill-weighted aggregation) does not reject H0 vs the market-implied baseline (delta_market = +0.179, 95% CI [+0.164, +0.193]) on 6,256 held-out markets. A 24-aggregator sweep extends the result: zero of 24 specifications beat the market on the held-out window. Substantive finding: PnL-skill on Polymarket is not a usable forecast-aggregation weight. Reproduces and reinforces V1-M's null finding. The wash-filter equivalence test is tracked in manifest #287983, also with public receipt not yet verified.
April 27, 2026
Edge Score Methodology V1.5: pre-registered E2 + E7 deferred-experiment results
Pre-registered V1.5 follow-up to the V1 paper, AsPredicted #287368, ran 2026-04-27 on the V1-M position tape (8,778 wallets, 542,241 resolved positions). Both primary tests failed ex-ante pre-reg thresholds: E2 per-wallet temporal holdout produced Spearman = +0.111 (95% CI [+0.046, +0.175], N=805) vs threshold +0.30; E7 per-quarter IC stability produced median ρ = +0.038 with 3 of 5 quarters positive vs the 5/6 requirement. Five supplementary analyses surface where V3b holds up (S4 partial Spearman +0.494 controlling for log_capital; S7 cross-category transfer ρ +0.17 to +0.33) and where it does not (S6 persistent-wallet inversion ρ = -0.31 for 229 wallets active 4+ quarters). Honest reframing: V3b is a cross-sectional ranker of wallet behavior, not a per-wallet temporal predictor.
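The Spearman statistic reported for the temporal holdout is a Pearson correlation computed on ranks. A minimal pure-Python sketch follows, assuming one score per wallet from an earlier window and one outcome per wallet from a later window; the toy data and function names are hypothetical and the confidence-interval step is omitted.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-wallet values: score in an early window, PnL in a later one.
early_score = [1, 2, 3, 4, 5]
later_pnl = [2, 1, 4, 3, 5]
rho = spearman(early_score, later_pnl)
print(round(rho, 3))  # 0.8
```

Keeping the score window strictly before the outcome window is what separates a holdout of this kind from a same-window diagnostic.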
April 27, 2026
Rolling rank-correlation diagnostic: Edge Score V3b vs same-window PnL on the V1-M cohort
A rolling 30-day rank correlation between the V3b composite and signed-log PnL computed on the SAME in-window positions, on the V1-M reference cohort. This is an in-sample contemporaneous diagnostic, NOT a forward test: the score and the realized PnL share the same positions and the dominant input (conviction) is PnL-derived, so it overstates predictive skill. The honest forward result is the pre-registered per-wallet temporal holdout (AsPredicted #287368): out-of-sample Spearman +0.11 (95% CI [0.05, 0.18]), which did not clear the pre-registered +0.30 threshold.
April 27, 2026
Convexly V1 vs Gomez-Cram et al.: methodology comparison
Two independent papers, published within 48 hours of each other and using two different statistical tests, both find concentration of skill on Polymarket. Side-by-side comparison of Convexly V1 (frozen-coefficient OLS, in-sample cross-sectional Spearman +0.514, Fama-French (2010) bootstrap null p < 0.0001 on the 8,656-wallet cohort, published 2026-04-18; a descriptive rank, not forward skill: the per-wallet forward test did not clear its pre-registered +0.30 threshold, with one specification landing below zero) and Gomez-Cram, Guo, Jensen, Kung 'Prediction Market Accuracy: Crowd Wisdom or Informed Minority?' (sign-randomization on 1.72M Polymarket accounts, SSRN 6617059, posted 2026-04-20, revised 2026-04-25).
April 27, 2026
Three Polymarket findings: data hooks for journalists
Three quotable empirical findings from the V1 (8,656 Polymarket wallets) and V1-M (15,106 Manifold users with a 1,647-paired sub-cohort) reference cohorts. PnL leaderboards rank by luck on a Hill alpha 1.28 distribution; concentration differs across the Manifold sweepcash window (causal attribution pending politics-excluded rerun); calibration alone explains 1.5% of who profits. Each reported with confidence interval, statistical test, and methodology link.
April 27, 2026
Posture, not calibration: aligning a pillar with its coefficient
Method-explainer note. The Edge Score posture pillar was originally labelled 'calibration', but its fitted coefficient points the other way. Renaming it to 'posture' preserves the measured effect without changing the math.
April 22, 2026
Edge Score V1-M: methodology extension and cross-venue invariance measurement
18,105,158 Manifold bets, 15,106 users, and a within-user paired comparison across the sweepcash window. The original bundle reported -8.9pp median concentration delta; the 2026-05-04 recovered-cohort politics-excluded rerun reports -9.2pp, 95% CI [-12.8, -3.6], Wilcoxon p = 0.0021 on n=515 of 1,208 paired users with concentration delta defined. The calibration secondary claim is superseded. Pillar coefficients diverge categorically between Polymarket and Manifold, with the discipline pillar flipping sign (permutation p = 0.0001). Hill alpha = 0.86 on Manifold per-user PnL vs 1.28 on Polymarket, non-overlapping 95% CIs.
PDF
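The Hill alpha values quoted for per-user PnL are tail-index estimates; a lower alpha means a heavier tail. Below is a sketch of the standard Hill estimator. How the papers choose the tail cutoff k, and whether they use positive or absolute PnL, is not stated on this page, so those choices and the synthetic sample are assumptions.

```python
import math
import random


def hill_alpha(values, k):
    """Hill tail-index estimate from the k largest positive observations."""
    positive = sorted((v for v in values if v > 0), reverse=True)
    if not 1 <= k < len(positive):
        raise ValueError("k must satisfy 1 <= k < number of positive values")
    threshold = positive[k]  # the (k+1)-th largest value
    log_excess = sum(math.log(positive[i] / threshold) for i in range(k))
    if log_excess == 0.0:
        raise ValueError("top-k values are all tied; estimate undefined")
    return k / log_excess


# Synthetic Pareto sample with a known tail index (illustration only).
rng = random.Random(0)
sample = [rng.paretovariate(1.28) for _ in range(5000)]
estimate = hill_alpha(sample, k=500)
print(round(estimate, 2))
```

The estimate is sensitive to k, so in practice it is usually inspected across a range of cutoffs rather than read off a single value.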
April 18, 2026
Edge Score Methodology V1: a composite skill measure for prediction-market traders
A composite skill measure fit on 8,656 Polymarket wallets. Posture plus conviction plus discipline with frozen OLS coefficients, fold-local refit, and a Fama-French bootstrap null (p < 0.0001 on 10,000 permutations).
PDF
April 16, 2026
10,000 Polymarket wallets scored. Calibration barely predicts profit.
A calibration audit of the full Polymarket profit leaderboard. 8,656 wallets, 582,921 resolved positions. Spearman r = +0.148 across the full sample; the top 1% by absolute PnL captures 36.2% of signed profit. Sizing and concentration dominate.
April 11, 2026
Testing the insider-trading claim on Polymarket with Taleb-proof methods
A calibration audit of the top 100 Polymarket profit wallets. The worst-calibrated quartile earns 2.02x more than the best. Spearman r = +0.42, stable under outlier removal. Eight wallets placed the same bet in an eight-day window.

Method-explainer notes ship on a rolling Monday cadence. New papers appear above as they publish.

Where the methodology lives