Forward-validation dashboard
Edge Score V3b: rolling forward-validation on the V1-M cohort
Across 26 rolling 30-day windows from October 1, 2025 through April 25, 2026 on the V1-M reference cohort, the V3b composite (frozen coefficients: posture +0.7876, conviction +2.7220, discipline -1.1508) produced a mean rolling Spearman of +0.391 and a median of +0.375. All 26 windows finished above the Brier-only baseline of +0.147, and the rolling estimate stays above zero across the entire held-out period.
Rolling Spearman
30-day windows, stepped every 7 days, 10-position minimum per wallet. Shaded band is the 95% percentile bootstrap CI (1,000 resamples per window).
Mean rolling Spearman
+0.391
across 26 windows
Best window
+0.520
n=724 wallets, ending 2025-11-07
Worst window
+0.299
n=1443 wallets, ending 2026-04-24
How to read this chart
- The teal line is the Spearman correlation between the V3b composite and signed-log PnL within each 30-day window. Each window is scored independently using only the positions that resolved inside it. Wallets with fewer than 10 resolved positions in a window are excluded for that window.
- The shaded band is the 95% percentile bootstrap CI. A narrow band means the rho estimate for that window is precise; a wide band means the cohort in that window was small or noisy. The point estimate and the band are always reported together, never the point estimate alone, per the credibility-claim audit rules.
- The amber dashed line at +0.514 is the in-sample OOF Spearman published in the V1 paper. Held-out windows that match or exceed this line are running at or above the in-sample benchmark.
- The red dashed line at +0.147 is the Brier-only baseline (calibration alone vs PnL) published in the 10K-wallet study. Windows that finish above this line are adding signal beyond raw calibration.
- The vertical dashed line marks the Control H blinding cutoff at 2025-09-30. The V3b coefficients were fit on a snapshot through April 15-16, 2026; windows that overlap the fitting period are not strictly out-of-sample. The cutoff is shown so the reader can separate the overlapping segment from the pure forward segment.
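The windowing and per-window scoring described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the generator script: `signed_log` is assumed to be sign(pnl) * log(1 + |pnl|) applied to a wallet's total in-window PnL (the page does not define the transform), windows are treated as half-open intervals whose reported end is start + 30 days, and `score_fn` is a hypothetical callable standing in for the V3b composite.

```python
from datetime import date, timedelta
import math


def signed_log(pnl):
    # Assumed transform: sign(pnl) * log(1 + |pnl|). Not defined on the page.
    return math.copysign(math.log1p(abs(pnl)), pnl)


def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(vx * vy)


def window_bounds(start, end, length_days=30, step_days=7):
    """Yield (window_start, window_end) with window_end = start + length, end-exclusive."""
    ws = start
    while ws + timedelta(days=length_days) <= end:
        yield ws, ws + timedelta(days=length_days)
        ws += timedelta(days=step_days)


def rolling_spearman(positions, score_fn, start, end, min_positions=10):
    """positions: iterable of (wallet, resolved_date, pnl) tuples.
    score_fn(wallet, pnls) is a hypothetical stand-in for the composite score."""
    positions = list(positions)
    results = []
    for ws, we in window_bounds(start, end):
        by_wallet = {}
        for wallet, resolved, pnl in positions:
            if ws <= resolved < we:  # only positions resolved inside the window
                by_wallet.setdefault(wallet, []).append(pnl)
        # Drop wallets with too few resolved positions in this window.
        eligible = {w: p for w, p in by_wallet.items() if len(p) >= min_positions}
        xs = [score_fn(w, p) for w, p in eligible.items()]
        ys = [signed_log(sum(p)) for p in eligible.values()]
        rho = spearman(xs, ys) if len(xs) >= 2 else float("nan")
        results.append({"window_end": we, "n_wallets": len(xs), "rho": rho})
    return results
```

Under this assumed convention, `window_bounds(date(2025, 10, 1), date(2026, 4, 25))` yields 26 windows, including ones ending 2025-11-07 and 2026-04-24, which is consistent with the summary cards above.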
Methodology
- Model
- Edge Score V3b (frozen production coefficients).
- Coefficients
- posture +0.7876, conviction +2.7220, discipline -1.1508. Frozen at the V1 paper publication and unchanged.
- Cohort
- V1-M reference cohort, n_unique_positions >= 5 in wallet_analysis_20260425_201800.csv.
- Window
- 30-day rolling, stepped every 7 days, with a minimum of 10 resolved positions per wallet per window.
- Per-window features
- skill_brier is computed window-locally as the wallet’s baseline Brier (under the window’s marginal frequency of wins) minus its observed Brier on the window’s positions. concentration is the largest absolute per-position PnL share over total in-window risk. n_unique_positions is the count of resolved positions inside the window.
- Score
- Edge Score V3b raw composite under the frozen production z-score parameters. Spearman depends only on ranks, so it is unchanged by any strictly monotone transformation of the score; the unbounded raw score is therefore equivalent to the production percentile mapping for correlation purposes.
- Confidence interval
- 95% percentile bootstrap on (Edge Score, signed-log PnL) pairs, 1,000 resamples per window.
- Source data
- services/api/scripts/output/wallet_all_positions_20260425_201800.csv (V1-M position tape, 542,241 resolved positions) and services/api/scripts/output/wallet_analysis_20260425_201800.csv (per-wallet features for the cohort filter).
- Generator
- services/api/scripts/forward_validation_rolling_spearman.py. Generated Mon, 27 Apr 2026 23:11:51 GMT.
- Underlying JSON
- forward-validation-rolling-spearman.json (one record per window; same schema as the chart data).
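Two of the methodology entries, the window-local skill_brier and the percentile bootstrap CI, lend themselves to a short sketch. The code below is a hedged illustration, not the generator's implementation. It assumes each position carries a forecast probability of winning (for example an entry price; the page does not say which) and a 0/1 outcome, and that `base_rate` is the window's marginal win frequency. The bootstrap resamples (score, PnL) pairs with replacement and takes the 2.5th and 97.5th percentiles by a simple nearest-rank rule; the function names are hypothetical.

```python
import math
import random


def skill_brier(forecasts, outcomes, base_rate):
    """Baseline Brier (constant base-rate forecast) minus observed Brier.
    Positive values mean the wallet beat the base-rate forecast."""
    n = len(outcomes)
    baseline = sum((base_rate - o) ** 2 for o in outcomes) / n
    observed = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return baseline - observed


def _ranks(values):
    # Average ranks, 1-based, with ties sharing the mean rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho as the Pearson correlation of average ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else float("nan")


def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Spearman rho over (score, pnl) pairs."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(pairs)
    estimates = []
    for _ in range(n_resamples):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        rho = spearman([s for s, _ in sample], [p for _, p in sample])
        if not math.isnan(rho):  # skip degenerate resamples
            estimates.append(rho)
    if not estimates:
        raise ValueError("no valid bootstrap resamples")
    estimates.sort()
    m = len(estimates) - 1
    lo = estimates[math.floor(alpha / 2 * m)]
    hi = estimates[math.ceil((1 - alpha / 2) * m)]
    return lo, hi
```

For example, `skill_brier([0.9, 0.1], [1, 0], base_rate=0.5)` compares a baseline Brier of 0.25 against an observed Brier of 0.01. Resampling whole pairs, rather than scores and PnL separately, preserves the dependence between the two columns that the correlation is meant to measure.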
Limits of this dashboard
- Survivor bias still applies. The cohort comes from the leaderboard-ranked snapshot; wallets that blew up and were delisted before the snapshot are absent from every window.
- Window-local skill_brier is a fair forward proxy because it uses only positions inside the window. The conviction and discipline pillars are also computed window-locally. The z-score reference statistics, however, come from the V1 training cohort and are kept frozen here. Refitting them per window would change the question being asked.
- The min-positions-per-wallet filter excludes thin slices and cuts noise, but it also concentrates the held-out estimate on high-activity wallets. The cohort sizes per window are surfaced in the tooltip and in the underlying JSON.
- A point estimate from a single window is not a methodology verdict. The right standard is multiple consecutive windows above the Brier-only baseline, each with a CI that excludes zero. The summary above states only what the data shows across the full held-out period.
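The standard in the last bullet can be checked mechanically against the per-window records. The sketch below is one possible reading, not a documented rule: it takes the CI condition to mean that the lower bound sits above zero, and it assumes each record exposes `rho` and `ci_low` fields. Those field names are hypothetical; the actual schema of the underlying JSON is not shown on this page.

```python
def longest_qualifying_run(windows, baseline=0.147):
    """Length of the longest streak of consecutive windows whose rho beats
    the baseline and whose CI lower bound is above zero.
    `windows` is a chronologically ordered list of dicts with hypothetical
    keys "rho" and "ci_low"."""
    best = current = 0
    for w in windows:
        if w["rho"] > baseline and w["ci_low"] > 0:
            current += 1
            best = max(best, current)
        else:
            current = 0  # streak broken; start counting again
    return best
```

A run equal to the number of windows would mean every window met both conditions; a single qualifying window surrounded by failures yields a run of 1 and, by the standard above, should not be read as a verdict.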
References
- V1 paper: Edge Score Methodology V1 (in-sample OOF Spearman +0.514, Fama-French bootstrap null p < 0.0001).
- V1-M paper: Edge Score V1-M cross-venue extension (defines the V1-M reference cohort used here).
- Brier-only baseline: 10,000-wallet Polymarket study (Spearman +0.147 between Brier and signed-log PnL).