Forward-validation dashboard

Edge Score V3b: rolling forward-validation on the V1-M cohort

Across 26 rolling 30-day windows from October 1, 2025 through April 25, 2026 on the V1-M reference cohort, the V3b composite (frozen coefficients: posture +0.7876, conviction +2.7220, discipline -1.1508) produced a mean rolling Spearman of +0.391 and a median of +0.375. All 26 windows finished above the Brier-only baseline of +0.147, and the rolling estimate stays above zero across the entire held-out period.

Rolling Spearman

30-day windows, stepped every 7 days, 10-position minimum per wallet. Shaded band is the 95% percentile bootstrap CI (1,000 resamples per window).
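
The window schedule described above can be sketched as follows. This is a minimal illustration, not the generator script itself: the function name rolling_windows and the boundary convention (inclusive start, exclusive end at start + 30 days) are assumptions. Under that convention the schedule yields 26 windows between October 1, 2025 and April 25, 2026, with end dates that line up with the "ending" dates shown on the cards.

```python
from datetime import date, timedelta


def rolling_windows(first_start, last_day, length_days=30, step_days=7):
    """Yield (start, end) pairs; start is inclusive, end is exclusive (assumed)."""
    start = first_start
    while start + timedelta(days=length_days) <= last_day:
        yield start, start + timedelta(days=length_days)
        start += timedelta(days=step_days)


# Date range taken from the summary paragraph of this dashboard.
windows = list(rolling_windows(date(2025, 10, 1), date(2026, 4, 25)))

print(len(windows))    # number of windows
print(windows[0])      # first (start, end)
print(windows[-1])     # last (start, end)
```

A wallet would then be scored in a window only if it has at least 10 positions that resolved between that window's start and end.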

Mean rolling Spearman
+0.391
across 26 windows
Best window
+0.520
n=724 wallets, ending 2025-11-07
Worst window
+0.299
n=1443 wallets, ending 2026-04-24

How to read this chart

  • The teal line is the Spearman correlation between the V3b composite and signed-log PnL within each 30-day window. Each window is scored independently using only the positions that resolved inside it. Wallets with fewer than 10 resolved positions in a window are excluded for that window.
  • The shaded band is the 95% percentile bootstrap CI. A narrow band means the rho estimate for that window is precise; a wide band means the cohort in that window was small or noisy. The point estimate and its band are always reported together, never the point estimate alone, per the credibility-claim audit rules.
  • The amber dashed line at +0.514 is the in-sample out-of-fold (OOF) Spearman published in the V1 paper. Held-out windows that match or exceed this line are running at or above the in-sample benchmark.
  • The red dashed line at +0.147 is the Brier-only baseline (calibration alone vs PnL) published in the 10K-wallet study. Windows that finish above this line are adding signal beyond raw calibration.
  • The vertical dashed line marks the Control H blinding cutoff at 2025-09-30. The V3b coefficients were fit on a snapshot through April 15-16, 2026, so windows that overlap the fitting period are not strictly out-of-sample. The cutoff is shown so the reader can separate the overlapping segment from the pure forward segment.
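
The per-window statistic and its band, as described in the bullets above, can be sketched in plain Python. This is an illustration under stated assumptions, not the production code: signed-log PnL is assumed to mean sign(x) * log(1 + |x|), ties are given average ranks, and the helper names (signed_log, average_ranks, spearman, bootstrap_ci) are hypothetical. The data at the bottom is synthetic.

```python
import math
import random


def signed_log(pnl):
    # Assumed transform: sign(x) * log(1 + |x|); the source does not spell it out.
    return math.copysign(math.log1p(abs(pnl)), pnl)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def bootstrap_ci(scores, pnls, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample (score, pnl) pairs with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    rhos = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([scores[i] for i in idx], [pnls[i] for i in idx])
        if not math.isnan(rho):
            rhos.append(rho)
    rhos.sort()
    lo = rhos[int((alpha / 2) * len(rhos))]
    hi = rhos[min(len(rhos) - 1, int((1 - alpha / 2) * len(rhos)))]
    return lo, hi


# Synthetic stand-in for one window: 200 wallets with a noisy positive relation.
rng = random.Random(42)
scores = [rng.gauss(0, 1) for _ in range(200)]
pnls = [signed_log(100 * (s + rng.gauss(0, 2))) for s in scores]
rho = spearman(scores, pnls)
lo, hi = bootstrap_ci(scores, pnls)
print(f"rho={rho:+.3f}, 95% CI=({lo:+.3f}, {hi:+.3f})")
```

Resampling whole (score, PnL) pairs keeps each wallet's two values together, which is what makes the resulting band a statement about the correlation rather than about either variable alone.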

Methodology

Model
Edge Score V3b (frozen production coefficients).
Coefficients
posture +0.7876, conviction +2.7220, discipline -1.1508. Frozen at the V1 paper publication and unchanged since.
Cohort
V1-M reference cohort, n_unique_positions >= 5 in wallet_analysis_20260425_201800.csv.
Window
30-day rolling, stepped every 7 days, with a minimum of 10 resolved positions per wallet per window.
Per-window features
skill_brier is computed window-locally as the wallet’s baseline Brier (under the window’s marginal frequency of wins) minus its observed Brier on the window’s positions. concentration is the largest absolute per-position PnL share over total in-window risk. n_unique_positions is the count of resolved positions inside the window.
Score
Edge Score V3b raw composite under the frozen production z-score parameters. Spearman is rank-invariant, so the unbounded raw score is equivalent to the production percentile mapping for correlation purposes.
Confidence interval
95% percentile bootstrap on (Edge Score, signed-log PnL) pairs, 1,000 resamples per window.
Source data
services/api/scripts/output/wallet_all_positions_20260425_201800.csv (V1-M position tape, 542,241 resolved positions) and services/api/scripts/output/wallet_analysis_20260425_201800.csv (per-wallet features for the cohort filter).
Generator
services/api/scripts/forward_validation_rolling_spearman.py. Generated Mon, 27 Apr 2026 23:11:51 GMT.
Underlying JSON
forward-validation-rolling-spearman.json (one record per window; same schema as the chart data).
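
The window-local skill_brier described under "Per-window features" can be sketched as below. Several details are assumptions made for illustration: the Position record and its implied_prob field (the probability the wallet's entry implies) are hypothetical, and the window's marginal win frequency is taken over all positions in the window rather than per wallet. The 10-position minimum comes from the "Window" entry above.

```python
from dataclasses import dataclass


@dataclass
class Position:
    wallet: str
    implied_prob: float  # hypothetical field: probability implied by the entry
    won: int             # 1 if the position resolved as a win, else 0


def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)


def window_skill_brier(positions, min_positions=10):
    """Per-wallet baseline Brier minus observed Brier, for one window."""
    # Assumed: the marginal win frequency is pooled over every position in the window.
    base_rate = sum(p.won for p in positions) / len(positions)
    by_wallet = {}
    for p in positions:
        by_wallet.setdefault(p.wallet, []).append(p)
    skill = {}
    for wallet, rows in by_wallet.items():
        if len(rows) < min_positions:
            continue  # wallet is excluded for this window
        outcomes = [r.won for r in rows]
        baseline = brier([base_rate] * len(rows), outcomes)
        observed = brier([r.implied_prob for r in rows], outcomes)
        skill[wallet] = baseline - observed
    return skill


# Toy window: wallet "a" has 10 well-calibrated positions, wallet "b" has only 2.
toy = [Position("a", 0.9, 1) for _ in range(5)]
toy += [Position("a", 0.1, 0) for _ in range(5)]
toy += [Position("b", 0.5, 1), Position("b", 0.5, 0)]
skill = window_skill_brier(toy)
print(skill)
```

A positive value means the wallet's forecasts beat the constant base-rate forecast inside that window; wallet "b" does not appear because it falls under the minimum.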

Limits of this dashboard

  • Survivor bias still applies. The cohort comes from the leaderboard-ranked snapshot; wallets that blew up and were delisted before the snapshot are absent from every window.
  • Window-local skill_brier is a fair forward proxy because it uses only positions inside the window. The conviction and discipline pillars are also computed window-locally. The z-score reference statistics, however, come from the V1 training cohort and are kept frozen here. Refitting them per window would change the question being asked.
  • The min-positions-per-wallet filter excludes thin slices and cuts noise, but it also concentrates the held-out estimate on high-activity wallets. The cohort sizes per window are surfaced in the tooltip and in the underlying JSON.
  • A point estimate from a single window is not a methodology verdict. Multiple consecutive windows above the Brier-only baseline, each with a CI that lies entirely above zero, is the right standard. The summary above states only what the data shows across the full held-out period.
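
The frozen-parameter scoring referred to above can be sketched as follows, assuming the raw composite is a linear combination of z-scored pillars. The coefficients are the ones quoted on this page; the reference means and standard deviations in FROZEN_REF are placeholders, since the real frozen values are not given here, and the logistic map merely stands in for the production percentile mapping. The sketch also shows why Spearman does not care which of the two scales is used: any strictly increasing mapping leaves the rank order unchanged.

```python
import math

# Frozen coefficients as quoted on this dashboard.
COEFFS = {"posture": 0.7876, "conviction": 2.7220, "discipline": -1.1508}

# Placeholder (mean, std) reference statistics; the real frozen values are not in the source.
FROZEN_REF = {"posture": (0.0, 1.0), "conviction": (0.0, 1.0), "discipline": (0.0, 1.0)}


def raw_composite(pillars, ref=FROZEN_REF):
    """Weighted sum of pillars z-scored against fixed reference statistics."""
    return sum(COEFFS[k] * (pillars[k] - ref[k][0]) / ref[k][1] for k in COEFFS)


def percentile_like(score):
    # Stand-in for the production mapping: any strictly increasing function works here.
    return 100 / (1 + math.exp(-score))


def rank_order(values):
    """Indices of the values sorted from lowest to highest."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Three hypothetical wallets with window-local pillar values.
wallets = [
    {"posture": 0.2, "conviction": 1.1, "discipline": 0.4},
    {"posture": -0.5, "conviction": 0.3, "discipline": 1.2},
    {"posture": 1.0, "conviction": -0.2, "discipline": -0.6},
]
raw = [raw_composite(w) for w in wallets]
mapped = [percentile_like(s) for s in raw]
print(rank_order(raw), rank_order(mapped))
```

Refitting FROZEN_REF inside each window would re-center every window's scores on that window's own cohort, which is the change of question the second limit warns about.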

References