In-sample rank-correlation diagnostic

Edge Score V3b vs same-window PnL: an in-sample diagnostic, not a forward test

Read this first: target leakage, not a forward test

The rolling series below is an in-sample contemporaneous rank correlation (the Edge Score and the realized PnL are computed on the same in-window positions, and the dominant input, conviction = concentration = max|pnl|/total_risk, is PnL-derived). It is NOT a forward test; it overstates predictive skill and should be read as a descriptive diagnostic only.

The honest forward result is the pre-registered per-wallet temporal holdout (AsPredicted #287368): the frozen-coefficient Edge Score held a small positive out-of-sample Spearman of +0.11 (95% CI [0.05, 0.18]) with forward PnL. This did not clear the pre-registered +0.30 threshold, and a refit did not generalize. Forward predictive skill is small and not yet established at the pre-registered bar.

Across 26 rolling 30-day windows from October 1, 2025 through April 25, 2026 on the V1-M reference cohort, the V3b composite (frozen coefficients posture +0.7876, conviction +2.7220, discipline -1.1508) produced a mean rolling Spearman of +0.391 and a median of +0.375 against signed-log PnL computed on the same in-window positions. 26 of 26 windows finished above the Brier-only baseline of +0.147. This is an in-sample contemporaneous diagnostic, not a forward test: the score and the realized PnL share the same positions and the dominant input (conviction) is PnL-derived, so it overstates predictive skill. The honest forward result is the pre-registered per-wallet temporal holdout (AsPredicted #287368): out-of-sample Spearman +0.11 (95% CI [0.05, 0.18]), which did not clear the pre-registered +0.30 threshold.

Rolling Spearman (in-sample, same-window PnL)

30-day windows, stepped every 7 days, 10-position minimum per wallet. Score and PnL computed on the same in-window positions (in-sample contemporaneous, not forward). Shaded band is the 95% percentile bootstrap CI (1,000 resamples per window).
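The window grid described above can be reproduced with a short sketch. It assumes each window is labeled by an end date equal to its start plus 30 days, and that the last window must end on or before 2026-04-25. That labeling convention is an assumption rather than something taken from the generator script, though it does yield 26 windows and the two end dates quoted on this page. The name rolling_windows is illustrative.

```python
from datetime import date, timedelta

def rolling_windows(first_start, last_end, length_days=30, step_days=7):
    """Return (start, end) pairs for rolling windows.

    Assumption: `end` is start + length_days, and a window is kept only
    if its end falls on or before `last_end`.
    """
    windows = []
    start = first_start
    while start + timedelta(days=length_days) <= last_end:
        windows.append((start, start + timedelta(days=length_days)))
        start += timedelta(days=step_days)
    return windows

windows = rolling_windows(date(2025, 10, 1), date(2026, 4, 25))
print(len(windows))    # 26
print(windows[1][1])   # 2025-11-07
print(windows[-1][1])  # 2026-04-24
```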

Mean rolling Spearman

+0.391

across 26 windows

Best window

+0.520

n=724 wallets, ending 2025-11-07

Worst window

+0.299

n=1443 wallets, ending 2026-04-24

How to read this chart

The teal line is the Spearman correlation between the V3b composite and signed-log PnL within each 30-day window. Each window is scored independently using only the positions that resolved inside it, and the same positions also produce the PnL the score is correlated against. Because the score and the target share positions (and conviction is PnL-derived), the correlation is contemporaneous and in-sample, not forward. Wallets with fewer than 10 resolved positions in a window are excluded for that window.
The shaded band is the 95% percentile bootstrap CI. A narrow band means the rho estimate for that window is precise; a wide band means the cohort in that window was small or noisy. The point estimate and its interval are always reported together, never the point estimate alone, per the credibility-claim audit rules.
The amber dashed line at +0.514 is the in-sample out-of-fold Spearman published in the V1 paper. Both that benchmark and the rolling series here are in-sample measures, so windows at or above the line are not evidence of forward skill.
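The red dashed line at +0.147 is the Brier-only baseline (calibration alone vs PnL) published in the 10K-wallet study. Windows that finish above this line show a stronger in-sample association with PnL than raw calibration alone; because the score and the target share positions, that margin is not evidence of forward signal.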
The vertical dashed line marks the Control H blinding cutoff at 2025-09-30. The V3b coefficients were fit on a snapshot through April 15-16 2026. The cutoff is retained for context, but it does not make any segment a forward test: every window correlates the score against PnL from the same in-window positions, so the leakage is present on both sides of the cutoff.
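A minimal sketch of the per-window estimate and its band, assuming a plain percentile bootstrap over (score, PnL) pairs with tie-averaged ranks. It uses only the standard library and synthetic data. The function names, the seed, and the toy inputs are illustrative, and the generator script may differ in detail.

```python
import random

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the ranks; None if either side is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5

def bootstrap_ci(scores, pnls, n_resamples=1000, level=0.95, seed=0):
    """Point estimate plus percentile bootstrap interval, resampling pairs."""
    rng = random.Random(seed)
    n = len(scores)
    rhos = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho_b = spearman([scores[i] for i in idx], [pnls[i] for i in idx])
        if rho_b is not None:  # skip degenerate (constant) resamples
            rhos.append(rho_b)
    rhos.sort()
    m = len(rhos)
    tail = (1 - level) / 2
    lo = rhos[max(0, round(tail * m))]
    hi = rhos[min(m - 1, round((1 - tail) * m) - 1)]
    return spearman(scores, pnls), lo, hi

# Synthetic wallets: a score and a noisy, positively related PnL.
scores = list(range(60))
pnls = [s + 3 * ((s * 37) % 11 - 5) for s in scores]
rho, lo, hi = bootstrap_ci(scores, pnls)
print(rho, lo, hi)
```

Because the statistic is computed on ranks, any strictly increasing transform of either input leaves it unchanged; for example, spearman([1, 2, 3, 4], [1, 8, 27, 64]) evaluates to 1.0.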

Methodology

Model: Edge Score V3b (frozen production coefficients).
Coefficients: posture 0.7876, conviction 2.7220, discipline -1.1508. Frozen at the V1 paper publication and unchanged.
Cohort: V1-M reference cohort, n_unique_positions >= 5 in wallet_analysis_20260425_201800.csv.
Window: 30-day rolling, stepped every 7 days, with a minimum of 10 resolved positions per wallet per window.
Per-window features: skill_brier is computed window-locally as the wallet’s baseline Brier (under the window’s marginal frequency of wins) minus its observed Brier on the window’s positions. concentration is the largest absolute per-position PnL share over total in-window risk. n_unique_positions is the count of resolved positions inside the window.
Score: Edge Score V3b raw composite under the frozen production z-score parameters. Spearman depends only on ranks, so as long as the production percentile mapping is strictly monotone in the raw score, the unbounded raw score gives the same correlation as the mapped score.
Confidence interval: 95% percentile bootstrap on (Edge Score, signed-log PnL) pairs, 1,000 resamples per window.
Source data: services/api/scripts/output/wallet_all_positions_20260425_201800.csv (V1-M position tape, 542,241 resolved positions) and services/api/scripts/output/wallet_analysis_20260425_201800.csv (per-wallet features for the cohort filter).
Generator: services/api/scripts/forward_validation_rolling_spearman.py. Generated Mon, 27 Apr 2026 23:11:51 GMT.
Underlying JSON: forward-validation-rolling-spearman.json (one record per window; same schema as the chart data).
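The per-window feature definitions above can be sketched as follows. Several details are assumptions, not taken from the generator: forecasts are treated as the wallet's implied win probabilities per position, outcomes are 0/1, the window's marginal win frequency is passed in as base_rate, and signed-log PnL is taken to be sign(x) * log(1 + |x|), a common definition that this page does not spell out. All function names are illustrative.

```python
import math

def brier(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

def skill_brier(forecasts, outcomes, base_rate):
    """Baseline Brier (constant base_rate forecast) minus observed Brier."""
    baseline = brier([base_rate] * len(outcomes), outcomes)
    return baseline - brier(forecasts, outcomes)

def concentration(position_pnls, total_risk):
    """Largest absolute per-position PnL over total in-window risk."""
    return max(abs(p) for p in position_pnls) / total_risk

def signed_log(pnl):
    """Assumed transform: sign(pnl) * log(1 + |pnl|)."""
    return math.copysign(math.log1p(abs(pnl)), pnl)

# Toy wallet with four resolved positions in one window.
forecasts = [0.8, 0.3, 0.6, 0.9]
outcomes = [1, 0, 1, 1]
print(skill_brier(forecasts, outcomes, base_rate=0.75))  # about 0.1125
print(concentration([50.0, -120.0, 30.0], total_risk=400.0))  # 0.3
print(signed_log(-250.0))
```

Note that concentration takes per-position PnL as an input, which is the mechanical link between the score and the same-window PnL target.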

Limits of this dashboard

This is not a forward test. The score and the realized PnL are computed from the same in-window positions, and the dominant input (conviction = concentration = max|pnl|/total_risk) is PnL-derived. The series therefore has target leakage and overstates predictive skill. Read it as a descriptive contemporaneous diagnostic, not as out-of-sample predictive validation.
Survivor bias still applies. The cohort comes from the leaderboard-ranked snapshot; wallets that blew up and were delisted before the snapshot are absent from every window.
skill_brier and the conviction and discipline pillars are all computed window-locally, but this does not make the series forward: conviction is derived from in-window PnL, so the score is mechanically correlated with the same-window PnL it is scored against. The z-score reference statistics come from the V1 training cohort and are kept frozen here.
The min-positions-per-wallet filter excludes thin slices and cuts noise, but it also concentrates the estimate on high-activity wallets. Per-window cohort sizes are surfaced in the tooltip and in the underlying JSON.
A point estimate from a single window is not a methodology verdict, and neither is this whole series: it is in-sample. The real out-of-sample question is answered by the pre-registered per-wallet temporal holdout (AsPredicted #287368), which produced an out-of-sample Spearman of +0.11 (95% CI [0.05, 0.18]) with forward PnL and did not clear the pre-registered +0.30 threshold. See the V1.5 result for the forward finding.

The FDR-screened wallet registry and its forward test

The composite above is one lens. A separate, simpler question is whether any individual wallet’s resolved record beats chance after multiplicity correction. In a retrospective screen of per-wallet resolved Polymarket records (frozen 2026-04-25 position tape), 178 of 3,871 wallets cleared a Benjamini-Hochberg screen at q = 0.10, a cleared proportion of 4.6% (95% Wilson interval [3.98%, 5.30%]). Benjamini-Hochberg controls the expected share of false discoveries among cleared wallets at q, so at q = 0.10 roughly 17.8 or fewer of the 178 cleared wallets are expected to be false discoveries. The screen is in-sample and retrospective: it identifies records that were unlikely under chance, and it does not establish that those wallets keep an edge going forward.
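For readers who want to check the arithmetic, here is a minimal sketch of the two standard tools named above: the Benjamini-Hochberg step-up rule and the Wilson score interval. How the per-wallet p-values are computed is not specified on this page, so the p-values below are toy inputs; the function names are illustrative.

```python
import math

def benjamini_hochberg(p_values, q):
    """Return a list of booleans: True where the hypothesis clears at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k with p_(k) <= k * q / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank
    cleared = [False] * m
    for i in order[:cutoff_rank]:
        cleared[i] = True
    return cleared

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Toy p-values, not registry data.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20, 0.50], q=0.10))
# [True, True, True, False, False]

low, high = wilson_interval(178, 3871)
print(round(100 * low, 2), round(100 * high, 2))  # 3.98 5.3
```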

Whether the cleared set keeps its edge is the subject of a strictly prospective test, pre-registered as AsPredicted #294147 before the forward window opened. The candidate set (178 wallets), the control set (3,693 wallets), and the methodology were frozen before any forward evidence accrued. The forward window runs 2026-06-02 through 2026-08-30 and the test matures on time and outcomes alone.

Qualifying candidate wallets

34 / 40

floor of 40 not yet met

Resolved candidate positions

8,265 / 1,000

floor of 1,000 met

Days to window end

window ends 2026-08-30

Current verdict

insufficient_sample

as of 2026-06-18 UTC

This page reports accrual counts only, never interim pooled effect estimates, while the window is open. The verdict stays insufficient_sample until at least 40 candidate wallets clear the 10-position floor and at least 1,000 qualifying candidate forward positions resolve. A null or inconclusive outcome at maturity is a valid pre-registered result and will be reported as such.
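The accrual gate described above amounts to two floors checked together. The sketch below assumes the counts are supplied by the caller. The label "floors_met" is hypothetical: the page only names insufficient_sample and does not say which verdict labels follow once both floors are met and the window closes.

```python
from datetime import date

MIN_QUALIFYING_WALLETS = 40    # candidate wallets clearing the 10-position floor
MIN_RESOLVED_POSITIONS = 1000  # qualifying candidate forward positions resolved
WINDOW_END = date(2026, 8, 30)

def accrual_verdict(qualifying_wallets, resolved_positions):
    """Both floors must be met; otherwise the verdict stays insufficient_sample."""
    if (qualifying_wallets < MIN_QUALIFYING_WALLETS
            or resolved_positions < MIN_RESOLVED_POSITIONS):
        return "insufficient_sample"
    return "floors_met"  # hypothetical label, not from the page

def days_to_window_end(as_of):
    """Calendar days from `as_of` to the window end date."""
    return (WINDOW_END - as_of).days

print(accrual_verdict(34, 8265))             # insufficient_sample
print(days_to_window_end(date(2026, 6, 18)))  # 73
```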

Registry dataset

The frozen per-wallet registry behind this screen is being prepared for public release and is pending a data-release review. Nothing is downloadable today. Leave an email address and the dataset link will be sent after release.

References

V1 paper: Edge Score Methodology V1 (in-sample OOF Spearman +0.514, Fama-French bootstrap null p < 0.0001).
V1-M paper: Edge Score V1-M cross-venue extension (defines the V1-M reference cohort used here).
Brier-only baseline: 10,000-wallet Polymarket study (Spearman +0.147 between Brier and signed-log PnL).