V3b over V1: model selection under market-selection bias

The two candidates

V1 used z(-brier_score) as the first feature. Brier score is the mean squared error of a wallet's probabilities against realized outcomes, so lower is better. Negating it flips the direction so larger values of the feature mean better fit.
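As a minimal sketch of that sign convention, the snippet below negates a handful of made-up per-wallet Brier scores and standardizes them across the cohort. The wallet names, the values, and the `z_scores` helper are hypothetical, and the use of the population standard deviation is an assumption; the source does not say which estimator was used.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population std; the source does not specify
    return [(v - mu) / sigma for v in values]

# Hypothetical per-wallet Brier scores (lower is better).
brier_by_wallet = {"wallet_a": 0.10, "wallet_b": 0.20, "wallet_c": 0.30}

# Negate first so that a larger feature value means a better fit, then z-score.
negated = [-b for b in brier_by_wallet.values()]
feature = dict(zip(brier_by_wallet, z_scores(negated)))

for wallet, value in feature.items():
    print(wallet, round(value, 3))
```

The wallet with the lowest Brier score ends up with the highest feature value, which is the direction the composite expects.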

V3b used z(-skill_brier). Skill-Brier is the wallet's baseline Brier minus its observed Brier, where the baseline is computed by always predicting the wallet's own marginal outcome frequency. Skill-Brier is positive when the wallet beats its own trivial baseline; it equals zero when the wallet is indistinguishable from predicting its own base rate. V3b negates the feature for the same sign convention as V1.
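The definition above can be sketched in a few lines. This is an illustration of the two raw quantities as described in the text, not Convexly's implementation; the function names and the four-bet wallet are hypothetical.

```python
from statistics import mean

def brier_score(probs, outcomes):
    """Mean squared error of forecast probabilities against 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))

def skill_brier(probs, outcomes):
    """Baseline Brier minus observed Brier.

    The baseline always predicts the wallet's own marginal outcome frequency.
    """
    base_rate = mean(outcomes)
    baseline = brier_score([base_rate] * len(outcomes), outcomes)
    return baseline - brier_score(probs, outcomes)

# Hypothetical wallet with four resolved bets.
probs = [0.9, 0.2, 0.7, 0.4]
outcomes = [1, 0, 1, 0]

print(brier_score(probs, outcomes))  # observed Brier
print(skill_brier(probs, outcomes))  # positive: beats its own base rate

# A wallet that only ever predicts its base rate has zero skill by construction.
print(skill_brier([0.5] * 4, outcomes))
```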

The two features correlate strongly but not perfectly. On the training cohort, the Spearman correlation between brier_score and skill_brier was high enough that the two composites produced nearly identical rankings at the top and bottom of the distribution, but they diverged in the middle.

The tie

Both candidates were evaluated on the five pre-registered out-of-sample experiments reported in the V1 methodology paper. The primary metric was out-of-fold (OOF) Spearman correlation between the composite score and signed log-PnL. V3b scored an OOF Spearman of +0.514. V1 scored +0.507. The 0.007-point gap sat well inside the fold-noise band of 0.044 observed across 10-fold cross-validation resamples.
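The arithmetic behind the tie can be written out directly. The sketch below takes the three reported numbers as given; the `within_noise` helper is hypothetical, and how the 0.044 band was derived from the resamples is not stated here, so it is treated as a plain threshold.

```python
def within_noise(score_a, score_b, noise_band):
    """True when the gap between two scores is smaller than the noise band."""
    return abs(score_a - score_b) < noise_band

v3b_spearman = 0.514   # reported OOF Spearman for V3b
v1_spearman = 0.507    # reported OOF Spearman for V1
fold_noise = 0.044     # reported fold-noise band

gap = v3b_spearman - v1_spearman
print(round(gap, 3))                # size of the gap
print(round(gap / fold_noise, 2))   # gap as a fraction of the noise band
print(within_noise(v3b_spearman, v1_spearman, fold_noise))
```

The gap is roughly a sixth of the noise band, which is why neither composite can be called the better fit on this cohort.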

Stopping here would have been a coin flip. The difference is not statistically meaningful in the rank-metric sense. A reviewer asking the question “which composite fits better on this cohort?” would be told “they are indistinguishable.” The question Convexly asked instead was “which composite fails less badly when a new wallet is not like the cohort?”

The mechanism V1 cannot handle

Consider a wallet that only bets on Polymarket markets trading at 99 cents on the yes side. Every bet is a near-certainty. The outcomes almost always resolve in the predicted direction. Brier score for this wallet is near zero by construction. Raw Brier says the wallet is an elite forecaster. Skill-Brier is also near zero because the wallet's marginal frequency is 99 percent in favor of yes, and always predicting 99 percent yes gets you a Brier score close to the one you actually achieved. Skill-Brier correctly reports that the wallet is not beating its own trivial baseline.

A composite that uses raw Brier rewards this wallet. A composite that uses skill-Brier does not. On the training cohort this difference is a second-order effect because most wallets hold a mix of certainties and uncertainties. Off the training cohort, among wallets deliberately selecting the markets they bet on, the difference is first-order.
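A small worked example makes the ranking flip concrete. Both wallets below are hypothetical: one places 100 bets at 0.99 on markets that resolve yes 99 times, the other places 100 bets on evenly split markets and leans the right way each time. The helper functions follow the definitions given earlier and are illustrative only.

```python
from statistics import mean

def brier(probs, outcomes):
    """Mean squared error of forecast probabilities against 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))

def skill_brier(probs, outcomes):
    """Baseline Brier (always predict own base rate) minus observed Brier."""
    base_rate = mean(outcomes)
    return brier([base_rate] * len(outcomes), outcomes) - brier(probs, outcomes)

# Hypothetical market-selecting wallet: only near-certain markets.
easy_probs = [0.99] * 100
easy_outcomes = [1] * 99 + [0]

# Hypothetical discriminating wallet: uncertain markets, correct lean each time.
mixed_probs = [0.8] * 50 + [0.2] * 50
mixed_outcomes = [1] * 50 + [0] * 50

easy = (brier(easy_probs, easy_outcomes), skill_brier(easy_probs, easy_outcomes))
mixed = (brier(mixed_probs, mixed_outcomes), skill_brier(mixed_probs, mixed_outcomes))

print("easy  brier=%.4f skill=%.4f" % easy)
print("mixed brier=%.4f skill=%.4f" % mixed)
```

Raw Brier ranks the market-selecting wallet first because its score is closer to zero. Skill-Brier ranks it last because its forecasts add nothing over its own base rate, while the second wallet cuts its baseline error substantially.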

Market-selection bias is not a hypothetical. It is how a meaningful minority of Polymarket wallets operate. Any composite promoted as a skill measure has to handle wallets that game the base rate by picking only the easy markets. V1 does not; V3b does.

Why a tie-break on robustness is the right move

Two models that tie on the training set are not equal on the population. The tie is a fact about the sample. A principled tie-break uses information from outside the sample. The information available to Convexly Research at the time of model selection was the mechanism by which the features could fail. Raw Brier has a known, reproducible failure mode (market-selection gaming) that skill-Brier does not share. Preferring skill-Brier is preferring the model with the smaller attack surface.

The alternative principle, “pick the simpler model when they tie,” would have chosen V1. V1 is arguably simpler because Brier score requires no baseline computation. Convexly rejected that principle here because the simplicity difference is cosmetic. Computing skill-Brier adds one subtraction at training time and zero additional complexity at inference. The robustness difference is substantive.

What this does not prove

V3b was not chosen because it fit better. It did not. V3b was chosen because the feature underlying it is harder to game. A skeptic is entitled to ask for V3b to beat V1 on a held-out cohort where market-selection gaming is deliberately introduced. That experiment is on the V1.5 follow-up list and is externally pre-registered at AsPredicted #287368. Until V1.5 runs, the V3b preference is a principled decision, not a validated one.

The claim is: on the training cohort alone, V3b and V1 are indistinguishable; Convexly picked V3b because the failure mode it avoids is real and named. A replication on Manifold settlement data will test whether the preference generalizes.

The generalizable lesson

When two candidate models tie inside fold noise, the tiebreaker should be the failure mode, not the fit. Every model fails somewhere. The question at model selection time is where, and whether the failure is plausible in deployment. Raw Brier fails on wallets that pick easy markets. Skill-Brier does not. That is the entire decision.

Read the full methodology paper.

Frozen coefficients, fold-local refit, Fama-French bootstrap null, per-wallet temporal holdout. 13-page PDF with citations and code references.

Related: Posture, not Calibration (the pillar rename that came out of the same post-validation review) and 10,000 Polymarket Wallets Scored (empirical foundation).