Why not a t-test or Sharpe ratio?

The usual parametric tests assume finite variance. On Polymarket wallet PnL, the Hill estimator gives alpha = 1.28, and a tail index below 2 implies that the variance of the underlying distribution is infinite. Under infinite variance, t-statistics, Sharpe ratios, and OLS R-squared confidence intervals are not well-behaved. Rank-based statistics (Spearman correlation, Mann-Whitney U, Kendall's tau) depend only on the ranks of the observations, which are bounded by construction, so their sampling distributions remain valid under fat tails.

What is a Fama-French bootstrap null?

Fama and French (2010) introduced a resampling protocol for distinguishing mutual fund skill from luck: simulate the cross-section of fund outcomes under a null in which true skill is zero, then compare the observed skill statistic to that simulated distribution. The test is non-parametric and makes no assumption about the shape of the PnL distribution. Convexly adapted the protocol to cross-sectional prediction-market wallets by permuting wallet-PnL labels while holding the Edge Score composite fixed, then comparing the observed Spearman rank correlation to the 10,000-sample permutation distribution. Details are in §5.6 of the methodology paper.

How do I tell if a single wallet has repeatable skill or noisy PnL?

This is harder than the population-level question. For a single wallet, repeatability inference needs either long time-series data with purging and embargo to control for look-ahead bias, or a peer-cohort persistence test (Forsberg, Gallagher, Warren 2021 framework). The Convexly wallet analyzer gives you a percentile rank against 8,656 peers, which addresses the cross-sectional question. It does not prove per-wallet temporal persistence: the V1.5 per-wallet temporal holdout, run 2026-04-27, returned a forward Spearman of +0.11 (95% CI [0.05, 0.18]), below the +0.30 bar set in advance.
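To make the cross-sectional part concrete, here is a minimal sketch of a percentile rank against a peer cohort. The function name `percentile_rank` and the mid-rank treatment of ties are assumptions for illustration; the analyzer's actual definition is not given in this text.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(peer_scores, wallet_score):
    """Mid-rank percentile in [0, 100].

    Counts the share of peers strictly below the wallet, plus half of any
    peers tied with it. The tie rule is an assumption of this sketch.
    """
    ordered = sorted(peer_scores)
    below = bisect_left(ordered, wallet_score)
    ties = bisect_right(ordered, wallet_score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)

# Hypothetical toy cohort of five peer scores.
peers = [10, 20, 30, 40, 50]
middle = percentile_rank(peers, 30)
lowest = percentile_rank(peers, 5)
highest = percentile_rank(peers, 60)
```

A percentile computed this way says where a wallet sits among its peers at one point in time. It carries no information about whether that position persists, which is the question the temporal holdout addresses.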

Does Edge Score separate skill from luck?

At the population level, yes: the Edge Score composite achieves out-of-fold Spearman +0.514 against signed log PnL on an 8,656-wallet cohort, with the Fama-French bootstrap null rejected at p < 0.0001 on 10,000 permutations. At the individual-wallet level, the score is a percentile rank on the skill dimension; individual-PnL inference is still subject to Hill alpha = 1.28 fat-tail variance. The score separates median outcomes, not expected returns for any specific trader. In its one out-of-sample forward test, a pre-set per-wallet temporal holdout, the frozen composite held a Spearman of +0.11 (95% CI [0.05, 0.18]) with forward PnL, below the +0.30 bar set in advance; treat Edge Score as a descriptive behavioral profile, not a validated skill ranking.

How to Distinguish Skill from Luck in Trading PnL (Fama-French Bootstrap, Applied to Polymarket)

Q: How do you distinguish skill from luck in trading PnL?

The standard protocol is a Fama-French (2010) bootstrap null model: permute the PnL labels across traders and recompute whatever skill statistic you are measuring. If the observed statistic lies outside the permutation distribution, the null of zero skill is rejected. Convexly applied this protocol to 8,656 Polymarket wallets with the Edge Score composite as the skill statistic, running 10,000 permutations. The observed out-of-fold Spearman of +0.514 against signed log PnL lies outside every permuted sample; one-sided p < 0.0001 (in-sample; forward test +0.11 [0.05, 0.18], below the +0.30 bar).

Why parametric tests fail here

Prediction-market PnL is fat-tailed. The Hill estimator on 8,656 Polymarket wallets returns alpha = 1.28 with a 95% confidence interval of 1.20 to 1.36. When alpha is below 2, the variance of the distribution is formally infinite. This matters because t-statistics, Sharpe ratios, OLS R-squared confidence bands, and anything else that depends on a finite second moment become unreliable.
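For readers who have not met the Hill estimator, the following is a minimal sketch of the textbook version. It is not Convexly's code: the function name `hill_estimate`, the use of absolute PnL magnitudes, and the choice of `k` (how many of the largest observations count as the tail) are all assumptions made here for illustration, and the data are synthetic Pareto draws rather than wallet PnL.

```python
import math
import random

def hill_estimate(values, k):
    """Hill tail-index estimate from the k largest magnitudes.

    Returns (alpha_hat, asymptotic standard error alpha_hat / sqrt(k)).
    Using absolute values and this particular k are assumptions of the sketch.
    """
    magnitudes = sorted((abs(v) for v in values if v != 0), reverse=True)
    if not 0 < k < len(magnitudes):
        raise ValueError("k must satisfy 0 < k < number of nonzero observations")
    threshold = magnitudes[k]  # the (k+1)-th largest magnitude
    # Mean log-excess over the threshold; zero only if the top k values all tie.
    mean_log_excess = sum(math.log(m / threshold) for m in magnitudes[:k]) / k
    alpha_hat = 1.0 / mean_log_excess
    return alpha_hat, alpha_hat / math.sqrt(k)

# Synthetic check: draws from a Pareto law with a known tail index of 1.28.
rng = random.Random(0)
synthetic = [rng.paretovariate(1.28) for _ in range(20_000)]
alpha_hat, std_err = hill_estimate(synthetic, k=1_000)
```

On real data the estimate is sensitive to `k`, so it is usual to inspect it over a range of `k` values rather than trust a single choice.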

Rank-based statistics are different. Spearman rank correlation is computed on empirical ranks, which are bounded between 1 and N by construction. The sampling distribution of a rank statistic is well-behaved even when the underlying variable has infinite theoretical variance. This is why the Convexly methodology paper uses Spearman throughout and reports no Pearson correlation or t-statistic on realized PnL.
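A small deterministic illustration of the point, using hypothetical toy numbers rather than cohort data: inflating one outcome by several orders of magnitude leaves the rank correlation untouched but drags the Pearson correlation far from 1. The helper names below are illustrative only.

```python
def average_ranks(values):
    """Ranks 1..N, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average 1-based position of the tie group
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

scores = list(range(1, 11))           # toy skill scores 1..10
pnl_tame = list(range(1, 11))         # outcomes in the same order
pnl_fat = list(range(1, 10)) + [1e6]  # same order, one enormous winner

spearman_tame = spearman(scores, pnl_tame)
spearman_fat = spearman(scores, pnl_fat)  # ordering unchanged, so rho is unchanged
pearson_fat = pearson(scores, pnl_fat)    # dominated by the single large value
```

The ordering of outcomes is identical in both toy vectors, so the rank statistic cannot tell them apart, while the moment-based statistic is governed almost entirely by one observation.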

How the bootstrap works

The null hypothesis is that Edge Score has zero association with signed log PnL. To test it:

1. Hold the Edge Score vector fixed for all 8,656 wallets.
2. Permute the wallet-PnL labels uniformly at random (so a wallet keeps its Edge Score but is assigned a random wallet's PnL).
3. Recompute the Spearman rank correlation between the Edge Score vector and the permuted PnL vector.
4. Repeat 10,000 times, building the permutation distribution of Spearman values.
5. Compare the observed (unpermuted) Spearman +0.514 against the permutation distribution. Under the null, the observed Spearman should sit inside the distribution. Under the alternative, it sits outside. (This +0.514 is in-sample; the forward test held +0.11 [0.05, 0.18], below the +0.30 bar.)

The observed Spearman of +0.514 sits outside every one of the 10,000 permuted samples. The implied one-sided p-value is less than 1 in 10,000, or p < 0.0001. This rejects the null of zero in-sample, cross-sectional association at the population level. It is not the same as forward skill: the per-wallet temporal holdout we committed to in advance held only an out-of-sample Spearman of +0.11 (95% CI 0.05 to 0.18) with forward PnL and did not clear the +0.30 threshold it was filed against. Read +0.514 as a cross-sectional association on this cohort, not as proof of repeatable forward skill for any individual wallet.
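The procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from the methodology paper: the function names are hypothetical, the toy data are synthetic, and the p-value uses the common (exceedances + 1) / (permutations + 1) convention, which gives 1/10,001 when none of 10,000 permutations reaches the observed value.

```python
import random

def average_ranks(values):
    """Ranks 1..N, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def correlation(xs, ys):
    """Pearson correlation; applied to ranks it equals Spearman's rho."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def permutation_test(scores, pnl, n_perm=10_000, seed=0):
    """One-sided permutation test of zero rank association."""
    rng = random.Random(seed)
    score_ranks = average_ranks(scores)  # held fixed throughout
    pnl_ranks = average_ranks(pnl)       # shuffling ranks equals shuffling labels
    observed = correlation(score_ranks, pnl_ranks)
    shuffled = list(pnl_ranks)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # reassign PnL to wallets at random
        if correlation(score_ranks, shuffled) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value

# Synthetic toy data: 200 hypothetical wallets with a built-in association.
toy_rng = random.Random(1)
toy_scores = [toy_rng.random() for _ in range(200)]
toy_pnl = [s + toy_rng.gauss(0, 0.3) for s in toy_scores]
observed, p_value = permutation_test(toy_scores, toy_pnl, n_perm=500)
```

Because only the pairing between scores and PnL is shuffled, the set of PnL values, fat tail included, is identical in every permuted sample. Pure Python is slow at the scale of 8,656 wallets and 10,000 permutations, so a vectorized numerical library would be the practical choice there.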

What the test does and does not prove

The bootstrap rejects the null of zero population-level association. It does not:

Prove individual wallets are skilled (the V1.5 per-wallet temporal holdout, run 2026-04-27, returned forward Spearman +0.11 [0.05, 0.18], below the pre-set +0.30 bar)
Predict the expected return of the top-scoring wallets (under Hill alpha = 1.28, individual realized PnL is infinite-variance and the composite separates median outcomes, not expected values)
Establish causation from calibration, conviction, or discipline to profit (cross-sectional correlation only)

These are real limitations. The V1.5 temporal-holdout experiments ran on 2026-04-27, and the primary per-wallet forward test missed its pre-set bar: forward Spearman +0.11 (95% CI [0.05, 0.18]) against the +0.30 threshold filed in advance. Treat Edge Score as a descriptive behavioral profile, not a validated skill ranking.

Reference and reproducibility

The bootstrap implementation follows Fama and French (2010), “Luck versus Skill in the Cross-Section of Mutual Fund Returns,” Journal of Finance 65(5). Convexly's adaptation substitutes prediction-market wallets for mutual funds and the Edge Score composite for fund alpha. The full protocol is in §5.6 of the methodology paper, with the permutation-validity defense in §7 (exchangeability holds under the null; the tail structure of the PnL marginal is preserved because the operation permutes labels, not values).
