Skill vs Luck in Trading PnL

The standard protocol for distinguishing skill from luck in trading PnL is the Fama-French (2010) bootstrap null model, applied here as a label permutation: permute PnL labels across traders and compare the observed skill statistic to the permutation distribution. If the observed statistic lies above every permuted value, the null of zero skill is rejected. Convexly applied this protocol to 8,656 Polymarket wallets with 10,000 PnL permutations. The observed out-of-fold Spearman correlation of +0.514 lies above every permuted sample, giving a one-sided p < 0.0001.

Why parametric tests fail here

Prediction-market PnL is fat-tailed. The Hill estimator on 8,656 Polymarket wallets returns a tail index of alpha = 1.28 with a 95% confidence interval of 1.20 to 1.36. With alpha below 2, the variance of the distribution is formally infinite. This matters because t-statistics, Sharpe ratios, OLS R-squared confidence bands, and anything else that depends on a finite second moment become unreliable.
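As a rough illustration of how a tail index like this can be estimated, here is a minimal Hill estimator sketch in plain Python. Several things in it are assumptions rather than details from the methodology paper: the function name hill_tail_index is hypothetical, the estimator is applied to absolute values, the number of tail observations k is picked by hand, and the interval uses the textbook asymptotic approximation alpha ± 1.96 · alpha / sqrt(k). The data are synthetic Pareto draws, not Polymarket PnL.

```python
import math
import random


def hill_tail_index(values, k):
    """Hill estimate of the tail index alpha from the k largest |values|.

    Returns (alpha, (lo, hi)); the interval is the asymptotic normal
    approximation alpha +/- 1.96 * alpha / sqrt(k) (an assumption here).
    """
    magnitudes = sorted((abs(v) for v in values), reverse=True)
    if not 0 < k < len(magnitudes):
        raise ValueError("k must satisfy 0 < k < len(values)")
    threshold = magnitudes[k]  # the (k+1)-th largest magnitude
    if threshold <= 0:
        raise ValueError("threshold must be positive; choose a smaller k")
    # Mean log-excess of the top k observations over the threshold.
    mean_log_excess = sum(math.log(m / threshold) for m in magnitudes[:k]) / k
    if mean_log_excess == 0:
        raise ValueError("top-k values are all tied with the threshold")
    alpha = 1.0 / mean_log_excess
    half_width = 1.96 * alpha / math.sqrt(k)
    return alpha, (alpha - half_width, alpha + half_width)


# Synthetic check: Pareto draws with a known tail index of 1.28.
rng = random.Random(0)
sample = [rng.paretovariate(1.28) for _ in range(8656)]
alpha_hat, (ci_lo, ci_hi) = hill_tail_index(sample, k=1000)
print(f"alpha = {alpha_hat:.2f}, 95% CI = ({ci_lo:.2f}, {ci_hi:.2f})")
```

The estimate is sensitive to k: too few tail points gives a noisy alpha, too many pulls in observations from the body of the distribution. In practice the choice is usually checked by plotting alpha against k and looking for a stable region.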

Rank-based statistics are different. Spearman rank correlation is computed on empirical ranks, which are bounded between 1 and N by construction. The sampling distribution of a rank statistic is well-behaved even when the underlying variable has infinite theoretical variance. This is why the Convexly methodology paper uses Spearman throughout and reports no Pearson correlation or t-statistic on realized PnL.
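The boundedness argument can be made explicit with the standard definition of Spearman's coefficient as a Pearson correlation computed on ranks. The notation below (R for ranks, d_i for rank differences) is introduced here for illustration and is not taken from the methodology paper.

```latex
% R(x_i): rank of wallet i's Edge Score; R(y_i): rank of its PnL (1..N, ties averaged)
\rho_S = \frac{\sum_{i=1}^{N} \bigl(R(x_i) - \bar{R}\bigr)\bigl(R(y_i) - \bar{R}\bigr)}
              {\sqrt{\sum_{i=1}^{N} \bigl(R(x_i) - \bar{R}\bigr)^2 \,
                     \sum_{i=1}^{N} \bigl(R(y_i) - \bar{R}\bigr)^2}},
\qquad \bar{R} = \frac{N+1}{2}

% With no ties, each rank vector is a permutation of 1..N, so its variance is fixed:
\frac{1}{N}\sum_{i=1}^{N} \bigl(R(x_i) - \bar{R}\bigr)^2 = \frac{N^2 - 1}{12}
\quad\Longrightarrow\quad
\rho_S = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)},
\qquad d_i = R(x_i) - R(y_i)
```

Every quantity entering the statistic depends only on N and on the ordering of the observations, never on how large the largest PnL value is. A single extreme wallet can move its own rank by at most N - 1 positions, however heavy the tail.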

How the bootstrap works

The null hypothesis is that Edge Score has zero association with signed log PnL. To test it:

  1. Hold the Edge Score vector fixed for all 8,656 wallets
  2. Permute the wallet-PnL labels uniformly at random (so a wallet keeps its Edge Score but is assigned a random wallet's PnL)
  3. Recompute the Spearman rank correlation between the Edge Score vector and the permuted PnL vector
  4. Repeat 10,000 times, building the permutation distribution of Spearman values
  5. Compare the observed (unpermuted) Spearman +0.514 against the permutation distribution. Under the null, the observed Spearman should sit inside the distribution. Under the alternative, it sits outside.
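The five steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not Convexly's implementation: the function names are hypothetical, the data are synthetic, and the p-value uses the common add-one convention (exceedances + 1) / (permutations + 1), which the source does not specify. Ranking the PnL once and shuffling the ranks is equivalent to shuffling the PnL and re-ranking, because a permutation does not change the set of values.

```python
import random


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def rank_correlation(rank_x, rank_y):
    """Pearson correlation of two rank vectors, i.e. Spearman's rho."""
    n = len(rank_x)
    mean_x = sum(rank_x) / n
    mean_y = sum(rank_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    return cov / (var_x * var_y) ** 0.5


def permutation_test(scores, pnl, n_permutations=10_000, seed=0):
    """One-sided permutation test of Spearman's rho; returns (rho, p_value)."""
    rng = random.Random(seed)
    rank_scores = average_ranks(scores)      # step 1: scores stay fixed
    shuffled = average_ranks(pnl)
    observed = rank_correlation(rank_scores, shuffled)
    exceedances = 0
    for _ in range(n_permutations):          # step 4: repeat
        rng.shuffle(shuffled)                # step 2: permute PnL labels
        if rank_correlation(rank_scores, shuffled) >= observed:  # steps 3 and 5
            exceedances += 1
    # Add-one convention (an assumption): the p-value can never be exactly zero.
    p_value = (exceedances + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic example: scores that genuinely carry signal about PnL.
rng = random.Random(1)
scores = [rng.gauss(0, 1) for _ in range(200)]
pnl = [s + rng.gauss(0, 1) for s in scores]
observed, p_value = permutation_test(scores, pnl, n_permutations=1000)
print(f"observed rho = {observed:.3f}, one-sided p = {p_value:.4f}")
```

With zero exceedances, the smallest p-value this test can report is 1 / (n_permutations + 1), so the number of permutations sets the resolution of the result. At the scale described in the text (8,656 wallets, 10,000 permutations) a pure-Python loop like this would be slow, and a vectorized version would be the more practical choice.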

The observed Spearman of +0.514 sits outside every one of the 10,000 permuted samples. The implied one-sided p-value is less than 1 in 10,000, or p < 0.0001. Edge Score's association with realized PnL is not luck at the population level.

What the test does and does not prove

The bootstrap rejects the null of zero population-level association. It does not:

  • Prove individual wallets are skilled (that needs per-wallet temporal holdout, pre-registered for V1.5)
  • Predict the expected return of the top-scoring wallets (under Hill alpha = 1.28, individual realized PnL is infinite-variance and the composite separates median outcomes, not expected values)
  • Establish causation from calibration, conviction, or discipline to profit (cross-sectional correlation only)

These are real limitations. V1.5 ships 2026-05-08 with a per-wallet temporal holdout experiment and a gaming-injected control cohort to address the individual-wallet and causation questions in turn.

Reference and reproducibility

The bootstrap implementation follows Fama and French (2010), “Luck versus Skill in the Cross-Section of Mutual Fund Returns,” Journal of Finance 65(5). Convexly's adaptation substitutes prediction-market wallets for mutual funds and the Edge Score composite for fund alpha. The full protocol lives in §5.6 of the methodology paper, with the permutation-validity defense in §7 (exchangeability holds under the null; the tail structure of the PnL marginal is preserved because the operation permutes labels, not values).

Read the full protocol

13-page methodology paper with the Fama-French bootstrap in §5.6, Hill estimator in §5.5, and permutation-validity defense in §7.
