What is a negative control?

Run the same skill test on inputs that should show nothing. If random, size-matched wallet cohorts pass the test at the same rate as the cohort under review, the finding was never a finding.

The answer first

A negative control is the placebo arm of a measurement. You run the identical test on inputs where the effect should be absent, and whatever the pipeline reports there is your chance baseline. A drug trial gives sugar pills to a control group; a lab assay runs the reagents with no sample; a wallet-cohort audit runs the identical realized-edge test, with the identical FDR correction, on random cohorts of the same size as the cohort under review. A real finding has to beat that baseline, not just beat zero.

The reason this matters for wallet cohorts specifically: cohorts are almost always selected after outcomes are known. "Audit these 20 winners" is an outcome-conditioned sample, and outcome-conditioning inflates the apparent skilled-rate all by itself. The negative control is the anchor that separates "this cohort is unusual" from "any random handful of wallets looks like this".

How the baseline is built

In Convexly's enterprise cohort audits, the control is constructed like this:

  • Draw 500 random cohorts, each exactly the size of the cohort under review, from a pool of scoreable wallets, with the reviewed wallets excluded. The PRNG seed is recorded, so the draws are reproducible.
  • Run each draw through the identical pipeline: realized entry edge with a 95 percent BCa (bias-corrected and accelerated) bootstrap interval, the concentration screen, and the Benjamini-Hochberg correction.
  • Report the mean skilled-rate across the 500 draws (the chance baseline) and an empirical p: the fraction of random draws whose FDR survivor count is at least the reviewed cohort's.
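
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not Convexly's implementation: it assumes the per-wallet pipeline has already been reduced to one one-sided p-value per wallet, that "skilled-rate" means FDR survivors divided by cohort size, and that the pool holds at least as many wallets as the cohort. The names `bh_survivor_count`, `negative_control`, `pool_p`, `reviewed_p`, and the default seed are hypothetical.

```python
import random
from typing import Dict, Sequence


def bh_survivor_count(p_values: Sequence[float], q: float = 0.10) -> int:
    """Count hypotheses rejected by the Benjamini-Hochberg step-up rule at level q."""
    m = len(p_values)
    survivors = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= q * rank / m:
            survivors = rank  # keep the largest rank that passes its threshold
    return survivors


def negative_control(
    pool_p: Dict[str, float],      # hypothetical: scoreable wallet id -> p-value
    reviewed_p: Dict[str, float],  # hypothetical: reviewed wallet id -> p-value
    n_draws: int = 500,
    q: float = 0.10,
    seed: int = 12345,             # hypothetical; recorded so draws can be replayed
) -> dict:
    rng = random.Random(seed)
    size = len(reviewed_p)
    # Exclude the reviewed wallets; sort so the draw order depends only on the seed.
    candidates = sorted(w for w in pool_p if w not in reviewed_p)
    observed = bh_survivor_count(list(reviewed_p.values()), q)

    rates = []
    at_least_observed = 0
    for _ in range(n_draws):
        draw = rng.sample(candidates, size)  # one size-matched random cohort
        survivors = bh_survivor_count([pool_p[w] for w in draw], q)
        rates.append(survivors / size)
        at_least_observed += survivors >= observed

    return {
        "seed": seed,
        "observed_survivors": observed,
        "baseline_skilled_rate": sum(rates) / n_draws,  # chance baseline
        "empirical_p": at_least_observed / n_draws,     # share of draws >= observed
    }
```

Calling `negative_control(pool_p, reviewed_p)` twice with the same seed returns the same dictionary, which is the point of recording the seed alongside the result.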

One disclosure travels with the number. The random pool is restricted to data-rich, scoreable wallets (at least 30 resolved positions with a usable interval), because a wallet with too few resolved positions cannot be put through the same test. The baseline is therefore a baseline over scoreable wallets, not the full address space, and that restriction is conservative: it makes the random draws clear the test at least as often as a full-address-space draw would, so the reviewed cohort's separation gets harder to claim, never easier.
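
The pool restriction amounts to a filter applied before any cohort is drawn. The sketch below assumes a hypothetical `WalletStats` record with a resolved-position count and an optional interval; only the threshold of 30 comes from the text, and the field and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

MIN_RESOLVED = 30  # minimum resolved positions, as stated in the text


@dataclass(frozen=True)
class WalletStats:  # hypothetical record shape
    wallet_id: str
    resolved_positions: int
    edge_interval: Optional[Tuple[float, float]]  # None when no usable interval


def is_scoreable(w: WalletStats) -> bool:
    """A wallet enters the pool only if it can go through the same test."""
    return w.resolved_positions >= MIN_RESOLVED and w.edge_interval is not None


def scoreable_pool(wallets: Iterable[WalletStats], reviewed_ids: Iterable[str]) -> List[WalletStats]:
    """Scoreable wallets with the cohort under review excluded."""
    excluded = set(reviewed_ids)
    return [w for w in wallets if is_scoreable(w) and w.wallet_id not in excluded]
```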

Worked example: the chance arithmetic on our own cohort

The simplest negative-control logic needs no simulation at all. In the frozen 2026-06-09 scan of our own published top-50 cohort, 35 wallets were testable at a 2.5 percent one-sided threshold, so under a null of zero skill the expected number of uncorrected positives is 35 × 0.025 = 0.875, call it about 0.9. Observed: exactly 1 of 35, which is what noise predicts, and it did not survive the correction at q = 0.10. The cohort's result matches its own chance baseline, and that is exactly what we published at /research/top50-skill-scan. The 500-draw control generalizes the same question to cohorts where the answer is less clean.
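
The same arithmetic can be checked in a few lines. The expected count follows directly from the figures in the text; the binomial probabilities additionally assume the 35 tests are independent and that each has exactly a 2.5 percent false-positive rate under the null, which is a simplification made here for illustration.

```python
from math import comb

n_testable = 35  # testable wallets in the scan (from the text)
alpha = 0.025    # one-sided per-wallet threshold (from the text)

# Expected uncorrected positives under a null of zero skill.
expected_positives = n_testable * alpha


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k positives in n independent tests."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumption: independent tests, each with false-positive rate alpha.
p_exactly_one = binom_pmf(1, n_testable, alpha)
p_at_least_one = 1 - binom_pmf(0, n_testable, alpha)

print(f"expected positives: {expected_positives:.3f}")
print(f"P(exactly 1 positive): {p_exactly_one:.3f}")
print(f"P(at least 1 positive): {p_at_least_one:.3f}")
```

Under those assumptions, at least one uncorrected positive turns up in roughly 59 percent of pure-noise cohorts of this size, so a single positive is no evidence of skill.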

The honest-null rule

When the control cannot be computed (no random pool was supplied to a run, or the pool is smaller than the cohort), it is reported as null with the reason stated, never fabricated. A synthesized baseline would quietly poison every number that leans on it, and a reader who cannot see which runs had a real control cannot trust any of them. The same rule governs the rest of the pipeline: an inconclusive result is a publishable result.
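
One way to make the rule hard to violate is to give the report a field for the reason and to check the preconditions before any computation runs. The sketch below is an assumption about how such a guard could look, not a description of the production code; `ControlReport`, `control_precheck`, and `report_control` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ControlReport:  # hypothetical report shape
    baseline_skilled_rate: Optional[float]
    empirical_p: Optional[float]
    null_reason: Optional[str]  # set only when the control could not be computed


def control_precheck(pool_size: Optional[int], cohort_size: int) -> Optional[str]:
    """Return the reason the control cannot be computed, or None if it can."""
    if pool_size is None:
        return "no random pool was supplied to the run"
    if pool_size < cohort_size:
        return f"random pool ({pool_size}) is smaller than the cohort ({cohort_size})"
    return None


def report_control(
    pool_size: Optional[int],
    cohort_size: int,
    compute: Callable[[], Tuple[float, float]],  # returns (baseline rate, empirical p)
) -> ControlReport:
    reason = control_precheck(pool_size, cohort_size)
    if reason is not None:
        # Honest null: no numbers are produced, and the reason travels with the report.
        return ControlReport(None, None, reason)
    rate, p = compute()
    return ControlReport(rate, p, None)
```

Because the numeric fields are `None` rather than a placeholder such as 0.0, downstream code cannot mistake a missing control for a computed one.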

Where it is used

The size-matched negative control is a standing component of the enterprise cohort audit, where it anchors the skilled-rate and survivor count of every client-supplied cohort against chance. The underlying per-wallet statistic and the correction it pairs with are covered in the realized-edge and false-discovery-rate explainers.

Frequently asked

What is a negative control in statistics?
A test run on inputs where the effect being measured should be absent. If the measurement pipeline reports an effect on those inputs anyway, the pipeline's positives cannot be trusted. In wallet-skill work the negative control is a set of random, size-matched wallet cohorts run through the identical test as the cohort under review.
Why does a wallet cohort audit need one?
Because cohorts are usually selected after outcomes are known: wallets that already won, wallets a client already follows. Selecting on outcomes inflates the apparent skilled-rate. The negative control anchors the reading: if 500 random size-matched cohorts produce a survivor count similar to that of the cohort under review, the review found selection, not skill.
How does Convexly compute the negative control?
In enterprise cohort work: draw 500 random cohorts of the same size as the cohort under review (reviewed wallets excluded) from a pool of scoreable wallets, run the identical realized-edge test with the identical FDR correction on each draw, and report the mean skilled-rate across draws plus an empirical p, the fraction of random draws whose FDR survivor count is at least the reviewed cohort's. The PRNG seed is recorded so the draws are reproducible.
What is the pool restriction, and why is it disclosed?
The random pool contains only data-rich, scoreable wallets (at least 30 resolved positions with a usable bootstrap interval), because a wallet with too few resolved positions cannot be put through the same test. The chance baseline is therefore a baseline over scoreable wallets, not the full address space. That restriction biases the baseline conservatively: the random draws clear the test at least as often as a draw from the full address space would, making the reviewed cohort's separation harder to claim, never easier.
What happens when the control cannot be computed?
It is reported as null with the reason (for example, no random pool was supplied to the run), never fabricated. An honest null is a valid result; a synthesized baseline would poison every number that leans on it.

Related reading

  • Research: Negative results
  • Learn: Brier score
  • Learn: Calibration
  • Learn: Concentration flag