Methodology comparison · 2026-04-27

Two papers, two tests, one finding

Within 48 hours in April 2026, two independent groups published evidence that skill on Polymarket is concentrated in a small minority of accounts and is statistically distinguishable from luck. The two papers use different cohorts, different statistical tests, and different inference targets, yet reach quantitatively consistent conclusions. This page documents the two methodologies side by side so that academics, journalists, and operators can see exactly where they converge and where they differ.

The two papers

Convexly V1

Edge Score V3b

Frozen-coefficient composite skill measure on 8,656 Polymarket wallets. Published 2026-04-18 at /research/edge-score-methodology-v1. Public data bundle at /research/v1m/v1m-data-bundle.tar.gz.

Gomez-Cram et al.

Prediction Market Accuracy: Crowd Wisdom or Informed Minority?

Roberto Gomez-Cram, Yunhan Guo, Howard Kung (London Business School), Theis Ingerslev Jensen (Yale). Posted to SSRN 2026-04-20, revised 2026-04-25. Paper ID 6617059, available at papers.ssrn.com.

Side-by-side methodology

Question asked
  Convexly V1: Among ranked Polymarket wallets, can a frozen-coefficient composite of three pillars (posture, conviction, discipline) predict out-of-fold signed log PnL?
  Gomez-Cram et al.: What share of all Polymarket accounts exhibit realized order flow that is informative beyond chance for both short-term price moves and final outcomes?

Sample size
  Convexly V1: 8,656 Polymarket wallets with at least five resolved positions; position tape snapshot 2026-04-25 20:18 UTC.
  Gomez-Cram et al.: 1,720,000 Polymarket accounts across 98,906 events and 210,322 markets, $13.76B total trading volume, 2023 through 2025 (per CoinDesk via cryptonews.net).

Methodology approach
  Convexly V1: Frozen-coefficient OLS composite. Three z-standardized pillars combined with coefficients committed to a version-controlled repository before validation: posture +0.7876, conviction +2.7220, discipline -1.1508.
  Gomez-Cram et al.: Per-trader sign-randomization. Each account's full trade history is rerun 10,000 times with the buy/sell direction flipped at random per trade. Realized PnL is compared to the resulting null distribution.

Statistical null
  Convexly V1: Fama-French (2010) bootstrap null with 10,000 PnL permutations. Null hypothesis: no association between the composite and forward PnL.
  Gomez-Cram et al.: Per-account sign-randomization null with 10,000 reruns. Null hypothesis: the account's realized PnL is consistent with random buy/sell direction assignment.

Headline finding
  Convexly V1: Out-of-fold Spearman rank correlation between Edge Score V3b and signed log PnL is +0.514, vs +0.147 for a Brier-only baseline; bootstrap null p < 0.0001. The top 1% of ranked wallets captures 36.2% of signed profit.
  Gomez-Cram et al.: 3.14% of accounts classified as "skilled winners" with persistent profits and order flow that predicts both next-period prices and final outcomes. A 1pp increase in skilled-trader net buying corresponds to an ~8bp increase in the probability of a correct outcome.

Validation approach
  Convexly V1: Out-of-fold cross-validation across 8,656 wallets with frozen coefficients (no in-sample tuning after the freeze). Subgroup stability: Spearman range +0.47 to +0.73 across six cuts. Cross-venue replication on 15,106 Manifold users at V1-M.
  Gomez-Cram et al.: Out-of-sample persistence test. 44% of accounts classified as skilled in the training sample remain skilled in the held-out test sample, compared to ~10% in a parallel test on active mutual funds (per The Block summary).

Inference target
  Convexly V1: Wallet-level: can a multi-pillar score rank these 8,656 wallets in a way that predicts forward PnL? Useful for ranking, watchlist construction, and smart-money tracking.
  Gomez-Cram et al.: Population-level: what fraction of all Polymarket accounts are above-chance? Useful for population prevalence estimation and the price-discovery mechanism question.

What it cannot claim
  Convexly V1: Edge Score V1 is a PnL-ranking tool, not a forecast-aggregation weight. It does not claim to identify which specific markets a wallet will be right about; the unit of inference is the wallet, not the position. Coefficients are Polymarket-specific (V1-M shows the discipline coefficient flips sign on Manifold).
  Gomez-Cram et al.: Sign-randomization is a cross-sectional skill-vs-chance attribution. It does not provide a ranking of skilled accounts within the skilled minority, and it does not claim that skilled accounts are skilled in any specific market or horizon.

Reproducibility
  Convexly V1: Public position-tape data bundle at /research/v1m/v1m-data-bundle.tar.gz. The frozen-coefficient commit hash is the reference. V1.5 deferred experiments E2 + E7 (per-wallet temporal holdout, IC stability) ran 2026-04-27 per AsPredicted #287368; both pre-registered primary tests failed their ex-ante thresholds. Full result in the V1.5 paper.
  Gomez-Cram et al.: SSRN paper at ID 6617059. Public data bundle status not directly verified at time of writing; Polymarket trade-tape access is required to replicate the underlying cohort construction.
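The V1-style test above can be sketched in miniature: combine z-scored pillars with the frozen coefficients quoted in the table, then compare the observed rank correlation with forward PnL to a permutation null. The coefficients are the published values; everything else (pillar data, PnL, sample sizes) is synthetic and illustrative, not the real position tape.

```python
import numpy as np

# Frozen coefficients quoted in the methodology table above.
COEF = {"posture": 0.7876, "conviction": 2.7220, "discipline": -1.1508}

def rank(x):
    """Simple ranks for continuous data (no tie handling needed here)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def spearman(a, b):
    """Spearman rho = Pearson correlation of the ranks."""
    return np.corrcoef(rank(a), rank(b))[0, 1]

rng = np.random.default_rng(0)
n = 2_000        # illustrative; the real cohort is 8,656 wallets
n_perm = 2_000   # illustrative; the real null uses 10,000 permutations

# Synthetic z-standardized pillars and the composite score.
pillars = {k: rng.standard_normal(n) for k in COEF}
edge_score = sum(c * pillars[k] for k, c in COEF.items())

# Synthetic forward signed log PnL, weakly tied to the score so the
# permutation test has something to detect.
pnl = 0.3 * edge_score + rng.standard_normal(n)

rho_obs = spearman(edge_score, pnl)

# Null: shuffle PnL across wallets, destroying any score-PnL link.
null = np.array([spearman(edge_score, rng.permutation(pnl))
                 for _ in range(n_perm)])
p = (1 + np.sum(null >= rho_obs)) / (1 + n_perm)
print(f"rho_obs={rho_obs:+.3f}  permutation p={p:.4f}")
```

The one-sided p-value counts how often a random wallet-PnL pairing matches or beats the observed rank correlation; the real test additionally scores out-of-fold, which this sketch omits for brevity.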

Where the two papers converge

  • Both papers reject the strong-form "wisdom of the crowd" framing. Aggregation works, but it works because of an informed minority, not the average participant.
  • Both papers find that PnL-based skill on Polymarket is statistically distinguishable from luck under a bootstrap-style null with N = 10,000 permutations.
  • Both papers find that skill is concentrated in a small minority. Gomez-Cram et al. report 3.14% of accounts as skilled winners; Convexly V1 reports the top 1% of ranked wallets capturing 36.2% of signed profit. The two numbers measure different things, but the direction (skill exists, skill is rare) is the same.
  • Both papers were posted within 48 hours of each other. Convexly V1 went live 2026-04-18; Gomez-Cram et al. posted to SSRN 2026-04-20 (revised 2026-04-25). Two independent groups arriving at the same directional finding via different methodologies in the same week is convergent evidence.

Where the two papers differ

The two methodologies are complementary rather than competitive. One is not a falsification test of the other. The papers answer different questions on different cohorts with different statistical tests.

  • Cohort scope. Gomez-Cram et al. analyze 1.72M Polymarket accounts; Convexly V1 analyzes a curated universe of 8,656 wallets with at least five resolved positions. Different denominators produce different prevalence estimates by construction.
  • Statistical test. Sign-randomization shuffles the trade direction within each account to test whether realized PnL is consistent with random direction assignment. The Fama-French bootstrap shuffles PnL across wallets to test whether the composite-PnL association is consistent with no association. Different nulls rule out different things.
  • Inference target. Gomez-Cram et al. answer "what share of accounts are above-chance?" Convexly V1 answers "given a curated set of wallets, can a multi-pillar score rank them by forward PnL?" The first question is about population prevalence; the second is about ranking-tool validity.
  • V1 cannot claim to be a replication of Gomez-Cram et al., and vice versa. Replication requires the same cohort and the same test. The two papers are independent samples that happen to land on the same side of the question, which is convergent evidence, not replication.
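The per-account sign-randomization null described above can be sketched for a single account: rerun the trade history with each trade's direction flipped at random, and ask where the realized PnL falls in the resulting distribution. All trade data here is synthetic (an account that is right on 65% of trades), chosen only to make the test's structure visible.

```python
import numpy as np

rng = np.random.default_rng(1)

n_trades = 1_000
# Per-trade payoff magnitude if the direction is right (synthetic).
payoff = rng.exponential(1.0, n_trades)
# A skilled synthetic account: correct direction on 65% of trades.
direction_right = rng.random(n_trades) < 0.65
realized_pnl = np.where(direction_right, payoff, -payoff).sum()

# Null: flip each trade's direction with probability 1/2, 10,000 times,
# keeping the payoff magnitudes fixed.
n_sims = 10_000
signs = rng.choice([-1.0, 1.0], size=(n_sims, n_trades))
null_pnl = (signs * payoff).sum(axis=1)

# One-sided p-value: how often luck alone matches the realized PnL.
p = (1 + np.sum(null_pnl >= realized_pnl)) / (1 + n_sims)
print(f"realized PnL={realized_pnl:+.1f}  sign-randomization p={p:.4f}")
```

Run per account across the full cohort, this is what classifies an account as above-chance or not; note that it attributes skill versus luck but, as the list above says, does not rank accounts within the skilled minority.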

What the convergence enables

Convergent evidence from two independent groups using different methodology is a stronger empirical claim than either paper alone. The standard methodology objection against any single paper is "your null might not be the right null" or "your cohort might be non-representative." That objection becomes weaker when a different cohort under a different null reaches the same directional conclusion.

For academics, the convergence sharpens the next-step question. If skill is concentrated and persistent, the follow-up is mechanism: how does information enter prices, what informational advantage do skilled accounts hold, and does the skill rate change across event types or market liquidity tiers. Both papers leave room for this work.

For operators, the convergence has a practical implication. PnL leaderboards are not skill rankings; they are luck rankings dominated by tail-event winners drawn from a heavy-tailed profit distribution (Hill tail index alpha = 1.28). A composite-score ranking on the same cohort orders wallets differently. The Truth Leaderboard shows the two rankings side by side; the disagreement is the finding.
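The Hill tail index behind that alpha = 1.28 figure can be sketched as follows. Synthetic Pareto-tailed "profits" stand in for the real leaderboard; the point is the estimator itself, which fits the tail exponent to the k largest observations.

```python
import numpy as np

def hill_alpha(x, k):
    """Hill estimator of the tail index on the k largest values of x (x > 0)."""
    tail = np.sort(x)[-(k + 1):]   # k+1 largest; tail[0] is the threshold
    return k / np.sum(np.log(tail[1:] / tail[0]))

rng = np.random.default_rng(2)
alpha_true = 1.28
# rng.pareto draws Lomax variates; +1 gives classical Pareto(alpha, x_min=1).
profits = rng.pareto(alpha_true, 50_000) + 1.0

alpha_hat = hill_alpha(profits, k=2_000)
print(f"estimated tail index: {alpha_hat:.2f}")
```

An index below 2 means the distribution has infinite variance, so a handful of tail-event winners dominate any raw-PnL ranking; this is why a PnL leaderboard and a skill ranking can legitimately disagree.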

Citation note

When citing the convergent evidence externally, the defensible framing is "two independent samples, two different statistical tests, both find concentration of skill on Polymarket within 48 hours of each other."

Avoid "V1 replicates Gomez-Cram et al." or "Gomez-Cram et al. confirm V1." The methods differ; one is not a falsification test of the other.