p-values · research methodology · statistics

P-Hacking: The Pandemic Nobody Talks About

/ 3 min read / C. Pearson

In 2011, a respected psychology journal published a paper claiming to demonstrate precognition — the ability to perceive future events. The methodology looked textbook: standard experimental designs, standard statistical tests, significant p-values.

The paper was, almost certainly, wrong. Not because of fraud, but because of something more insidious: the same methodological flexibility that produced "evidence" of psychic powers is producing "evidence" across every field of science, every day.


What P-Hacking Actually Is

P-hacking is the practice of manipulating data analysis until you get a statistically significant result. It sounds fraudulent. Usually it isn't — it's just researchers making reasonable-seeming decisions that happen to inflate false positive rates:

  • Collecting data until p < 0.05, then stopping
  • Testing multiple outcomes but only reporting the significant one
  • Removing "outliers" that happen to weaken the effect
  • Trying multiple statistical tests and reporting whichever works
  • Splitting data into subgroups until one shows significance
  • Adding or removing covariates until the model cooperates

Each of these decisions might be defensible in isolation. Together, they compound to produce results that look rigorous but describe noise.
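The first item on the list is easy to demonstrate. Here is a minimal simulation sketch — assuming a simple known-variance z-test on pure noise, and a hypothetical rule of peeking after every new observation — showing how "collect until significant" inflates the false positive rate:

```python
# Sketch: "collect data until p < 0.05, then stop" on pure noise.
# The null is true by construction, so an honest test should reject ~5% of the time.
import math
import random

random.seed(1)

def p_value(xs):
    # Two-sided one-sample z-test of mean 0, assuming known sd = 1.
    n = len(xs)
    z = sum(xs) / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def optional_stopping(start_n=10, max_n=100):
    xs = [random.gauss(0, 1) for _ in range(start_n)]
    while len(xs) < max_n:
        if p_value(xs) < 0.05:   # peek after every observation
            return True          # "significant" -- stop and write it up
        xs.append(random.gauss(0, 1))
    return p_value(xs) < 0.05    # last chance at the sample-size cap

runs = 5000
hits = sum(optional_stopping() for _ in range(runs))
print(f"False positive rate with peeking: {hits / runs:.1%}")
```

With a fixed, pre-committed sample size the same test rejects about 5% of the time; letting the data decide when to stop pushes the rate well above the nominal level.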

The Math Is Unforgiving

At the standard α = 0.05 threshold, you have a 5% chance of a false positive on any single test. Run 20 tests? You expect one false positive. That's not a bug — it's literally what 5% means.
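The arithmetic is worth spelling out. One expected false positive across 20 tests also means (assuming the tests are independent) roughly a 64% chance of at least one:

```python
# "Run 20 tests, expect one false positive" at alpha = 0.05,
# assuming the 20 tests are independent and every null is true.
alpha, tests = 0.05, 20

expected_false_positives = alpha * tests           # 0.05 * 20 = 1.0
p_at_least_one = 1 - (1 - alpha) ** tests          # 1 - 0.95^20 ~= 0.64

print(expected_false_positives)        # 1.0
print(round(p_at_least_one, 2))        # 0.64
```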

But the problem is worse than simple multiple comparisons. Simmons, Nelson, and Simonsohn showed in their landmark 2011 paper that the "researcher degrees of freedom" available in a typical study — choices about sample size, variables, outlier criteria, analytical approaches — can inflate the false positive rate to over 60% while maintaining the appearance of a single, pre-specified analysis.

Sixty percent. With a threshold that's supposed to guarantee 5%.
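Simmons et al.'s simulations combined four specific degrees of freedom; the sketch below is a much simpler, illustrative stand-in, not their method. It assumes just one flexibility — two correlated outcome measures plus their average, with the researcher reporting whichever test clears the bar — and still shows the nominal 5% rate creeping upward:

```python
# Sketch of one "researcher degree of freedom": measure two correlated
# outcomes (both null) and report whichever analysis comes out significant.
import math
import random

random.seed(2)

def p_value(xs, sd=1.0):
    # Two-sided one-sample z-test of mean 0, assuming the sd is known.
    n = len(xs)
    z = sum(xs) / (math.sqrt(n) * sd)
    return math.erfc(abs(z) / math.sqrt(2))

def flexible_study(n=40, rho=0.5):
    # Two outcomes sharing a common component, each with variance 1.
    shared = [random.gauss(0, 1) for _ in range(n)]
    a = [math.sqrt(rho) * s + math.sqrt(1 - rho) * random.gauss(0, 1) for s in shared]
    b = [math.sqrt(rho) * s + math.sqrt(1 - rho) * random.gauss(0, 1) for s in shared]
    avg = [(x + y) / 2 for x, y in zip(a, b)]
    # Var(avg) = (1 + rho) / 2, so pass the correct sd for the third test.
    sd_avg = math.sqrt((1 + rho) / 2)
    # Three shots at significance instead of one pre-specified analysis:
    return min(p_value(a), p_value(b), p_value(avg, sd=sd_avg)) < 0.05

runs = 5000
hits = sum(flexible_study() for _ in range(runs))
print(f"False positive rate with three looks: {hits / runs:.1%}")
```

Stacking more such choices — outlier rules, covariates, subgroups, stopping rules — is how the rate climbs toward the 60% figure.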

Why It Persists

Incentive structures. Careers are built on publications. Publications require significant results. Null results are nearly unpublishable.

A researcher who runs a study, finds nothing, and publishes "we found no effect" gets... nothing. No publication, no citation, no tenure points. The same researcher who runs the same study, massages the analysis until something emerges, and publishes "we found an effect (p = 0.048)" gets a publication, citations, and career advancement.

This isn't a story about bad people. It's a story about a system that rewards exactly the behavior that undermines the scientific method.

What Honest Analysis Looks Like

Pre-registration: Commit to your analysis plan before looking at the data. Publish it publicly. No post-hoc analytical choices.

Report everything: Every outcome you measured, every analysis you ran, every subgroup you examined. Not just the one that worked.

Effect sizes over p-values: A tiny effect with p = 0.001 in a massive sample is less interesting than a large effect with p = 0.08 in a small sample. P-values conflate effect size and sample size. Separate them.
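The conflation is mechanical: the z-statistic scales with the square root of the sample size, so a trivial effect in a huge sample beats a large effect in a small one on p-value alone. A sketch with illustrative numbers (one-sample z-test, known sd = 1):

```python
# p-values conflate effect size and sample size: z = effect * sqrt(n).
import math

def p_value(effect, n):
    # Two-sided p for a one-sample z-test, assuming known sd = 1.
    z = effect * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

tiny = p_value(0.02, 1_000_000)   # negligible effect, enormous sample
large = p_value(0.50, 12)         # large effect, small sample

print(tiny)              # vanishingly small p
print(round(large, 2))   # ~0.08 -- "not significant"
```

The second case is exactly the p = 0.08 scenario above: a result worth pursuing that the 0.05 filter throws away, while the first sails through despite being practically meaningless.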

Replicate: A single study, no matter how clean, is a single observation. The replication crisis exists because too many findings were one-off results that nobody tried to reproduce.

The uncomfortable truth is that a huge fraction of published findings — some estimates suggest over 50% — are false positives. Not because science is broken, but because the incentive structure rewards the exact behavior that produces false positives.

Fix the incentives, fix the science. Until then, read every "significant result" with healthy skepticism.

p < 0.05 isn't a truth certificate. It's a filter with a 5% leak rate — at best.
