statistics, ab-testing, multiple-comparisons, false-discovery-rate

Multiple Comparisons Problem: Why Your A/B Tests Are Lying to You

C. Pearson
4 min read


Photo by RDNE Stock project on Pexels.

You run 20 A/B tests at α = 0.05. Three show "statistically significant" results. Your marketing team celebrates. Your product team ships features. Everyone's happy.

Except you just got played by probability itself.

The Dirty Math Behind False Discoveries

When you set α = 0.05, you're accepting a 5% chance of a false positive — finding significance where none exists. That's fine for a single test. But probability doesn't care about your testing schedule.

With 20 independent tests at α = 0.05, the probability of getting at least one false positive isn't 5%. It's 64%.

Here's the math: P(at least one false positive) = 1 - (0.95)^20 = 0.64

Those three "significant" results? Statistically, you should expect about one false positive from random noise alone (20 × 0.05 = 1), and two or three is well within the range of bad luck. Your celebration might be premature.
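
A quick sanity check of those numbers (a minimal sketch in Python; the test count and α are just the figures from this example):

    import math

    alpha, m = 0.05, 20

    # Family-wise error rate: chance of at least one false positive in m tests.
    fwer = 1 - (1 - alpha) ** m                       # ~0.64
    # Expected number of false positives across all m tests.
    expected_fp = m * alpha                           # 1.0
    # Chance of three or more false positives, Binomial(m, alpha).
    p_three_plus = 1 - sum(
        math.comb(m, k) * alpha ** k * (1 - alpha) ** (m - k) for k in range(3)
    )                                                 # ~0.08

    print(f"FWER: {fwer:.2f}, expected FPs: {expected_fp:.1f}, P(3+): {p_three_plus:.2f}")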

Why Smart Teams Keep Making This Mistake

The multiple comparisons problem feels counterintuitive because we think about each test in isolation. Marketing tests email subject lines. Product tests button colors. Data science tests recommendation algorithms.

Each team uses proper statistical methods. Each follows best practices. Yet collectively, they're generating false discoveries at an alarming rate.

    flowchart TD
        A[Run Multiple Tests] --> B{Each Test Significant?}
        B -->|Yes| C[Celebrate Success]
        B -->|No| D[Continue Testing]
        C --> E[Ship Changes]
        E --> F[Measure Impact]
        F --> G{Real Effect?}
        G -->|No| H[False Discovery]
        G -->|Yes| I[True Discovery]
        H --> J[Wasted Resources]

The issue compounds when teams don't coordinate. Product might run 10 tests this quarter. Marketing runs 15. Engineering runs 8. Nobody's tracking the collective false discovery rate.

The Bonferroni Band-Aid (And Why It Sucks)

The textbook solution is Bonferroni correction: divide your α by the number of tests. With 20 tests, use α = 0.05/20 = 0.0025.
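
In code it's a one-liner (a sketch with made-up p-values, not results from any real test suite):

    alpha = 0.05
    p_values = [0.003, 0.012, 0.041, 0.21, 0.48]    # hypothetical results from 5 tests

    # Bonferroni: compare each p-value to alpha / m instead of alpha.
    threshold = alpha / len(p_values)               # 0.01
    survivors = [p for p in p_values if p <= threshold]
    print(survivors)                                # [0.003] -- 0.012 and 0.041 no longer count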

This controls your family-wise error rate. Problem solved, right?

Not quite. Bonferroni is brutally conservative. It ignores how your tests are correlated and guards against the worst case, so when tests overlap or depend on each other (as they usually do) it's far stricter than necessary, and it treats exploratory research the same as confirmatory studies. You'll miss real effects while chasing statistical purity.

Worse, it encourages gaming. Teams split tests across quarters to avoid correction. They run "pilot studies" that mysteriously become final results. The method becomes the message.

Better Approaches for Real Organizations

False Discovery Rate (FDR) control offers a more practical alternative. Instead of controlling the probability of any false positive, FDR controls the expected proportion of false discoveries among your significant results.

The Benjamini-Hochberg procedure is surprisingly simple:

  1. Rank all p-values from smallest to largest
  2. Find the largest p-value where p ≤ (rank/total tests) × α
  3. Reject all hypotheses up to that point

This method adapts to your data. When you have strong signals, it's less conservative. When effects are weak, it tightens up.
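
Here's a minimal sketch of the procedure, hand-rolled for clarity with hypothetical p-values (in practice a library routine such as statsmodels' multipletests does the same thing):

    def benjamini_hochberg(p_values, alpha=0.05):
        """Return a reject/keep flag for each p-value, controlling FDR at alpha."""
        m = len(p_values)
        # Rank p-values ascending, remembering where each one came from.
        ranked = sorted(enumerate(p_values), key=lambda pair: pair[1])
        # Largest rank k (1-based) where p_(k) <= (k / m) * alpha.
        cutoff = 0
        for k, (_, p) in enumerate(ranked, start=1):
            if p <= (k / m) * alpha:
                cutoff = k
        # Reject every hypothesis at or below that rank.
        rejected = {idx for idx, _ in ranked[:cutoff]}
        return [i in rejected for i in range(m)]

    # Hypothetical p-values from 8 tests run this quarter.
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.32, 0.74]
    print(benjamini_hochberg(p_values))   # [True, True, False, False, False, False, False, False]

With those numbers, Bonferroni (0.05/8 ≈ 0.006) would keep only the first result; Benjamini-Hochberg also keeps the second, because it only controls the proportion of false discoveries rather than the chance of any false positive at all.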

Pre-registration and hypothesis hierarchies provide another layer of protection. Primary hypotheses get full statistical power. Secondary analyses use adjusted thresholds. Exploratory findings are labeled as such.

Some teams implement testing budgets — allocating statistical power across projects like financial resources. High-priority tests get lower α thresholds. Exploratory work uses higher ones.
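
In practice that can be as simple as a table the whole org agrees on up front; the names and thresholds below are purely illustrative:

    # A hypothetical quarterly alpha budget, written down before any test launches.
    alpha_budget = {
        "checkout_redesign":    0.01,    # primary: ships directly, strict bar
        "pricing_page_copy":    0.025,   # secondary: adjusted threshold
        "recs_ranking_tweaks":  0.10,    # exploratory: hypothesis-generating only
    }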

The Uncomfortable Truth About "Significant" Results

Even with corrections, statistical significance doesn't guarantee practical importance. A 0.1% improvement in click-through rates might reach p < 0.001 with sufficient sample size, but implementing it could cost more than it generates.
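
A back-of-the-envelope two-proportion z-test shows how easily that happens (the traffic and click-through numbers are made up):

    from math import sqrt
    from statistics import NormalDist

    # Hypothetical traffic: 5M users per arm, 10.0% vs 10.1% click-through.
    n_a, n_b = 5_000_000, 5_000_000
    p_a, p_b = 0.100, 0.101

    # Two-proportion z-test on the pooled rate.
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # ~1e-07, far below 0.001

    # 95% confidence interval for the absolute lift: the more useful number.
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    low, high = (p_b - p_a) - 1.96 * se_diff, (p_b - p_a) + 1.96 * se_diff
    print(f"z = {z:.1f}, p = {p_value:.1e}, lift CI: ({low:.3%}, {high:.3%})")

The p-value screams significance; the confidence interval tells you the lift is roughly a tenth of a point, which is the number the business case actually rests on.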

Effect size matters more than p-values. Confidence intervals tell better stories than binary significance tests. Replication beats single studies, no matter how impressive the statistics look.

Your testing program needs guard rails, not just good intentions. Track how many tests you're running. Monitor your actual false discovery rates. Build correction methods into your workflow, not your post-hoc analysis.

Because in a world of infinite testing opportunities, the multiple comparisons problem isn't a statistical nuance. It's a business risk masquerading as scientific rigor.
