Common Statistical Errors: Mistakes That Invalidate Research Findings
P-Hacking and Data Dredging
P-hacking refers to the practice of trying multiple analyses, outcome measures, subgroup comparisons, or data exclusion criteria until a statistically significant result emerges. Researchers might run 20 different statistical tests on their data and report only the one that produced p < 0.05, without disclosing the 19 non-significant ones. Because each test at alpha = 0.05 has a 5% chance of a false positive when the null hypothesis is true, 20 tests yield on average one false positive even when no true effects exist.
Common forms include: removing outliers selectively (only when it improves the p-value), testing multiple outcome variables and reporting only the significant one, adding or removing covariates until results become significant, stopping data collection as soon as significance is reached, and analyzing subgroups until a significant comparison appears. Pre-registration of hypotheses and analysis plans before data collection is the primary defense against p-hacking.
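Optional stopping is easy to underestimate, so a small simulation may help. The sketch below is illustrative only: it assumes two groups drawn from the same normal distribution (so any "effect" is a false positive), a z-test with a known standard deviation of 1, and a hypothetical schedule of checking after every 5 added participants per group, from 10 up to 50.

```python
import random
from statistics import NormalDist, fmean

def two_sided_p(group_a, group_b):
    """Two-sample z-test p-value, assuming equal group sizes and a known SD of 1."""
    n = len(group_a)
    z = (fmean(group_a) - fmean(group_b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def study_finds_significance(rng, peek, start=10, step=5, max_n=50, alpha=0.05):
    """Simulate one study with no true effect; return True if it ends with p < alpha."""
    a = [rng.gauss(0, 1) for _ in range(start)]
    b = [rng.gauss(0, 1) for _ in range(start)]
    while True:
        if peek and two_sided_p(a, b) < alpha:
            return True  # stop as soon as the result looks significant
        if len(a) >= max_n:
            return two_sided_p(a, b) < alpha  # single test at the planned sample size
        a += [rng.gauss(0, 1) for _ in range(step)]
        b += [rng.gauss(0, 1) for _ in range(step)]

def false_positive_rate(peek, n_studies=2000, seed=0):
    """Fraction of simulated null studies that report a significant result."""
    rng = random.Random(seed)
    hits = sum(study_finds_significance(rng, peek) for _ in range(n_studies))
    return hits / n_studies

fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"fixed sample size: {fixed_rate:.3f}   stop when significant: {peeking_rate:.3f}")
```

Under these assumptions the fixed-sample design should stay near the nominal 5%, while the stop-when-significant design should come out well above it, even though every individual test uses the same alpha.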
Multiple Comparisons Without Correction
When you test multiple hypotheses simultaneously, the probability of at least one false positive grows rapidly. With 20 independent tests at alpha = 0.05, the probability of at least one false positive is 1 - (0.95)^20, or about 64%. This family-wise error rate can be controlled with the Bonferroni correction (divide alpha by the number of tests) or the more powerful Holm step-down procedure. When many hypotheses are tested and some false positives are tolerable, an alternative is to control the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses, using the Benjamini-Hochberg method.
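The three corrections can be written in a few lines each. The following is a minimal sketch using only the Python standard library; the function names and the five example p-values are made up for illustration, and in practice a tested library routine would be preferable.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent true-null tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # stop at the first p-value that fails its threshold
        reject[i] = True
    return reject

def benjamini_hochberg(p_values, q=0.05):
    """BH step-up: find the largest rank k with p_(k) <= (k/m) * q, reject ranks 1..k."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

# Hypothetical p-values from five tests
p_values = [0.001, 0.011, 0.02, 0.035, 0.30]
n_bonferroni = sum(bonferroni(p_values))
n_holm = sum(holm(p_values))
n_bh = sum(benjamini_hochberg(p_values))

print(f"FWER for 20 tests: {family_wise_error_rate(20):.3f}")
print(f"rejections: Bonferroni={n_bonferroni}, Holm={n_holm}, BH={n_bh}")
```

For these example values Bonferroni rejects one hypothesis, Holm two, and Benjamini-Hochberg four, which reflects the usual ordering from most to least conservative. The first two control the family-wise error rate; the last controls the false discovery rate, a weaker guarantee.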
Genomics studies that test thousands of genes simultaneously would produce hundreds of false discoveries without these corrections: 10,000 tests of true null hypotheses at alpha = 0.05 yield about 500 false positives on average. Even in smaller-scale research, running separate t-tests for every pair of groups in a multi-group study inflates the false positive rate. Using ANOVA followed by post-hoc tests that incorporate a multiple comparisons correction (Tukey HSD, Bonferroni) properly controls the error rate.
Confusing Correlation with Causation
Observational studies can only establish associations, never causation, regardless of the strength of the correlation or the sophistication of the statistical model. No amount of regression analysis can rule out unmeasured confounders that might explain an observed relationship. Only randomized experiments with appropriate controls can establish causal claims.
Yet researchers routinely use causal language ("X increases Y," "A leads to B") when reporting observational results, misleading readers about the strength of evidence. Media headlines amplify this problem by converting associations into causal claims. "Coffee drinkers live longer" sounds like causation but reflects only a correlation that could be explained by dozens of confounding variables related to lifestyle, income, and health behaviors.
Interpreting Non-Significance as No Effect
Failing to reject the null hypothesis does not confirm it. A non-significant result (p > 0.05) can mean the null is true, but it can also mean the study lacked adequate power to detect a real effect. A two-sided t-test with 15 participants per group has only about 26% power to detect a medium effect (d = 0.5) at alpha = 0.05, meaning it will miss such an effect roughly three-quarters of the time.
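Power for a two-group comparison can be approximated in a few lines. The sketch below uses the normal approximation to the two-sample test with equal group sizes; this is an assumption made for simplicity, and it slightly overstates power relative to the exact t-test when samples are small. The function names are illustrative.

```python
from statistics import NormalDist

def approx_power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate two-sided power for standardized effect d (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5  # expected value of the test statistic
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

def n_for_power(d, target=0.80, alpha=0.05):
    """Smallest per-group sample size whose approximate power reaches the target."""
    n = 2
    while approx_power_two_sample(d, n, alpha) < target:
        n += 1
    return n

small_study_power = approx_power_two_sample(d=0.5, n_per_group=15)
needed_n = n_for_power(d=0.5)
print(f"power with 15 per group: {small_study_power:.2f}")
print(f"per-group n for 80% power: {needed_n}")
```

Under this approximation, reaching the conventional 80% power for d = 0.5 takes a per-group sample size in the low sixties, several times larger than the small study in the example.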
The confidence interval clarifies the distinction: if it is narrow and centered near zero, evidence supports no meaningful effect. If it is wide and includes both zero and substantial effects, the study was simply too imprecise to draw conclusions. Equivalence testing provides a formal framework for concluding that an effect is negligibly small, by testing whether the effect falls within a pre-specified range of practical equivalence.
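Equivalence testing is often carried out as two one-sided tests (TOST). The sketch below assumes a large-sample normal approximation and takes an effect estimate with its standard error as given; the estimates, standard errors, and the equivalence bounds of plus or minus 0.2 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

def tost_p_value(estimate, std_error, lower, upper):
    """Two one-sided tests: a small p-value supports lower < true effect < upper."""
    z = NormalDist()
    p_lower = 1 - z.cdf((estimate - lower) / std_error)  # H0: effect <= lower
    p_upper = z.cdf((estimate - upper) / std_error)      # H0: effect >= upper
    return max(p_lower, p_upper)

# Same point estimate, very different precision (hypothetical numbers)
precise_p = tost_p_value(estimate=0.02, std_error=0.05, lower=-0.2, upper=0.2)
imprecise_p = tost_p_value(estimate=0.02, std_error=0.30, lower=-0.2, upper=0.2)

print(f"precise study:   TOST p = {precise_p:.4f}")
print(f"imprecise study: TOST p = {imprecise_p:.4f}")
```

Both hypothetical studies would be "non-significant" in a conventional test against zero, but only the precise one supports the claim that the effect is negligibly small. The imprecise one is simply inconclusive.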
Confusing Statistical and Practical Significance
A result can be statistically significant (p < 0.05) while being practically meaningless. With 500,000 observations, a correlation of r = 0.01 is statistically significant but explains only 0.01% of the variance. Conversely, a clinically important treatment effect may not reach statistical significance in a small, underpowered study. Always report effect sizes alongside p-values to distinguish real importance from mere detectability.
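The large-sample correlation example can be checked directly. This is a rough sketch: it uses the standard t statistic for a correlation coefficient and, because the sample is so large, approximates the p-value with the normal distribution. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def correlation_test(r, n):
    """Return (t statistic, approximate two-sided p-value, variance explained)."""
    t = r * sqrt(n - 2) / sqrt(1 - r * r)
    p = 2 * (1 - NormalDist().cdf(abs(t)))  # normal approximation, fine for large n
    return t, p, r * r

t_stat, p_value, variance_explained = correlation_test(r=0.01, n=500_000)
print(f"t = {t_stat:.2f}, p = {p_value:.1e}, variance explained = {variance_explained:.2%}")
```

The test statistic is about 7, so the p-value is far below any conventional threshold, yet the correlation accounts for only one ten-thousandth of the variance.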
The solution is to evaluate statistical and practical significance together. A study reporting p = 0.001 and d = 0.05 has found an effect that is statistically detectable but trivial in size. A study reporting p = 0.08 and d = 0.75 may have found an important effect that the sample was too small to confirm. Confidence intervals communicate both the estimated magnitude and the precision of the estimate, providing a more complete picture than either the p-value or the effect size alone.
Simpson's Paradox
Simpson's paradox occurs when a trend that appears in subgroups reverses when the subgroups are combined. A treatment might appear to be more effective than a placebo in both mild and severe cases separately, yet appear less effective overall because sicker patients disproportionately received the treatment. This paradox arises from confounding variables that affect both the grouping and the outcome, and it demonstrates why controlling for relevant variables is essential in regression and other analyses.
The famous Berkeley gender bias case illustrates this. Overall admission rates appeared to favor men, but examining individual departments revealed that women were admitted at equal or higher rates in most departments. Women disproportionately applied to more competitive departments with lower admission rates, creating an appearance of overall bias against women that was not borne out within most individual departments. The lesson is that aggregate data can produce conclusions that contradict the patterns in disaggregated data.
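Returning to the treatment example, a small table of counts shows how the reversal arises. The numbers below are invented for illustration and do not come from any real trial; they are chosen so that most treated patients are severe cases and most placebo patients are mild cases.

```python
# Hypothetical counts for illustration only: (recovered, total) per arm and severity
counts = {
    "mild":   {"treatment": (18, 20), "placebo": (64, 80)},
    "severe": {"treatment": (40, 80), "placebo": (8, 20)},
}

def recovery_rate(recovered, total):
    return recovered / total

def pooled_rate(arm):
    """Recovery rate for one arm after combining mild and severe patients."""
    recovered = sum(counts[severity][arm][0] for severity in counts)
    total = sum(counts[severity][arm][1] for severity in counts)
    return recovered / total

for severity, arms in counts.items():
    treated = recovery_rate(*arms["treatment"])
    placebo = recovery_rate(*arms["placebo"])
    print(f"{severity:>6}: treatment {treated:.0%} vs placebo {placebo:.0%}")

print(f"pooled: treatment {pooled_rate('treatment'):.0%} "
      f"vs placebo {pooled_rate('placebo'):.0%}")
```

In these made-up data the treatment wins within each severity level (90% vs 80% for mild cases, 50% vs 40% for severe cases) but loses in the pooled comparison (58% vs 72%), purely because the treated group contains far more severe cases.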
Survivorship Bias
Survivorship bias occurs when analyses include only observations that "survived" some selection process while ignoring those that did not. Analyzing only successful companies to identify success factors ignores all the failed companies that did the same things. Studying only patients who completed a drug trial ignores those who dropped out due to side effects. The survivors are a biased sample that overestimates positive outcomes and underestimates risks.
The classic example is the World War II bomber analysis, where engineers initially proposed reinforcing the areas of returning planes that showed the most damage. The statistician Abraham Wald recognized that the returning planes represented survivors, and that the areas without damage were actually the most critical because planes hit there never returned. This insight illustrates how focusing only on survivors can produce conclusions that are the exact opposite of the truth.
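The selection effect described in this section is easy to reproduce with simulated data. The sketch below assumes a hypothetical population of 10,000 companies whose returns are pure noise around zero, and a made-up rule that any company returning less than -10% disappears from the dataset.

```python
import random
from statistics import fmean

rng = random.Random(42)

# Hypothetical population: returns are noise around a true mean of zero
returns = [rng.gauss(0.0, 0.2) for _ in range(10_000)]

# Assumed selection rule: companies below -10% fail and are never observed
survivors = [r for r in returns if r > -0.10]

all_mean = fmean(returns)
survivor_mean = fmean(survivors)
print(f"all companies:  mean return {all_mean:+.3f} (n={len(returns)})")
print(f"survivors only: mean return {survivor_mean:+.3f} (n={len(survivors)})")
```

Nothing distinguishes the survivors except luck, yet their average return is clearly positive. An analyst who sees only the survivors would conclude that the typical company in this population is profitable.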
Ecological Fallacy and Base Rate Neglect
The ecological fallacy draws conclusions about individuals from group-level data. Countries with higher chocolate consumption have more Nobel laureates, but this does not mean that individuals who eat more chocolate are more likely to win Nobel Prizes. The correlation exists at the country level because wealthy nations have both luxury food consumption and research funding. Inferring individual-level relationships from aggregate data is statistically invalid because group averages hide within-group variation.
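A simulation can make the gap between group-level and individual-level relationships concrete. In the sketch below, which is entirely hypothetical, each of 30 countries has a shared "wealth" level that raises both variables, while within a country each individual's two values are generated independently of each other.

```python
import random
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)
group_x_means, group_y_means, within_rs = [], [], []

for _ in range(30):                      # 30 hypothetical countries
    wealth = rng.gauss(0, 1)             # group-level driver of both variables
    xs = [wealth + rng.gauss(0, 1) for _ in range(200)]  # e.g. consumption
    ys = [wealth + rng.gauss(0, 1) for _ in range(200)]  # e.g. achievement
    group_x_means.append(fmean(xs))
    group_y_means.append(fmean(ys))
    within_rs.append(pearson(xs, ys))

between_r = pearson(group_x_means, group_y_means)
mean_within_r = fmean(within_rs)
print(f"correlation of country averages:    {between_r:.2f}")
print(f"average within-country correlation: {mean_within_r:.2f}")
```

The country averages are almost perfectly correlated, while within any one country an individual's first value says essentially nothing about the second. The group-level pattern comes entirely from the shared driver.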
Base rate neglect occurs when people ignore the prevalence (base rate) of a condition when interpreting test results. A medical test with 99% sensitivity and 95% specificity sounds highly accurate, but when the disease affects only 1 in 1000 people, a positive result still means only about a 2% chance of actually having the disease. The large number of healthy people producing false positives overwhelms the small number of sick people producing true positives. Bayes' theorem handles this calculation correctly by incorporating the prior probability of the condition.
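The calculation for the medical test example is a direct application of Bayes' theorem. The function name below is illustrative; the inputs are the sensitivity, specificity, and prevalence quoted above.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

ppv = positive_predictive_value(sensitivity=0.99, specificity=0.95, prevalence=0.001)
print(f"chance of disease given a positive test: {ppv:.1%}")

# The same test applied where the condition is common (hypothetical 20% prevalence)
ppv_common = positive_predictive_value(sensitivity=0.99, specificity=0.95, prevalence=0.20)
print(f"with 20% prevalence: {ppv_common:.1%}")
```

Per 100,000 people tested, roughly 99 true positives are swamped by about 4,995 false positives, which is where the figure of about 2% comes from. The second call shows that the same test becomes far more informative when the condition is common.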
Common statistical errors arise from p-hacking, ignoring multiple comparisons, confusing correlation with causation, treating non-significance as proof of no effect, mistaking statistical significance for practical importance, Simpson's paradox, survivorship bias, the ecological fallacy, and base rate neglect. Pre-registration, correction for multiple testing, reporting confidence intervals and effect sizes, examining disaggregated data, and Bayesian reasoning protect against these mistakes. Awareness of these errors is the first step toward avoiding them in your own research.