P-Values Explained Simply: What They Mean and What They Do Not
The Definition in Plain Language
Imagine you are testing whether a coin is fair. You flip it 100 times and get 60 heads. The null hypothesis says the coin is fair (probability of heads = 0.5). The p-value answers: "If this coin truly were fair, what is the probability of getting 60 or more heads (or 40 or fewer) in 100 flips?" Using the binomial distribution, that probability works out to approximately 0.057. This means that about 5.7% of the time, a fair coin would produce results this extreme or more extreme by pure chance.
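The coin example can be checked directly. The following is a minimal sketch using only the Python standard library; the function name binomial_two_sided_p is made up for illustration, and the calculation assumes a null probability of exactly 0.5, which makes the distribution symmetric.

```python
from math import comb


def binomial_two_sided_p(heads: int, flips: int) -> float:
    """Two-sided p-value for a fair coin (null probability of heads = 0.5).

    Counts every outcome at least as far from the expected number of heads
    as the observed one, then divides by the total number of equally likely
    flip sequences.
    """
    expected = flips / 2
    distance = abs(heads - expected)
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= distance
    )
    return extreme_outcomes / 2 ** flips


p_value = binomial_two_sided_p(60, 100)
print(f"Two-sided p-value for 60 heads in 100 flips: {p_value:.3f}")
```

Running it should give a value close to the 0.057 quoted above. For a biased null (probability other than 0.5) the two tails are no longer mirror images, and this simple distance rule would need to be replaced.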
The p-value does not tell you the probability that the coin is fair. It does not tell you the probability that your result occurred by chance. It tells you how surprising your data would be in a world where the null hypothesis is true. That subtle distinction is the source of most misinterpretations.
Think of it as a compatibility measure. A p-value of 0.80 means your data is highly compatible with the null hypothesis: the results you observed would be commonplace if the null were true. A p-value of 0.001 means your data is very incompatible with the null hypothesis: results this extreme would occur only about 1 in 1000 times if the null were true. Low compatibility with the null suggests (but does not prove) that the null may be wrong.
What P-Values Are NOT
Decades of research on statistical literacy have documented persistent misunderstandings of p-values among researchers, students, and even textbook authors. The following statements are all FALSE:
"The p-value is the probability that the null hypothesis is true." The p-value is calculated assuming the null is true, so it cannot simultaneously tell you the probability of that assumption. To compute P(null is true | data), you would need Bayes' theorem and a prior probability for the null hypothesis, which is exactly what Bayesian statistics provides.
"A p-value of 0.05 means there is only a 5% chance the result is due to chance." The p-value is the probability of the data given the null, not the probability of the null given the data. These are different quantities. In research areas where most tested hypotheses are false (exploratory research with many comparisons), even p = 0.05 results may have a high probability of being false positives.
"A smaller p-value means a larger or more important effect." P-values conflate effect size with sample size. A trivially small difference can produce a tiny p-value if the sample is large enough. A clinically meaningful difference can produce a large p-value if the sample is too small. Always examine the effect size separately from the p-value.
"A non-significant p-value means there is no effect." Failure to reject the null hypothesis is not the same as confirming it. A non-significant result might mean the null is true, but it might equally mean your study was underpowered. The confidence interval clarifies this: if the interval is narrow and contains only trivial effect sizes, you have good evidence of no meaningful effect. If the interval is wide and includes both zero and large effects, you simply do not have enough data to draw a conclusion.
The 0.05 Threshold
The conventional alpha = 0.05 threshold was popularized by Ronald Fisher in the 1920s as a convenient reference point, not as an inviolable boundary between truth and falsehood. Fisher himself described it as a guide for when evidence is worth a second look, not as a rigid decision rule. Nevertheless, the threshold has calcified in scientific practice into a binary classification system where p < 0.05 is "significant" and p >= 0.05 is "not significant."
This binary thinking creates problems. A result with p = 0.049 is treated as fundamentally different from p = 0.051, even though the evidence strength is nearly identical. The sharp cutoff incentivizes p-hacking (adjusting analyses until p crosses 0.05) and selective reporting (publishing only significant results). Many statisticians and journals have moved toward reporting exact p-values and emphasizing confidence intervals and effect sizes over binary significance declarations.
In 2019, a comment in Nature signed by more than 800 researchers called for retiring "statistically significant" as a dichotomy. Their recommendation: report exact p-values as one piece of evidence, always accompanied by effect sizes, confidence intervals, and contextual judgment about practical importance. Some fields have adopted stricter thresholds (particle physics uses a five-sigma standard, roughly p < 0.0000003, for discovery claims), while others use more lenient thresholds for exploratory work (p < 0.10 as suggestive evidence).
P-Values and Sample Size
P-values depend heavily on sample size because larger samples produce smaller standard errors, which inflate test statistics. The same observed difference between two groups will yield a smaller p-value as sample size increases. With 20 participants per group, a 3-point difference in blood pressure might produce p = 0.25. With 2000 per group, the same 3-point difference with the same variability would produce a p-value far below 0.0001. The effect did not become more real or more important; the test simply became more sensitive.
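The blood-pressure example can be sketched with a normal approximation to the two-sample test. The standard deviation of 8 points is an assumption (the text does not give one), chosen because it yields a p-value near 0.25 at 20 per group; two_sample_z_p is an illustrative helper, and a real analysis would normally use a t-test.

```python
from math import erfc, sqrt


def two_sample_z_p(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means (normal approximation).

    Assumes both groups have the same size and the same standard deviation.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = abs(diff) / standard_error
    # For a standard normal Z, P(|Z| >= z) equals erfc(z / sqrt(2)).
    return erfc(z / sqrt(2))


# Same 3-point difference, same assumed SD of 8, growing sample size.
p_by_n = {n: two_sample_z_p(diff=3.0, sd=8.0, n_per_group=n) for n in (20, 200, 2000)}
for n, p in p_by_n.items():
    print(f"n per group = {n:>4}: p = {p:.3g}")
```

The observed difference never changes across the three rows; only the standard error shrinks, and the p-value falls with it.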
This sensitivity means that very large datasets (common in big data applications, social media research, and electronic health records) will find statistically significant results for nearly every comparison, including differences so small they have no practical, clinical, or theoretical relevance. In these contexts, effect sizes and practical significance must take precedence over p-values in determining what findings matter.
Conversely, small samples can produce large p-values even when real, substantial effects exist. A pilot study with 10 participants that shows a large effect size but p = 0.12 is not evidence that the effect is absent. It means the study lacked sufficient statistical power to detect the effect reliably. The appropriate response is to collect more data, not to conclude the effect does not exist.
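A rough power calculation shows how often a small study would miss a real effect. This is a sketch under stated assumptions: a true standardized effect of d = 0.8, 10 participants per group (the text does not say how the 10 participants are split), and a normal approximation that slightly overstates power compared with an exact t-test. The name approx_power is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size_d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect is d.
    shift = effect_size_d * sqrt(n_per_group / 2)
    # Probability the statistic lands in either rejection region.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


power_small = approx_power(effect_size_d=0.8, n_per_group=10)
power_large = approx_power(effect_size_d=0.8, n_per_group=50)
print(f"Power with 10 per group: {power_small:.2f}")
print(f"Power with 50 per group: {power_large:.2f}")
```

Under these assumptions the small study detects a large, real effect less than half the time, so a non-significant result from it says little about whether the effect exists.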
One-Tailed vs Two-Tailed P-Values
A two-tailed test considers deviations in both directions from the null hypothesis. If testing whether a coin is fair, a two-tailed test asks whether the observed proportion of heads is significantly different from 0.5, either too high or too low. The p-value includes probability from both tails of the distribution. Most tests in published research are two-tailed because researchers want to detect effects in either direction.
A one-tailed test considers deviations in only one direction. If you predict that a new drug will reduce blood pressure (not just change it), a one-tailed test places all the rejection region in one tail. This makes the test more powerful for detecting the predicted direction but completely unable to detect effects in the opposite direction. One-tailed tests are controversial because they require a strong directional prediction stated before data collection, and they can miss real effects that happen to go the wrong way. Most statisticians recommend two-tailed tests unless a one-tailed alternative is clearly justified by theory or prior evidence.
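Returning to the coin with 60 heads in 100 flips, the two kinds of p-value can be compared side by side. This is a minimal sketch with the standard library; upper_tail is an illustrative helper, and doubling the tail is valid here only because the fair-coin distribution is symmetric.

```python
from math import comb


def upper_tail(heads: int, flips: int) -> float:
    """P(at least `heads` heads in `flips` flips) for a fair coin."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips


# One-tailed: only "too many heads" counts as evidence against the null.
one_tailed = upper_tail(60, 100)
# Two-tailed: "too many" and "too few" both count (symmetric null, so double).
two_tailed = min(1.0, 2 * one_tailed)

print(f"One-tailed p = {one_tailed:.3f}")
print(f"Two-tailed p = {two_tailed:.3f}")
```

The one-tailed value is half the two-tailed one, so the same data would cross a 0.05 threshold under one choice and not the other. That is why the direction has to be fixed before the data are seen.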
Reporting P-Values Properly
Good statistical practice involves reporting p-values as part of a comprehensive results summary, not as the sole basis for conclusions. A well-reported result includes: the test used, the test statistic value, the degrees of freedom, the exact p-value (not just "p < 0.05"), the effect size with a confidence interval, and a statement about practical significance in context.
For example, rather than writing "the groups differed significantly (p < 0.05)," write "the treatment group scored 8.3 points higher on average than the control group (95% CI [1.5, 15.1]), t(48) = 2.47, p = 0.017, Cohen's d = 0.70." Placing the confidence interval next to the mean difference makes clear that it describes the difference in points, not Cohen's d. The second version tells the reader the direction and magnitude of the effect, the precision of the estimate, and the strength of evidence, enabling them to form their own judgment about importance.
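The numbers in a report like this should agree with one another, and a reader can check that with a few lines of arithmetic. The sketch below assumes two equal groups of 25 (inferred from the 48 degrees of freedom, not stated in the example) and hard-codes the 95% critical value of t for 48 degrees of freedom from a standard t-table to stay within the standard library.

```python
from math import sqrt

# Values taken from the example sentence.
mean_diff = 8.3
t_stat = 2.47
df = 48

# Assumption: equal group sizes, so df = n1 + n2 - 2 implies 25 per group.
n_per_group = (df + 2) // 2

# Assumption: two-sided 95% critical value of t with 48 df (from a t-table).
T_CRIT_95 = 2.0106

standard_error = mean_diff / t_stat         # because t = difference / SE
cohens_d = t_stat * sqrt(2 / n_per_group)   # equal-n conversion from t to d
ci_low = mean_diff - T_CRIT_95 * standard_error
ci_high = mean_diff + T_CRIT_95 * standard_error

print(f"Cohen's d = {cohens_d:.2f}")
print(f"95% CI for the mean difference = [{ci_low:.1f}, {ci_high:.1f}]")
```

Under these assumptions the recomputed effect size and interval match the reported ones, which is a quick sanity check worth running on any result before citing it.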
A p-value measures how surprising your data would be if the null hypothesis were true. It is not the probability of the null being true, not the probability of a chance result, and not a measure of effect importance. Always report p-values alongside effect sizes and confidence intervals for a complete picture.