Statistical Power Explained: The Probability of Detecting True Effects

Updated June 2026

Statistical power is the probability that a test will correctly reject the null hypothesis when it is actually false, that is, the probability of detecting a real effect when one truly exists. A study with 80% power has an 80% chance of producing a statistically significant result if the true effect is as large as specified. Underpowered studies frequently fail to detect real effects, wasting resources and potentially leading to the false conclusion that an effect does not exist.

Power and Type II Error

Power = 1 - beta, where beta is the probability of a Type II error (failing to reject a false null hypothesis). If power is 0.80, then beta is 0.20, meaning there is a 20% chance of missing a real effect. The conventional minimum acceptable power is 0.80, though 0.90 is preferred for important research where failing to detect an effect would have serious consequences, such as abandoning an effective drug or missing a safety signal in a clinical trial.
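
The relationship between power and beta can be made concrete with a small calculation. The following is a minimal sketch, assuming a two-sided, two-sample z-test with equal group sizes and a normal approximation. The function name power_two_sample and the example inputs (d = 0.5 with 64 participants per group) are illustrative assumptions, not values from any particular study.

```python
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d is the standardized effect size (mean difference / SD).
    Normal approximation; an exact t-test calculation typically
    gives slightly lower power.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # critical value for the chosen alpha
    shift = d * (n_per_group / 2) ** 0.5   # expected value of the test statistic
    # Probability that the statistic lands in either rejection region
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

power = power_two_sample(d=0.5, n_per_group=64)
beta = 1 - power  # Type II error rate
print(f"power = {power:.2f}, beta = {beta:.2f}")
```

With these inputs the design sits close to the conventional 0.80 threshold, so beta is close to 0.20.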

Power is not a fixed property of a test. For a given statistical test, four quantities are interrelated: power, sample size, effect size, and significance level (alpha). Given any three, you can solve for the fourth. This flexibility allows researchers to plan studies (solve for the required sample size), find the smallest effect a given design can reliably detect (solve for effect size), and compare test sensitivity (which test has more power for a given scenario). Understanding these quantities and how they interact is the foundation of every well-designed study.

The relationship between power and the other three factors can be visualized as a balance. Increasing sample size pushes power upward. Larger true effects are easier to detect, also increasing power. A stricter significance threshold (lower alpha) makes it harder to reject the null, decreasing power. A researcher who wants 90% power to detect a small effect at alpha = 0.01 will need a substantially larger sample than one who accepts 80% power for a medium effect at alpha = 0.05.
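
The two designs in that comparison can be put side by side. This is a minimal sketch, assuming a two-sided, two-sample comparison under a normal approximation, and assuming that "small" and "medium" mean d = 0.2 and d = 0.5 (Cohen's conventional benchmarks, which the text does not state explicitly). The helper name n_per_group is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_power = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

n_medium = n_per_group(d=0.5, power=0.80, alpha=0.05)
n_small = n_per_group(d=0.2, power=0.90, alpha=0.01)
print(f"medium effect, 80% power, alpha 0.05: {n_medium} per group")
print(f"small effect, 90% power, alpha 0.01:  {n_small} per group")
```

Under these assumptions the stricter design needs more than ten times as many participants per group. An exact t-test calculation would add a participant or two to each figure.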

Factors That Affect Power

Sample size is the most controllable factor. Larger samples produce smaller standard errors, which make it easier to detect real differences. Doubling the sample size does not double power (the relationship is nonlinear through the square root in the standard error formula), but it always increases it. This is why sample size planning is essential before data collection. A researcher planning a study should calculate the minimum sample size needed for adequate power, then add a buffer for anticipated attrition or missing data.
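
The nonlinear effect of sample size is easy to see by repeatedly doubling n. This is a sketch under a normal approximation for a two-sided, two-sample test. The effect size d = 0.5 and the starting sample size are arbitrary illustrative choices.

```python
from statistics import NormalDist

def power(d, n_per_group, alpha=0.05):
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5   # grows with sqrt(n), not with n
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

sizes = [32, 64, 128, 256]
powers = [power(0.5, n) for n in sizes]
for n, p in zip(sizes, powers):
    print(f"n per group = {n:3d}  power = {p:.3f}")
```

Each doubling raises power, but by a smaller amount as power approaches 1, which is why very high power targets are expensive.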

Effect size is the magnitude of the true effect. Larger effects are easier to detect, requiring fewer participants. Detecting a mean difference of 10 points is easier than detecting a difference of 2 points, given the same variability. Researchers must specify the smallest effect they consider worth detecting, which determines the minimum sample size. This is often the most difficult judgment call in study planning, because the true effect size is unknown before the study is conducted. Published literature, pilot studies, and expert judgment all inform this decision, though each source has its own biases.

Significance level (alpha) affects power through its relationship with the critical value. A more stringent alpha (0.01 vs 0.05) reduces power because it requires stronger evidence to reject the null. This creates the fundamental trade-off: reducing false positives (lower alpha) increases false negatives (lower power), and vice versa. For a given true effect and level of measurement noise, the only way to improve both simultaneously is to increase sample size. In fields like particle physics, where alpha is set at roughly 0.0000003 (the five-sigma standard), enormous datasets are required to maintain reasonable power.
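
The trade-off can be illustrated by holding the design fixed and tightening alpha. This is a sketch assuming a two-sided, two-sample z-test with a hypothetical d = 0.5 and 64 participants per group. The five-sigma figure is computed here as a one-sided normal tail area, which is an assumption about how that threshold is defined.

```python
from statistics import NormalDist

z = NormalDist()

def power(d, n_per_group, alpha):
    z_crit = z.inv_cdf(1 - alpha / 2)      # stricter alpha -> larger critical value
    shift = d * (n_per_group / 2) ** 0.5
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

alphas = [0.05, 0.01, 0.001]
powers = [power(0.5, 64, a) for a in alphas]
for a, p in zip(alphas, powers):
    print(f"alpha = {a:<6} power = {p:.3f}")

# One-sided tail area beyond five standard deviations
alpha_five_sigma = 1 - z.cdf(5)
print(f"five-sigma alpha = {alpha_five_sigma:.2e}")
```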

Variability in the data (standard deviation) inversely affects power. Noisier data makes it harder to detect signals. Researchers can increase power by reducing variability through more precise measurements, more homogeneous samples, or within-subjects designs that remove between-person variability. A within-subjects design can dramatically increase power because each participant serves as their own control, eliminating the individual differences that inflate error variance in between-subjects comparisons.
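
The gain from a within-subjects design depends on how strongly a person's two measurements are correlated. The sketch below assumes equal standard deviations in both conditions, a normal approximation, and a hypothetical correlation rho = 0.7. The function names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()

def power_from_shift(shift, alpha=0.05):
    z_crit = z.inv_cdf(1 - alpha / 2)
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

def power_between(d, n_per_group):
    # Two independent groups: SE of the mean difference is sd * sqrt(2 / n)
    return power_from_shift(d * (n_per_group / 2) ** 0.5)

def power_within(d, n_subjects, rho):
    # Paired differences have variance 2 * sd^2 * (1 - rho), so a higher
    # correlation between a person's two measurements shrinks the SE
    return power_from_shift(d * (n_subjects / (2 * (1 - rho))) ** 0.5)

between = power_between(d=0.3, n_per_group=30)       # 60 people in total
within = power_within(d=0.3, n_subjects=30, rho=0.7)  # 30 people in total
print(f"between-subjects power = {between:.2f}")
print(f"within-subjects power  = {within:.2f}")
```

When rho is 0 the two formulas coincide, so the advantage comes entirely from the shared person-level variance that pairing removes.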

Power Curves and Sensitivity Analysis

A power curve plots statistical power on the vertical axis against effect size (or sample size) on the horizontal axis, illustrating how power changes across a range of scenarios. Plotted against effect size, the curve is typically S-shaped: power equals alpha when the effect is zero (because random noise alone produces significant results at that rate), rises steeply through the range of medium effects, and asymptotically approaches 1.0 for very large effects. Reading these curves helps researchers understand the sensitivity profile of their planned study rather than relying on a single power calculation at one assumed effect size.

Sensitivity analysis extends power curves by computing power across a range of plausible effect sizes rather than committing to a single estimate. This approach is more honest than traditional power analysis because it acknowledges uncertainty about the true effect size. A researcher might report that their study has 80% power to detect d = 0.50, about 50% power for d = 0.35, and only about 20% power for d = 0.20. This transparency helps reviewers and readers calibrate how much confidence to place in both significant and non-significant findings from the study.
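
A sensitivity table of this kind takes only a few lines to produce. The sketch below assumes a two-sided, two-sample z-test under a normal approximation, with a hypothetical fixed sample of 64 participants per group. The grid of effect sizes is an illustrative choice.

```python
from statistics import NormalDist

z = NormalDist()

def power(d, n_per_group, alpha=0.05):
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

n = 64  # hypothetical per-group sample size
effect_sizes = [0.0, 0.10, 0.20, 0.35, 0.50, 0.80]
sensitivity = {d: power(d, n) for d in effect_sizes}
for d, p in sensitivity.items():
    print(f"d = {d:.2f}  power = {p:.2f}")
```

Under these assumptions power equals alpha at d = 0 and climbs toward 1 as d grows. Plotting the same values against d gives the power curve.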

The Consequences of Low Power

Underpowered studies have several harmful consequences beyond simply missing real effects. First, they waste resources by providing inconclusive results. A clinical trial that enrolls 50 participants when it needs 200 for adequate power will most likely produce a non-significant result regardless of whether the treatment works. The patients, researchers, and funders all invest time and money for no actionable information.
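
The cost of enrolling a quarter of the required sample can be estimated directly. This is a sketch under a normal approximation, assuming that "adequate power" means 80% at a two-sided alpha of 0.05 and that the expected test statistic scales with the square root of the sample size.

```python
from statistics import NormalDist

z = NormalDist()
alpha, planned_power = 0.05, 0.80
z_crit = z.inv_cdf(1 - alpha / 2)

# Expected test statistic that yields the planned power at the full sample size
shift_full = z_crit + z.inv_cdf(planned_power)

# Enrolling 50 instead of 200 shrinks the expected statistic by sqrt(50 / 200)
shift_small = shift_full * (50 / 200) ** 0.5

power_small = (1 - z.cdf(z_crit - shift_small)) + z.cdf(-z_crit - shift_small)
print(f"power with 50 of the required 200 participants: {power_small:.2f}")
```

Under these assumptions the smaller trial has power somewhat below 30%, so a non-significant result is the most likely outcome even when the treatment works.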

Second, when an underpowered study does produce a significant result (which happens by chance or when the effect is particularly large in that sample), the estimated effect size tends to be inflated, a phenomenon called the winner's curse or Type M error (magnitude error). This inflation occurs because only exaggerated effects manage to cross the significance threshold when power is low. A true effect of d = 0.3 in an underpowered study might only reach significance when the sample happens to produce an estimate of d = 0.7 or higher, creating a published literature of inflated effects that subsequent studies fail to replicate.
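
A small simulation shows the inflation. This is a sketch, not a full data-level simulation: it assumes a known standard deviation of 1, a hypothetical design with a true effect of d = 0.3 and 20 participants per group, and it draws each study's estimated effect directly from its normal sampling distribution.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(0)           # fixed seed so the sketch is reproducible
true_d, n_per_group = 0.3, 20    # hypothetical underpowered design
se = (2 / n_per_group) ** 0.5    # SE of the standardized mean difference (SD = 1)
z_crit = NormalDist().inv_cdf(0.975)

# Shortcut: sample each study's estimate from its sampling distribution
estimates = [rng.gauss(true_d, se) for _ in range(20_000)]
significant = [est for est in estimates if abs(est) / se > z_crit]

share_significant = len(significant) / len(estimates)           # empirical power
exaggeration = mean(abs(est) for est in significant) / true_d   # Type M ratio

print(f"share of studies reaching significance: {share_significant:.2f}")
print(f"average exaggeration among significant results: {exaggeration:.1f}x")
```

In this design an estimate must exceed roughly 0.62 in absolute value (1.96 times the standard error) to be significant, so every significant result overstates the true effect of 0.3 by more than a factor of two.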

Third, a literature dominated by underpowered studies creates a distorted picture where published effects appear larger than they truly are (because only inflated estimates reach significance and get published) while many true effects never appear in the literature at all (because underpowered studies produce non-significant results that go unpublished). This combination of the winner's curse and publication bias fuels the replication crisis that has affected fields like psychology, medicine, and economics over the past two decades.

Power Analysis in Practice

Prospective (a priori) power analysis determines the sample size needed before data collection. It requires specifying the anticipated effect size, desired power (typically 0.80), and significance level (typically 0.05). Sources for the anticipated effect size include pilot data, previous research, meta-analyses, or the smallest effect considered practically meaningful. When multiple sources disagree, erring toward the smaller estimated effect produces a more conservative (larger) sample size, which protects against being underpowered.

Post-hoc (observed) power analysis computes the power of a completed study given its observed effect size. This practice is controversial and largely uninformative: observed power is a direct function of the p-value, so any non-significant result corresponds to low observed power (below roughly 50% for a result that just misses the threshold, and lower still for larger p-values). It adds no information beyond what the p-value already tells you. Instead of post-hoc power, report confidence intervals, which directly show the range of effects consistent with the data and are far more informative for understanding what the study can and cannot conclude.
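
The dependence of observed power on the p-value can be shown without any data at all. The sketch below assumes a two-sided z-test. The function name observed_power is hypothetical, and it takes nothing but the p-value and alpha as input.

```python
from statistics import NormalDist

z = NormalDist()

def observed_power(p_value, alpha=0.05):
    """Observed power of a two-sided z-test, computed from the p-value alone."""
    z_obs = z.inv_cdf(1 - p_value / 2)   # observed |z| implied by the p-value
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Treat the observed effect as if it were the true effect
    return (1 - z.cdf(z_crit - z_obs)) + z.cdf(-z_crit - z_obs)

p_values = [0.001, 0.01, 0.05, 0.20, 0.50]
table = {p: observed_power(p) for p in p_values}
for p, pw in table.items():
    print(f"p = {p:<5}  observed power = {pw:.2f}")
```

Because the function needs no input other than the p-value, it cannot carry information the p-value does not already contain. A result sitting exactly at p = 0.05 maps to observed power of about one half.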

Power for Different Statistical Tests

Different tests have different power characteristics for the same data. A paired t-test is more powerful than an independent t-test when within-subject variability is smaller than between-subject variability, because pairing removes individual differences from the error term. For detecting overall group differences, a single omnibus ANOVA is generally preferable to a series of pairwise t-tests: it holds the error rate at alpha with one test, whereas pairwise tests need a multiple comparisons correction to control the family-wise error rate, and that correction costs power.

Parametric tests generally have more power than their nonparametric equivalents when distributional assumptions are met. For normally distributed data, the Mann-Whitney U test has an asymptotic relative efficiency of about 0.955 (3/pi) compared with the independent t-test, meaning it needs only about 5% more observations to reach the same power, a small sacrifice that may be worthwhile when normality is doubtful. Regression analyses gain power by including covariates that explain residual variance, because reducing unexplained variance shrinks the standard error and makes effects easier to detect.
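
The roughly 95% figure for the Mann-Whitney U test corresponds to its asymptotic relative efficiency against the t-test under normality, which is 3/pi. The sketch below turns that into a sample size adjustment. It is a large-sample approximation, and the planned t-test sample size of 64 per group is a hypothetical input.

```python
from math import ceil, pi

# Asymptotic relative efficiency of Mann-Whitney vs the t-test for normal data
are = 3 / pi

n_t_test = 64                           # hypothetical per-group n planned for a t-test
n_mann_whitney = ceil(n_t_test / are)   # approximate n for comparable power

print(f"ARE = {are:.3f}")
print(f"t-test n = {n_t_test}, Mann-Whitney n = {n_mann_whitney}")
```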

Common Mistakes in Power Analysis

The most frequent error is using an unrealistically large effect size to justify a conveniently small sample. Researchers sometimes cite published effects as their target, ignoring that published effect sizes are systematically inflated by the same low-power, publication-bias dynamics that power analysis is meant to prevent. Using the smallest clinically or practically meaningful effect size, rather than the expected or published effect size, produces studies that can detect effects worth caring about.

Another common mistake is treating 80% power as a guarantee rather than a probability. A study with 80% power still has a 20% chance of missing a real effect. If five such studies are run independently on a real phenomenon, there is only about a 33% chance that all five will be significant, and therefore about a 67% chance that at least one will be non-significant. Researchers should view power as a planning tool that reduces the probability of waste, not as insurance against inconclusive results. Additionally, power calculations assume perfect execution: no missing data, no protocol violations, no measurement errors beyond what the standard deviation captures. Real studies rarely achieve their planned power because of these practical complications.
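
The arithmetic for repeated studies is worth checking by hand. This is a minimal sketch, assuming five independent studies that each have exactly 80% power against a real effect.

```python
power = 0.80
n_studies = 5

# Probability that every study reaches significance
p_all_significant = power ** n_studies

# Probability that at least one study misses the effect
p_at_least_one_miss = 1 - p_all_significant

print(f"all five significant:           {p_all_significant:.2f}")
print(f"at least one non-significant:   {p_at_least_one_miss:.2f}")
```

The two probabilities are complements: roughly one in three for a clean sweep, roughly two in three for at least one miss.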

Key Takeaway

Statistical power is the probability of detecting a true effect, determined by sample size, effect size, alpha, and data variability. Always conduct prospective power analysis to ensure your study can answer its research question, and aim for at least 80% power against the smallest effect you consider meaningful. Remember that underpowered studies do not simply fail quietly; they actively distort the literature through inflated published effect sizes and selective reporting.