How to Do Power Analysis: Calculating the Right Sample Size

Updated May 2026
Power analysis is the mathematical method for calculating the sample size your experiment needs to detect a real effect with a specified probability. It connects four quantities: sample size, effect size, significance level, and statistical power. Given any three of these values, power analysis computes the fourth, most commonly determining the sample size needed for a target level of power.
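As a rough illustration of how the four quantities trade off, the sketch below solves for each one in turn for a two-sided comparison of two independent groups. It assumes the normal approximation to the t-test, and the function names are invented for this example; exact t-based software gives slightly larger sample sizes (64 rather than 63 per group for d = 0.5).

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)

def n_per_group(d, alpha=0.05, power=0.80):
    """Solve for sample size: participants per group (normal approximation)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

def achieved_power(d, n, alpha=0.05):
    """Solve for power given n per group (ignores the negligible opposite tail)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(d * sqrt(n / 2) - z_alpha)

def detectable_effect(n, alpha=0.05, power=0.80):
    """Solve for the smallest standardized effect detectable with n per group."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n)

n = n_per_group(0.5)              # sample size for a medium effect
power = achieved_power(0.5, n)    # power actually reached with that n
min_d = detectable_effect(n)      # effect size that n can detect at 80% power
```

Fixing any three inputs pins down the fourth, which is why the same formula can be rearranged to answer three different planning questions.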

Skipping power analysis is one of the most common and most damaging mistakes in experimental research. Without it, researchers either collect too few participants (producing underpowered studies that are unlikely to detect real effects) or too many (wasting time, money, and participant effort). Neither outcome serves science well.

Identify Your Statistical Test

Each statistical test has its own power function, so the first step is choosing the test that matches your experimental design. An independent-samples t-test compares two group means in a between-subjects design. A paired t-test compares two conditions in a within-subjects design. One-way ANOVA compares three or more group means. Factorial ANOVA handles multi-factor designs. A correlation test assesses the strength of a linear relationship. Chi-squared tests compare proportions or categorical distributions.

The design determines the test, and the test determines the power formula. A within-subjects design comparing two conditions uses the paired t-test power formula, which usually produces smaller required sample sizes than the independent-samples formula because positively correlated repeated measurements reduce the variance of the difference scores. Getting the test wrong produces incorrect sample size estimates.
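The difference between the two formulas can be sketched as follows, again under the normal approximation. The correlation between the two measurements (rho) is an assumed input that the article does not specify, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def _z_sum(alpha, power):
    """Sum of the critical value and the power quantile (two-sided test)."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)

def total_n_independent(d, alpha=0.05, power=0.80):
    """Total participants across two independent groups."""
    per_group = ceil(2 * (_z_sum(alpha, power) / d) ** 2)
    return 2 * per_group

def n_paired(d, rho, alpha=0.05, power=0.80):
    """Participants needed when each person is measured in both conditions.

    rho is the assumed correlation between the two measurements; the effect
    size on the difference scores is d_z = d / sqrt(2 * (1 - rho)).
    """
    d_z = d / sqrt(2 * (1 - rho))
    return ceil((_z_sum(alpha, power) / d_z) ** 2)

between = total_n_independent(0.5)   # between-subjects total
within = n_paired(0.5, rho=0.5)      # within-subjects total, assuming rho = 0.5
```

The higher the correlation between the two measurements, the larger the saving; with rho near zero the paired design still needs fewer people because each participant contributes two observations.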

Determine the Expected Effect Size

The effect size is the most critical input and the hardest to estimate. Cohen's d (the difference between means in standard deviation units) is used for t-tests. Cohen's f (the standard deviation of the group means divided by the within-group standard deviation) is used for ANOVA. The correlation coefficient r is used for correlation tests. Cohen's w is used for chi-squared tests of proportions.

There are three sources for effect size estimates, listed here in order of preference. Published meta-analyses of similar interventions provide the most reliable estimates because they aggregate multiple studies. Pilot study data from your own lab provide estimates specific to your materials and population, though small pilot samples produce imprecise estimates. Cohen's conventional benchmarks (small = 0.2, medium = 0.5, large = 0.8 for d) provide rough guidance when no empirical data exist, but they are arbitrary and should be used only as a last resort.
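For the pilot-data route, Cohen's d can be computed directly from the two groups' scores using the pooled standard deviation. The sketch below uses made-up pilot scores purely for illustration; the function name is an assumption, not part of any package.

```python
from math import sqrt
from statistics import fmean, variance

def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (fmean(group_a) - fmean(group_b)) / sqrt(pooled_var)

# Hypothetical pilot scores, invented for this example only
treatment = [14, 17, 15, 19, 16, 18]
control = [13, 15, 14, 16, 12, 17]

pilot_d = cohens_d(treatment, control)
```

With only six participants per group, an estimate like this carries a very wide margin of error, which is why pilot-based effect sizes are best treated as an upper guide rather than a planning target.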

When in doubt, power for a smaller effect than you expect. If your pilot suggests d = 0.6 but you power for d = 0.4, the study will have more than enough power if the true effect is 0.6 and will still reach its target power if the true effect is as small as 0.4. That margin matters because effect sizes estimated from small samples tend to be inflated.
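The safety margin from planning around d = 0.4 can be checked numerically. This sketch assumes a two-sided, two-group comparison at alpha = 0.05 and uses the normal approximation, so the figures are approximate; the helper names are invented here.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def n_per_group(d, alpha=0.05, power=0.80):
    """Participants per group under the normal approximation."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

def achieved_power(true_d, n, alpha=0.05):
    """Power if the true effect is true_d and each group has n participants."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(true_d * sqrt(n / 2) - z_alpha)

planned_n = n_per_group(0.4)  # plan conservatively for d = 0.4

# Power the study would have across a range of possible true effects
power_by_effect = {
    true_d: achieved_power(true_d, planned_n)
    for true_d in (0.3, 0.4, 0.5, 0.6)
}
```

Inspecting power_by_effect shows power well above 0.95 if the pilot estimate of 0.6 is right, just over 0.80 at the planned 0.4, and a clear shortfall only if the true effect drops to 0.3.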

Set Alpha and Desired Power

Alpha (the significance level) is conventionally set at 0.05, but some contexts warrant different thresholds. Fields with high replication demands may use 0.005, as proposed by Benjamin et al. (2018) for claims of new discoveries. Exploratory studies might use 0.10 to increase sensitivity. Multiple comparison corrections (Bonferroni, Holm, FDR) effectively reduce alpha for each individual test, requiring larger samples to maintain power.
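To see how a stricter threshold feeds into sample size, the sketch below applies a Bonferroni correction (alpha divided by the number of tests) and recomputes the per-group n. It assumes a two-sided, two-group comparison with d = 0.5 and the normal approximation; the numbers of tests are arbitrary examples.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha, power=0.80):
    """Participants per group under the normal approximation."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

family_alpha = 0.05

# Per-test alpha and required n for 1, 3, and 10 planned comparisons
bonferroni_plan = {
    m: (family_alpha / m, n_per_group(0.5, family_alpha / m))
    for m in (1, 3, 10)
}

# Required n at the stricter 0.005 threshold
strict_n = n_per_group(0.5, alpha=0.005)
```

Each additional planned comparison lowers the per-test alpha and raises the sample needed to keep power at 0.80, so the number of tests should be fixed before the calculation is run.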

Power is conventionally targeted at 0.80, meaning an 80 percent chance of detecting a real effect of the assumed size. For important studies with high stakes, 0.90 or 0.95 may be appropriate. Higher power requires larger samples: at a two-sided alpha of 0.05, going from 0.80 to 0.90 power increases the required sample by roughly a third, and going from 0.80 to 0.95 increases it by roughly two-thirds.
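The cost of extra power can be worked out without choosing an effect size, because under the normal approximation the required sample is proportional to the squared sum of the two z quantiles. The function name below is an assumption made for this sketch.

```python
from statistics import NormalDist

def relative_sample_size(power, baseline_power=0.80, alpha=0.05):
    """Required n at `power` divided by required n at `baseline_power`.

    Uses the normal approximation for a two-sided test, where n is
    proportional to (z_alpha + z_power) squared, so the effect size cancels.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    target = (z_alpha + nd.inv_cdf(power)) ** 2
    baseline = (z_alpha + nd.inv_cdf(baseline_power)) ** 2
    return target / baseline

ratio_90 = relative_sample_size(0.90)  # about 1.34
ratio_95 = relative_sample_size(0.95)  # about 1.66
```

The ratios shift somewhat with alpha, so it is worth recomputing them if a threshold other than 0.05 is used.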

Compute Sample Size Using Software

G*Power is a free, user-friendly desktop application that handles most common designs. Select the test family (t-tests, F-tests, chi-squared, etc.), the specific test within that family, the type of analysis (a priori sample size calculation), and enter the effect size, alpha, and desired power. The software computes the required sample size instantly and can generate power curves showing how power changes with sample size.

In R, the pwr package provides functions for common tests: pwr.t.test() for t-tests, pwr.anova.test() for one-way ANOVA, pwr.r.test() for correlations, and pwr.chisq.test() for chi-squared tests. For complex designs (multilevel models, mixed ANOVAs, generalized linear models), the simr package uses Monte Carlo simulation to estimate power. You specify a model, set the parameters to the expected values, and the package simulates thousands of datasets and analyses to estimate the proportion that produce a significant result.
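The simulation idea itself is simple enough to sketch in a few lines. The Python example below is not the simr API; it is a minimal stand-in that simulates a plain two-group design, and it assumes normally distributed scores with unit variance and a large-sample z critical value in place of the exact t distribution.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_power(d, n_per_group, alpha=0.05, n_sims=2000, seed=1):
    """Estimate power as the share of simulated studies that reach significance."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = 0
    for _ in range(n_sims):
        # Generate one simulated dataset under the assumed effect size
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
        mean_c = sum(control) / n_per_group
        mean_t = sum(treated) / n_per_group
        var_c = sum((x - mean_c) ** 2 for x in control) / (n_per_group - 1)
        var_t = sum((x - mean_t) ** 2 for x in treated) / (n_per_group - 1)
        # Analyse it the way the real study would be analysed
        std_error = sqrt((var_c + var_t) / n_per_group)
        if abs(mean_t - mean_c) / std_error > z_crit:
            significant += 1
    return significant / n_sims

estimated_power = simulate_power(d=0.5, n_per_group=64)
false_positive_rate = simulate_power(d=0.0, n_per_group=64)
```

Running the same function with d = 0 is a useful sanity check: the proportion of significant results should land close to alpha. For complex designs the data-generating step and the analysis step are replaced with the real model, which is what packages like simr automate.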

Always report your power analysis in the Methods section of your paper. Specify the software used, the input parameters (test, effect size, alpha, power), the rationale for the chosen effect size, and the resulting sample size. Reviewers and readers need this information to evaluate whether the study was adequately powered.

Common Mistakes in Power Analysis

The most frequent mistake is using an unrealistically large effect size to justify a small, convenient sample. Researchers sometimes cherry-pick the largest effect size from prior literature, ignoring that published effect sizes are inflated by publication bias and small-sample noise. A more defensible approach is to use the smallest effect size of practical importance, the minimum difference that would be meaningful in applied terms, rather than the largest effect that has been reported.

Another common error is conducting power analysis after the study is complete (post-hoc power analysis). Post-hoc power computed from the observed effect size is a direct mathematical transformation of the observed p-value and provides no additional information beyond what the p-value already contains. If a result is non-significant, this observed power will inevitably be low, roughly 50 percent or less. Reporting post-hoc power as though it explains a null result is circular reasoning. The appropriate time for power analysis is during the design phase, when sample size can still be adjusted.
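The one-to-one link between the p-value and post-hoc power is easy to demonstrate for a two-sided z-test, which is the simplifying assumption behind this sketch. The function name is invented for the example.

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """Post-hoc power implied by a two-sided p-value from a z-test.

    The observed effect is treated as if it were the true effect, so the
    result depends on nothing except the p-value and alpha.
    """
    nd = NormalDist()
    z_observed = nd.inv_cdf(1 - p_value / 2)
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Probability of landing beyond either critical value
    return nd.cdf(z_observed - z_crit) + nd.cdf(-z_observed - z_crit)

at_threshold = observed_power(0.05)   # a result exactly at p = 0.05
non_significant = observed_power(0.20)
```

A result sitting exactly at the significance threshold has observed power of about one half, and every larger p-value maps to a lower figure, so a non-significant result can never come with high post-hoc power.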

Failing to account for the complexity of the design is another pitfall. A simple two-group comparison requires a different sample size calculation than a 2x3 factorial design, a repeated-measures design with multiple time points, or a multilevel design with participants nested within clusters. Using a formula intended for simple designs in a complex study will typically misestimate the required sample size; ignoring clustering or planning only for main effects when an interaction is the target, for example, underestimates it. Software such as G*Power and the R packages pwr and simr, along with custom simulation, handles complex designs more accurately than hand calculations.

Finally, researchers sometimes forget that power depends on the reliability of the outcome measure. An unreliable measure attenuates the observed effect size, requiring a larger sample to detect the same true effect. Measurement error inflates the observed standard deviation, so a standardized effect shrinks by the square root of the reliability. If the reliability of the dependent variable is 0.70, the observed effect size will be only about 84 percent of the true effect size, and the required sample grows by roughly 43 percent. Correcting for measurement unreliability in the power calculation yields a more accurate estimate of the required sample size.
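A reliability adjustment can be folded into the sample size calculation as sketched below. It assumes the classical attenuation formula, in which a standardized mean difference shrinks by the square root of the outcome's reliability, along with a two-group design and the normal approximation; the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def attenuated_d(true_d, reliability):
    """Expected observed d when the outcome has the given reliability."""
    return true_d * sqrt(reliability)

def n_per_group(d, alpha=0.05, power=0.80):
    """Participants per group under the normal approximation."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

true_d = 0.5
observed_d = attenuated_d(true_d, reliability=0.70)  # about 0.42

naive_n = n_per_group(true_d)         # ignores measurement error
adjusted_n = n_per_group(observed_d)  # plans for the attenuated effect
```

Because sample size scales with one over the squared effect size, the adjusted n is larger than the naive n by a factor of about one over the reliability.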

Key Takeaway

Power analysis is not optional. It is a fundamental part of experimental planning that determines whether your study has a reasonable chance of answering the question it asks. Run it before collecting data, not after.