How to Calculate Sample Size: Ensuring Your Study Has Enough Power
Performing a proper sample size calculation requires specifying four interconnected quantities: the effect size you want to detect, the significance level (alpha), the desired statistical power, and the variability in your data. The following steps walk through this process systematically.
Step 1: Specify the Minimum Effect Size of Interest
The effect size is the smallest difference or relationship that would be meaningful in your context. For a two-group comparison, this might be expressed as Cohen's d (the difference between means in standard deviation units). For correlations, it is the minimum r value of interest. For proportions, it is the minimum difference in percentages. Choosing the right effect size is a scientific judgment, not a statistical one.
A drug that reduces blood pressure by 1 mmHg is statistically detectable with a large enough sample but clinically meaningless. A drug that reduces blood pressure by 10 mmHg is clinically important. Your sample size should target effects large enough to matter in practice, not the smallest effects detectable by brute-force sample size. When prior research exists, use observed effect sizes from similar studies as a guide, but remember that published effect sizes are often inflated due to publication bias. Discounting them by 10-20% is a rough heuristic rather than a precise correction, and the true inflation can be larger, especially when the prior studies were small.
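To connect a raw difference such as the 10 mmHg example to Cohen's d, the sketch below divides the difference in means by the pooled standard deviation. The group means, the 20 mmHg standard deviation, and the group sizes are hypothetical values chosen only for illustration, and cohens_d is an illustrative helper name.

```python
import math

def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

# Hypothetical summary data: a 10 mmHg difference with an SD of 20 mmHg per arm.
d = cohens_d(140.0, 130.0, 20.0, 20.0, 30, 30)
print(round(d, 2))  # 0.5
```

With these assumed numbers the 10 mmHg difference corresponds to d = 0.5. The same raw difference would be a larger standardized effect in a less variable population.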
Step 2: Set Significance Level and Desired Power
The significance level (alpha) is the maximum acceptable probability of a Type I error (false positive), conventionally 0.05. Power is the probability of correctly detecting a true effect (1 minus the Type II error rate). Convention sets power at 0.80 (80% chance of detecting the effect if it exists), though 0.90 is preferred for important studies where missing a real effect would be costly.
Sample size, effect size, alpha, and power are interconnected: given any three, you can solve for the fourth. Typically, you fix effect size, alpha, and power, then solve for the required sample size. Increasing power requires more participants. Lowering alpha (making the test more stringent) also requires more participants. Accepting lower power, a higher alpha, or a larger target effect size reduces the required sample size, but the first two weaken the study's conclusions, and artificially inflating the target effect size defeats the purpose of the calculation.
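One way to see how these quantities trade off is to fix the sample size and solve for power instead. The sketch below uses a normal approximation for a two-sided, two-group comparison with equal group sizes. The function name approx_power is an illustrative choice, and exact t-based software will give slightly lower values.

```python
from statistics import NormalDist

def approx_power(n_per_group, d, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    noncentrality = d * (n_per_group / 2) ** 0.5
    # Ignores the negligible chance of rejecting in the wrong direction.
    return z.cdf(noncentrality - z_crit)

# Power rises with n and falls when alpha is made more stringent.
for n in (20, 64, 100):
    print(n, round(approx_power(n, d=0.5), 3), round(approx_power(n, d=0.5, alpha=0.01), 3))
```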
Step 3: Estimate Variability
Most sample size formulas require an estimate of the population standard deviation. Sources include pilot studies, previous research on similar populations, or published normative data. If no estimate is available, you can express the effect size in standardized units (Cohen's d), which absorbs the standard deviation into the effect size itself. This removes the need for a separate variance estimate, although the chosen d then carries an implicit assumption about how variable the data are.
The accuracy of your variability estimate directly affects the accuracy of your sample size calculation. An underestimated standard deviation produces a sample size that is too small, resulting in an underpowered study. An overestimated standard deviation wastes resources by recruiting more participants than needed. When uncertainty about variability is high, conduct a small pilot study specifically to estimate the standard deviation before planning the main study.
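The cost of an underestimated standard deviation can be illustrated with a short calculation. The sketch below plans a study for a raw difference of 5 units assuming a standard deviation of 8, then computes the power actually achieved if the true standard deviation is 10. All numbers and function names are hypothetical, and the normal approximation is used throughout.

```python
import math
from statistics import NormalDist

Z = NormalDist()

def planned_n(delta, sigma, alpha=0.05, power=0.80):
    """Per-group n for detecting a raw difference delta given an assumed SD."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return math.ceil(2 * (z_sum * sigma / delta) ** 2)

def achieved_power(n, delta, sigma, alpha=0.05):
    """Approximate power actually obtained with n per group and the true SD."""
    return Z.cdf((delta / sigma) * math.sqrt(n / 2) - Z.inv_cdf(1 - alpha / 2))

n = planned_n(delta=5, sigma=8)  # planned with an SD that turns out too small
print(n, round(achieved_power(n, delta=5, sigma=10), 2))
```

Because the required sample size scales with the square of the standard deviation, a 25% underestimate of the standard deviation in this example leaves the study well short of its 80% target.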
Step 4: Apply the Formula or Software
For a two-sample t-test with equal groups, the approximate formula is: n per group = 2 * (z_alpha/2 + z_beta)^2 / d^2, where d is Cohen's d, z_alpha/2 is the z-value for the significance level (1.96 for alpha = 0.05 two-tailed), and z_beta is the z-value for the desired power (0.84 for 80% power). For d = 0.5, alpha = 0.05, and 80% power, this gives 2 * (1.96 + 0.84)^2 / 0.25 = 62.7, which rounds up to 63 per group. Exact calculations based on the t-distribution give 64 per group (128 total), because the normal approximation slightly understates the requirement.
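The approximate formula can be written as a small function. This is a minimal sketch using the normal approximation and Python's standard library. The name n_per_group is an illustrative choice, and because the formula uses z-values rather than the t-distribution, its results land slightly below those of exact software.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # about 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d**2)

print(n_per_group(0.5))               # 63
print(n_per_group(0.5, power=0.90))   # higher power requires more participants
```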
Software tools handle more complex scenarios. G*Power (free) covers most standard tests including t-tests, ANOVA, regression, and chi-square. R packages like pwr and simr handle standard and simulation-based power analysis respectively. These tools account for unequal group sizes, multiple groups, repeated measures, and other design complexities that the simple formula cannot address. For novel or complex designs where analytical formulas do not exist, simulation-based power analysis generates thousands of hypothetical datasets and calculates the proportion that produce significant results.
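The simulation idea can be sketched in a few lines. The example below assumes normally distributed outcomes with unit variance and, as a simplification, compares the test statistic with a normal critical value instead of a t critical value, so the estimate is slightly optimistic for small groups. The function names, the seed, and the number of simulations are illustrative choices.

```python
import random
from statistics import NormalDist, fmean

def sample_var(xs):
    """Unbiased sample variance."""
    m = fmean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def simulated_power(n_per_group, d, alpha=0.05, n_sims=2000, seed=1):
    """Share of simulated two-group studies whose test statistic is significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(d, 1.0) for _ in range(n_per_group)]    # treatment group
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]  # control group
        se = ((sample_var(a) + sample_var(b)) / n_per_group) ** 0.5
        if abs(fmean(a) - fmean(b)) / se > z_crit:
            hits += 1
    return hits / n_sims

print(simulated_power(64, 0.5))
```

For a real design, the data-generating step and the test would be replaced with the planned model and analysis, which is what makes the approach useful when no formula exists.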
Step 5: Adjust for Practical Constraints
Increase your target sample size by 10-20% to account for participant dropout, missing data, or non-compliance. A more precise adjustment divides the calculated sample size by (1 minus the expected dropout rate), so that the number of participants remaining at the end matches the target. If you plan multiple comparisons or subgroup analyses, the effective alpha for each test is smaller (for example, under a Bonferroni correction), requiring larger samples. If your design involves covariates that explain substantial variance, the required sample size decreases because the effective error variance is reduced.
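These adjustments can be combined in a short calculation. The sketch below applies a Bonferroni-adjusted alpha for three planned comparisons and then inflates the result for an assumed 15% dropout by dividing by (1 - dropout rate), which is slightly more conservative than multiplying by 1.15. The number of comparisons, the dropout rate, and the function names are hypothetical, and the sample size formula is the normal approximation.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n (normal approximation, two-sided test)."""
    z = NormalDist()
    return math.ceil(2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2 / d**2)

def inflate_for_dropout(n, dropout_rate):
    """Enroll enough participants that about n remain after the expected dropout."""
    return math.ceil(n / (1 - dropout_rate))

base = n_per_group(0.5)                  # single test at alpha = 0.05
bonf = n_per_group(0.5, alpha=0.05 / 3)  # three planned comparisons
print(base, bonf, inflate_for_dropout(bonf, 0.15))
```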
Budget, time, and participant availability impose practical upper limits on sample size. When the calculated sample size exceeds what is feasible, you have several options: target a larger minimum effect size (accepting that smaller effects will go undetected), use a more efficient design (within-subjects rather than between-subjects, or adding covariates to reduce error variance), or use sequential analysis methods that allow early stopping when evidence is conclusive.
Common Benchmarks
For a two-group comparison at alpha = 0.05 (two-tailed) and 80% power, exact t-based calculations give the following: detecting a large effect (d = 0.8) requires about 26 per group, a medium effect (d = 0.5) requires about 64 per group, and a small effect (d = 0.2) requires about 394 per group. These numbers illustrate why detecting small effects demands substantially more resources than detecting large ones, and why specifying the minimum meaningful effect size is so important.
For confidence intervals, the desired margin of error determines sample size: n = (z * sigma / margin)^2. To estimate a mean within plus or minus 2 units with 95% confidence and an estimated standard deviation of 10, you need n = (1.96 * 10 / 2)^2 = 96.04, which rounds up to 97 observations. Narrower margins of error require quadratically more participants because precision scales with the square root of sample size: halving the margin of error roughly quadruples the required sample.
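The margin-of-error formula translates directly into code. This is a minimal sketch that assumes a known standard deviation and a normal critical value. The function name n_for_margin is illustrative.

```python
import math
from statistics import NormalDist

def n_for_margin(sigma, margin, confidence=0.95):
    """Observations needed to estimate a mean within +/- margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)

print(n_for_margin(10, 2))  # 97
print(n_for_margin(10, 1))  # 385: half the margin, about four times the sample
```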
Why Underpowered Studies Are Harmful
An underpowered study (conventionally, one with power below 80%) is likely to produce a non-significant result even when the effect is real. This wastes the resources invested in the study and exposes participants to research procedures without generating useful knowledge. Worse, the few significant results from underpowered studies tend to overestimate the true effect size, because only the largest observed effects cross the significance threshold. This winner's curse means that published effect sizes from small studies are often inflated and are unlikely to replicate at the reported magnitude.
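The winner's curse can be demonstrated with a small simulation. The sketch below assumes a true effect of d = 0.2 studied with only 20 participants per group, keeps the simulated studies that reach significance, and averages their estimated effect sizes. The true effect, group size, seed, and function name are hypothetical, and a normal critical value is used as a simplification of the t-test.

```python
import random
from statistics import NormalDist, fmean

def significant_estimates(true_d, n_per_group, n_sims=5000, alpha=0.05, seed=7):
    """Estimated Cohen's d from simulated studies that reached significance."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    kept = []
    for _ in range(n_sims):
        a = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        mean_a, mean_b = fmean(a), fmean(b)
        var_a = sum((x - mean_a) ** 2 for x in a) / (n_per_group - 1)
        var_b = sum((x - mean_b) ** 2 for x in b) / (n_per_group - 1)
        d_hat = (mean_a - mean_b) / ((var_a + var_b) / 2) ** 0.5
        # Keep the study only if its test statistic crosses the threshold.
        if abs(d_hat) * (n_per_group / 2) ** 0.5 > z_crit:
            kept.append(d_hat)
    return kept

sig = significant_estimates(true_d=0.2, n_per_group=20)
print(len(sig), round(fmean(sig), 2))
```

Under these assumptions only a small fraction of the simulated studies reach significance, and their average estimate is several times the true effect of 0.2.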
Meta-analyses combining many underpowered studies can partially compensate for individual studies lacking power, but this approach inherits all the biases of the contributing studies (publication bias, selective reporting) and cannot substitute for adequately powered individual studies. The best approach is to plan adequate power from the beginning and treat sample size calculation as a required step in study design rather than an afterthought.
Sample size planning requires specifying the smallest meaningful effect size, desired power (typically 80%), and significance level (typically 0.05). Use power analysis software before collecting data to ensure your study can answer your research question. Inflate the target to account for attrition and missing data, and remember that underpowered studies waste resources and produce inflated effect size estimates.