How to Do Hypothesis Testing: A Step-by-Step Guide
Hypothesis testing underlies nearly every published research finding that claims a statistically significant result. Understanding the procedure, its logic, and its limitations is essential for both conducting and critically evaluating scientific research.
Step 1: State the Null and Alternative Hypotheses
The null hypothesis (H0) represents the default position, typically stating that there is no effect, no difference, or no relationship. It is the claim you are trying to find evidence against. Examples include: "the new drug has no effect on blood pressure," "there is no difference in test scores between the two teaching methods," or "there is no correlation between exercise frequency and sleep quality."
The alternative hypothesis (H1 or Ha) states what you believe might be true and are testing for. It contradicts the null hypothesis, so the two cannot both be true. A two-tailed alternative states that the parameter differs from the null value in either direction (the drug changes blood pressure, either up or down). A one-tailed alternative specifies a direction (the drug lowers blood pressure). Two-tailed tests are more conservative and more common in practice because they can also detect unexpected effects in the opposite direction.
The hypotheses must be stated before looking at the data. Formulating hypotheses after seeing results (known as HARKing, or Hypothesizing After the Results are Known) invalidates the logic of the test and inflates false positive rates because you are effectively testing only hypotheses that the data already support.
Step 2: Choose the Significance Level and Appropriate Test
The significance level (alpha) is the threshold for the p-value: if the p-value falls below alpha, you reject the null hypothesis. The conventional choice is alpha = 0.05, meaning you accept a 5% risk of falsely rejecting a true null hypothesis (a Type I error). More stringent thresholds like 0.01 or 0.001 are appropriate when the consequences of a false positive are severe, such as in drug approval or physics discoveries. Less stringent thresholds like 0.10 might be used in exploratory research where missing a real effect (Type II error) is the greater concern.
The choice of statistical test depends on the type of data, the number of groups, and the research question. Use a t-test for comparing means between two groups, ANOVA for three or more groups, a chi-square test for categorical data, or regression for continuous relationships. Each test has assumptions (normality, independence, equal variances) that should be checked before proceeding.
Step 3: Calculate the Test Statistic
The test statistic quantifies how far your observed data falls from what the null hypothesis predicts, measured in standardized units. Different tests produce different statistics (t, z, F, chi-square), but all share the same logic: they compare the observed effect to the amount of random variation expected under the null hypothesis.
For a one-sample t-test, the formula is: t = (sample mean - hypothesized mean) / (standard error of the mean). The standard error equals the sample standard deviation divided by the square root of the sample size. A test statistic that is large in absolute value means the observed result is many standard errors away from the null hypothesis value, suggesting the null may be wrong.
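The formula above can be sketched in a few lines of Python using only the standard library. The function name one_sample_t and the blood pressure readings are hypothetical, chosen purely for illustration.

```python
import math
import statistics

def one_sample_t(sample, hypothesized_mean):
    """Return (t, standard_error, degrees_of_freedom) for a one-sample t-test."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)        # sample SD (n - 1 in the denominator)
    standard_error = sample_sd / math.sqrt(n)   # SD divided by sqrt(sample size)
    t = (sample_mean - hypothesized_mean) / standard_error
    return t, standard_error, n - 1

# Hypothetical systolic blood pressure readings; the null value of 120 is assumed.
readings = [118, 122, 125, 130, 121, 127, 124, 129]
t_stat, se, df = one_sample_t(readings, hypothesized_mean=120)
print(f"t = {t_stat:.2f}, standard error = {se:.2f}, df = {df}")
```

Here the sample mean (124.5) sits roughly three standard errors above the hypothesized value of 120, which is what the t statistic reports.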
Software computes test statistics automatically, but understanding the formula reveals the logic: the test statistic is essentially a signal-to-noise ratio. The numerator measures the signal (how far the observed value deviates from the null), and the denominator measures the noise (how much random variation you expect given your sample size). Large signals relative to noise produce large test statistics and small p-values.
Step 4: Determine the P-Value
The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. It answers the question: "If there truly were no effect, how surprising would my data be?"
A p-value of 0.03 means that if the null hypothesis were true, data this extreme or more extreme would occur only about 3% of the time. A p-value of 0.50 means that data at least this extreme would occur half the time under the null hypothesis, so the result is unremarkable and provides essentially no evidence against it.
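As a rough illustration, the sketch below converts a test statistic into a p-value. It assumes the standard normal distribution as the reference distribution (a z-test, or a large-sample approximation); a t-test would use the t distribution with the appropriate degrees of freedom instead, typically via a statistics library. The function name z_p_value and its alternative argument are hypothetical.

```python
from statistics import NormalDist

def z_p_value(z, alternative="two-sided"):
    """P-value for a z statistic under the standard normal distribution."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "two-sided":
        # Probability of a statistic at least this far from 0 in either direction.
        return 2 * (1 - std_normal.cdf(abs(z)))
    if alternative == "greater":
        return 1 - std_normal.cdf(z)
    if alternative == "less":
        return std_normal.cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

print(round(z_p_value(1.96), 3))              # two-tailed
print(round(z_p_value(1.96, "greater"), 3))   # one-tailed, upper direction
```

For the same statistic, the one-tailed p-value in the observed direction is half the two-tailed one, which is why the direction must be fixed before seeing the data.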
The p-value is NOT the probability that the null hypothesis is true. It is NOT the probability that your results occurred by chance. It IS the probability of the observed (or more extreme) data under the assumption that the null is true. This distinction matters enormously for correct interpretation.
Step 5: Make a Decision and Interpret
If the p-value is less than your predetermined alpha level, reject the null hypothesis in favor of the alternative. If the p-value exceeds alpha, fail to reject the null hypothesis. Note the asymmetry: you never "accept" the null hypothesis, you only "fail to reject" it, because absence of evidence is not evidence of absence. A non-significant result might mean the null is true, or it might mean your study lacked the statistical power to detect a real effect.
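The decision rule is simple enough to write down directly. This is a minimal sketch; the function name decide and its return strings are illustrative, and the wording deliberately avoids "accept H0".

```python
def decide(p_value, alpha=0.05):
    """Apply the reject / fail-to-reject rule for a pre-chosen alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # Reject only when the p-value falls below the predetermined threshold.
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(0.03))              # below the default alpha of 0.05
print(decide(0.03, alpha=0.01))  # same p-value, stricter threshold
print(decide(0.20))
```

The same p-value can lead to different decisions under different thresholds, which is why alpha has to be fixed in advance.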
Beyond the binary reject/fail-to-reject decision, report the effect size (how large the difference or relationship is), the confidence interval (the range of plausible values for the population parameter), and the practical significance (whether the effect matters in real-world terms). A statistically significant result with a tiny effect size may have no practical importance. A non-significant result with a wide confidence interval signals that more data is needed rather than that no effect exists.
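One way to report these quantities for a one-sample comparison is sketched below. It assumes Cohen's d (mean difference divided by the sample standard deviation) as the effect size, which is one common choice among several. By default the interval uses a normal critical value, which is only a large-sample approximation; for a small sample, pass the t critical value for n - 1 degrees of freedom, as the example does with a value taken from a t table. The function name and data are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def effect_size_and_ci(sample, hypothesized_mean, confidence=0.95, critical_value=None):
    """Return (Cohen's d, (ci_low, ci_high)) for a one-sample comparison."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)
    standard_error = sample_sd / math.sqrt(n)
    if critical_value is None:
        # Large-sample approximation: normal quantile instead of a t quantile.
        critical_value = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    cohens_d = (sample_mean - hypothesized_mean) / sample_sd
    margin = critical_value * standard_error
    return cohens_d, (sample_mean - margin, sample_mean + margin)

# Hypothetical readings; 2.365 is the two-sided 95% t critical value for 7 df.
readings = [118, 122, 125, 130, 121, 127, 124, 129]
d, ci = effect_size_and_ci(readings, hypothesized_mean=120, critical_value=2.365)
print(f"Cohen's d = {d:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

An interval that excludes the null value agrees with a significant two-tailed test at the matching alpha, and its width shows how precisely the parameter has been estimated.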
Type I and Type II Errors
Two kinds of errors are possible in hypothesis testing. A Type I error (false positive) occurs when you reject a true null hypothesis, concluding an effect exists when it does not. When the null hypothesis is true, the probability of a Type I error equals alpha, your significance level. A Type II error (false negative) occurs when you fail to reject a false null hypothesis, missing a real effect. The probability of a Type II error is denoted beta, and 1 - beta is the statistical power of the test. Beta is not a single fixed number: it depends on the true effect size, the variability of the data, the sample size, and alpha.
There is an inherent trade-off between these errors. For a fixed sample size, lowering alpha (making it harder to reject H0) reduces Type I errors but increases Type II errors. The most direct way to reduce both simultaneously is to increase the sample size, which makes the test more sensitive to real effects while maintaining the same false positive rate. Reducing measurement noise or using a more efficient study design has a similar effect.
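A small simulation can make the Type I error rate concrete. The sketch below repeatedly draws samples from a population where the null hypothesis is true by construction and counts how often a two-tailed z-test rejects it. The known standard deviation, the sample size, the number of simulations, and the seed are all assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

def simulate_type_i_rate(n_sims=4000, n=30, alpha=0.05, seed=0):
    """Fraction of simulated studies that reject H0 when H0 is actually true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_sims):
        # H0 is true by construction: population mean 0, known SD 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = sample_mean / (1.0 / math.sqrt(n))
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1  # a false positive, since H0 is true
    return rejections / n_sims

rate = simulate_type_i_rate()
print(f"Observed false positive rate: {rate:.3f}")
```

The observed rate should land close to alpha, with some simulation noise.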
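The trade-off can be illustrated with the power of a two-tailed z-test, which has a closed form. This sketch assumes a known population standard deviation and a specified true effect; the function name z_test_power and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_power(effect, sd, n, alpha=0.05):
    """Power (1 - beta) of a two-tailed one-sample z-test with known sd."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # True effect expressed in standard errors of the sample mean.
    shift = effect * math.sqrt(n) / sd
    # Probability the statistic lands in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

# Power grows with sample size while alpha stays fixed at 0.05.
for n in (10, 20, 40, 80):
    print(n, round(z_test_power(effect=0.5, sd=1.0, n=n), 3))

# At a fixed sample size, a stricter alpha lowers power (raises beta).
print(round(z_test_power(0.5, 1.0, 32, alpha=0.05), 3))
print(round(z_test_power(0.5, 1.0, 32, alpha=0.01), 3))
```

When the true effect is zero, the same formula returns alpha, which ties the power calculation back to the Type I error rate.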
Common Pitfalls
Several common mistakes plague hypothesis testing in practice. P-hacking involves running many tests or manipulating analysis choices until a significant result appears, which inflates the actual false positive rate well above the nominal 5%. Running multiple comparisons without correction has a similar effect: testing 20 independent hypotheses at alpha = 0.05 yields one expected false positive even when all null hypotheses are true, and the chance of at least one false positive is about 64% (1 - 0.95^20). Confusing statistical and practical significance leads researchers to overstate the importance of trivially small effects detected in very large samples.
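The arithmetic behind the multiple comparisons problem is short enough to check directly. The sketch below assumes independent tests with every null hypothesis true. It also shows the Bonferroni-adjusted threshold (alpha divided by the number of tests), which is one simple correction and not the only option.

```python
def multiple_testing_summary(num_tests, alpha=0.05):
    """Return (expected false positives, family-wise error rate, Bonferroni alpha).

    Assumes independent tests and that every null hypothesis is true.
    """
    expected_false_positives = num_tests * alpha
    # Probability of at least one false positive across all tests.
    family_wise_error_rate = 1 - (1 - alpha) ** num_tests
    # Per-test threshold that keeps the family-wise rate at or below alpha.
    bonferroni_alpha = alpha / num_tests
    return expected_false_positives, family_wise_error_rate, bonferroni_alpha

expected, fwer, adjusted = multiple_testing_summary(20)
print(f"Expected false positives: {expected:.2f}")
print(f"Chance of at least one false positive: {fwer:.3f}")
print(f"Bonferroni per-test alpha: {adjusted:.4f}")
```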
Pre-registration, where you specify your hypotheses and analysis plan before collecting data, protects against many of these pitfalls by preventing post-hoc adjustments that exploit random patterns in the data. Reporting effect sizes and confidence intervals alongside p-values provides a more complete picture than a binary significant/not-significant decision alone.
Hypothesis testing follows a structured five-step process: state hypotheses, choose alpha and test, calculate the test statistic, determine the p-value, and make a decision. Always report effect sizes and confidence intervals alongside p-values, and never confuse statistical significance with practical importance.