How to Do a T-Test: Comparing Means Between Two Groups
Performing a t-test correctly requires choosing the right variant for your design, verifying that assumptions are met, computing the test statistic, and interpreting the results in context. The following steps walk through the complete process.
Step 1: Choose the Type of T-Test
The independent samples t-test (also called unpaired or two-sample t-test) compares means from two separate groups with different participants in each. Examples: comparing test scores between students taught by Method A versus Method B, or comparing recovery times between Drug and Placebo groups. The two groups must be independent, meaning that one participant's score does not influence another's.
The paired samples t-test (also called dependent t-test) compares two measurements from the same participants. Examples: comparing blood pressure before and after treatment in the same patients, or comparing performance on a task under two conditions where each person experiences both conditions. Pairing controls for individual differences, which increases statistical power.
The one-sample t-test compares a sample mean to a known or hypothesized population value. Example: testing whether students at a particular school score differently from the national average of 100 on a standardized test. This variant is also used to test whether a mean difference score differs from zero, which is mathematically equivalent to the paired t-test.
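To make the effect of pairing concrete, the sketch below analyzes the same ten numbers twice: once as paired difference scores (which is the one-sample test of the differences against zero) and once as if they came from two unrelated groups. The blood-pressure readings are invented for illustration, and only Python's standard library is used so the arithmetic stays visible.

```python
import math
import statistics

# Hypothetical blood-pressure readings for five patients (illustrative only).
before = [142, 150, 138, 160, 155]
after = [135, 146, 130, 151, 149]

# Paired view: reduce each patient to a single difference score, then run a
# one-sample t-test of those differences against zero.
diffs = [b - a for b, a in zip(before, after)]
n_pairs = len(diffs)
paired_t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n_pairs))

# Independent view (Welch form): ignores which readings belong to which patient,
# so between-patient variability stays in the standard error.
se_independent = math.sqrt(
    statistics.variance(before) / len(before)
    + statistics.variance(after) / len(after)
)
independent_t = (statistics.mean(before) - statistics.mean(after)) / se_independent

print(f"paired t = {paired_t:.2f}, independent t = {independent_t:.2f}")
```

The mean difference is identical in both analyses. Only the standard error changes, because the difference scores remove the large patient-to-patient variation.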
Step 2: Check Assumptions
All t-tests assume the data (or differences, for paired tests) are approximately normally distributed within groups. With roughly 30 or more observations per group, t-tests are usually robust to moderate non-normality, because the central limit theorem makes the sampling distribution of the mean approximately normal even when the raw data are not. For smaller samples, check normality with histograms, Q-Q plots, or the Shapiro-Wilk test. Severe skewness or heavy tails in small samples suggest using nonparametric alternatives instead.
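As a rough illustration of what a Q-Q plot measures, the sketch below pairs each sorted observation with the matching standard normal quantile and reports how linear the relationship is. The function names, the plotting-position convention, and both data sets are assumptions made for this example. The correlation is only an informal screen, not a substitute for the Shapiro-Wilk test.

```python
import math
from statistics import NormalDist, mean

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    observed = sorted(sample)
    n = len(observed)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention among several.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed

def pearson_r(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical samples: one roughly symmetric, one strongly right-skewed.
roughly_symmetric = [48, 50, 51, 52, 53, 54, 55, 56, 58, 60]
right_skewed = [1, 1, 2, 2, 3, 3, 4, 6, 12, 40]

r_symmetric = pearson_r(*qq_points(roughly_symmetric))
r_skewed = pearson_r(*qq_points(right_skewed))
print(f"Q-Q correlation: symmetric = {r_symmetric:.3f}, skewed = {r_skewed:.3f}")
```

Points that fall close to a straight line (a correlation near 1) are consistent with normality. A clearly lower value, as for the skewed sample, is a cue to look at the plot itself.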
The classic (Student's) independent samples t-test additionally assumes equal variances between groups (homoscedasticity). Levene's test evaluates this assumption. When variances differ substantially, use Welch's t-test, which adjusts the degrees of freedom and does not assume equal variances. Some software defaults to Welch's version (R's t.test, for example), while other tools assume equal variances unless told otherwise (SciPy's ttest_ind, for example), so check which variant your software actually runs. Welch's version performs well regardless of whether variances are equal, making it a safe choice in practice.
All variants assume independence of observations: each participant's score (or, for paired tests, each pair's difference) should not influence another's. Violations of independence (e.g., students from the same classroom, repeated measurements beyond a pair) require different methods such as mixed-effects models or cluster-adjusted tests.
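A small sketch of why the equal-variance assumption matters: when the smaller group is also the noisier one, the pooled standard error can be much smaller than Welch's, which makes the pooled test overconfident. The summary statistics and function names below are hypothetical.

```python
import math

def pooled_se(sd1, n1, sd2, n2):
    """Standard error of the mean difference under the equal-variance assumption."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def welch_se(sd1, n1, sd2, n2):
    """Standard error of the mean difference without assuming equal variances."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Hypothetical design: a small, noisy group and a large, tightly clustered one.
sd_small, n_small = 20.0, 10
sd_large, n_large = 5.0, 40

se_pooled = pooled_se(sd_small, n_small, sd_large, n_large)
se_welch = welch_se(sd_small, n_small, sd_large, n_large)
print(f"pooled SE = {se_pooled:.2f}, Welch SE = {se_welch:.2f}")

# With equal group sizes and equal standard deviations the two agree.
se_pooled_balanced = pooled_se(5.0, 20, 5.0, 20)
se_welch_balanced = welch_se(5.0, 20, 5.0, 20)
```

In the balanced case the two standard errors coincide, which is part of why Welch's version costs little when variances really are equal.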
Outliers deserve special attention in t-tests because the mean and standard deviation are sensitive to extreme values. A single outlier can dramatically shift the group mean and inflate the standard deviation, potentially masking a real group difference or creating a spurious one. Examine your data for outliers before running the test. If outliers are genuine data points rather than errors, consider robust alternatives like trimmed means or bootstrapped confidence intervals that are less affected by extreme values.
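The sensitivity described above is easy to see numerically. In this sketch one value in a hypothetical sample is replaced by an extreme one, and the ordinary mean and standard deviation are compared with a 20% trimmed mean. The `trimmed_mean` helper and the trimming proportion are illustrative choices, not a recommendation.

```python
import statistics

def trimmed_mean(values, proportion=0.2):
    """Mean after dropping `proportion` of the observations from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of observations cut per tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)

# Hypothetical scores; the second list swaps the largest value for an outlier.
clean = [12, 14, 15, 15, 16, 17, 18, 19, 20, 21]
with_outlier = clean[:-1] + [95]

for label, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        f"{label}: mean = {statistics.mean(data):.1f}, "
        f"SD = {statistics.stdev(data):.1f}, "
        f"20% trimmed mean = {trimmed_mean(data):.1f}"
    )
```

The single extreme value moves the mean and inflates the standard deviation, while the trimmed mean is unchanged because the outlier falls in the discarded tail.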
Step 3: Calculate the T-Statistic
The t-statistic measures how many standard errors the observed difference between means falls from zero (or from the hypothesized difference). The general formula is: t = (observed difference - hypothesized difference) / standard error of the difference. A larger absolute t indicates more evidence against the null hypothesis of no difference. The sign of t indicates the direction of the difference.
For independent samples with Welch's formula: SE = sqrt(s1^2/n1 + s2^2/n2), where s1 and s2 are the sample standard deviations and n1 and n2 are the sample sizes. For paired samples: SE = sd(differences) / sqrt(n), where sd(differences) is the standard deviation of the difference scores and n is the number of pairs.
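The two standard-error formulas plug into the general t formula as follows. This is a minimal sketch using only the standard library. The function names, the `hypothesized_diff` parameter, and the test scores are assumptions made for the example.

```python
import math
import statistics

def welch_t(group1, group2, hypothesized_diff=0.0):
    """t = (observed difference - hypothesized difference) / Welch standard error."""
    se = math.sqrt(
        statistics.variance(group1) / len(group1)
        + statistics.variance(group2) / len(group2)
    )
    observed_diff = statistics.mean(group1) - statistics.mean(group2)
    return (observed_diff - hypothesized_diff) / se

def paired_t(first, second, hypothesized_diff=0.0):
    """t for paired data: mean difference over sd(differences) / sqrt(n pairs)."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(first, second)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return (statistics.mean(diffs) - hypothesized_diff) / se

# Hypothetical test scores for two teaching methods (different students).
method_a = [78, 85, 90, 72, 88, 95]
method_b = [70, 75, 80, 68, 74, 83]

t_ab = welch_t(method_a, method_b)
t_ba = welch_t(method_b, method_a)  # same magnitude, opposite sign
print(f"t (A - B) = {t_ab:.2f}, t (B - A) = {t_ba:.2f}")
```

Swapping the group order flips the sign of t without changing its magnitude, which is why the direction of subtraction should be stated when reporting.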
Step 4: Find the P-Value
Compare the calculated t-statistic to the t-distribution with the appropriate degrees of freedom. For independent samples with Welch's formula, df is calculated from sample sizes and variances using the Welch-Satterthwaite approximation, which typically produces a non-integer value. For paired samples, df = n - 1 where n is the number of pairs. The p-value is the probability of observing a t-statistic at least as extreme as yours, assuming no true difference exists.
For a two-tailed test, the p-value accounts for extreme values in both directions (both positive and negative t). For a one-tailed test, it only considers one direction. Use two-tailed tests unless you have a strong, pre-specified reason to predict the direction of the effect. Switching from two-tailed to one-tailed after seeing the data is a form of p-hacking that inflates false positive rates.
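For readers who want to see where the numbers come from, the sketch below computes the Welch-Satterthwaite degrees of freedom and then obtains p-values by numerically integrating the t density with Simpson's rule. This is an illustration under stated assumptions, not how statistical packages implement it: the helper names, the step count, and the summary statistics are all invented for the example, and in practice a statistics library would supply the p-value directly.

```python
import math

def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    a, b = var1 / n1, var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

def t_pdf(x, df):
    """Density of Student's t distribution (df may be non-integer)."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """P(|T| >= |t|), via Simpson's rule on [0, |t|]; `steps` must be even."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return max(0.0, 1.0 - 2.0 * area_zero_to_t)

def one_tailed_p(t, df):
    """P(T >= t): one-tailed p-value when a positive effect was predicted."""
    half = two_tailed_p(t, df) / 2
    return half if t >= 0 else 1.0 - half

# Hypothetical summary statistics for two independent groups.
mean_a, sd_a, n_a = 24.1, 4.2, 12
mean_b, sd_b, n_b = 20.3, 7.9, 18

se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
t_stat = (mean_a - mean_b) / se
df = welch_df(sd_a ** 2, n_a, sd_b ** 2, n_b)
p_two = two_tailed_p(t_stat, df)
p_one = one_tailed_p(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df:.1f}, two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")
```

When the observed effect lies in the predicted direction, the one-tailed p-value is half the two-tailed one, which is exactly why choosing the tail after seeing the data is misleading.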
Statistical software reports exact p-values (e.g., p = 0.0342) rather than just comparing to the 0.05 threshold. Always report the exact p-value because it carries more information than a simple significant/not-significant binary. A result with p = 0.049 has similar evidential strength to one with p = 0.051, despite falling on different sides of the conventional threshold. Reporting exact p-values lets readers apply their own standards and judge the strength of evidence for themselves.
Step 5: Report Results with Effect Size
A complete t-test report includes: group means and standard deviations, the t-value, degrees of freedom, exact p-value, confidence interval for the difference, and Cohen's d (difference between means divided by pooled standard deviation). Example: "The treatment group (M = 82.3, SD = 9.1) scored significantly higher than the control group (M = 75.6, SD = 10.4), t(58) = 2.67, p = .010, d = 0.69, 95% CI [1.6, 11.8]."
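The figures in the example sentence can be approximately reconstructed from its summary statistics. The sketch below assumes 30 participants per group (t(58) implies 60 in total under the pooled-variance test, but the split is not stated) and uses a critical value copied from a t table. Both assumptions are marked in the comments.

```python
import math

# Summary statistics taken from the example report.
mean_treat, sd_treat = 82.3, 9.1
mean_ctrl, sd_ctrl = 75.6, 10.4

# Assumption: equal group sizes of 30, consistent with df = 58 but not stated.
n_treat = n_ctrl = 30

# Assumption: two-sided 95% critical value of t with 58 df, from a t table.
T_CRIT_58 = 2.0017

diff = mean_treat - mean_ctrl
pooled_sd = math.sqrt(
    ((n_treat - 1) * sd_treat ** 2 + (n_ctrl - 1) * sd_ctrl ** 2)
    / (n_treat + n_ctrl - 2)
)
cohens_d = diff / pooled_sd                      # standardized effect size
se = pooled_sd * math.sqrt(1 / n_treat + 1 / n_ctrl)
t_stat = diff / se
ci_low = diff - T_CRIT_58 * se
ci_high = diff + T_CRIT_58 * se

print(f"d = {cohens_d:.2f}, t = {t_stat:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Small discrepancies from the reported values are expected, because the means and standard deviations in the sentence are themselves rounded.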
Cohen's d provides a standardized effect size that allows comparison across studies using different measurement scales. Conventional benchmarks are d = 0.2 (small), d = 0.5 (medium), and d = 0.8 (large), though context determines what counts as meaningful. A medical treatment with d = 0.3 might save thousands of lives when applied to millions of patients, while a d = 0.8 difference in preference for two flavors of ice cream may be inconsequential.
When to Use Alternatives
Use ANOVA when comparing three or more groups, because running multiple t-tests inflates the Type I error rate. Use nonparametric alternatives (Mann-Whitney U for independent samples, Wilcoxon signed-rank for paired) when data are ordinal or severely non-normal with small samples. Use regression when you want to control for covariates while comparing groups, or when you have both continuous and categorical predictors.
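The inflation from multiple t-tests can be quantified with a short calculation. The sketch below counts the pairwise comparisons among k groups and applies 1 - (1 - alpha)^m, which treats the m tests as independent. That is an approximation, since pairwise comparisons share data, and the function name is illustrative.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes independent tests, so the rate is approximate for pairwise t-tests.
    """
    n_tests = comb(n_groups, 2)
    return n_tests, 1 - (1 - alpha) ** n_tests

for k in (3, 4, 5):
    n_tests, rate = familywise_error_rate(k)
    print(f"{k} groups: {n_tests} pairwise tests, familywise error rate ~ {rate:.2f}")
```

Even three groups push the chance of at least one false positive well above the nominal 5%.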
For designs with more than two time points or conditions per participant, repeated-measures ANOVA or mixed-effects models replace the paired t-test. For designs where the outcome is binary (yes/no, pass/fail) rather than continuous, chi-square tests or logistic regression are appropriate instead of t-tests. Choosing the right test for your data structure and research question is as important as executing the test correctly.
The bootstrap t-test provides an alternative when assumptions about the sampling distribution of the t-statistic are questionable. Rather than relying on theoretical t-distributions, bootstrapping resamples the data thousands of times to empirically construct the sampling distribution. This approach can accommodate non-normal data and unequal variances without relying on the Welch degrees-of-freedom approximation, though it performs poorly with very small samples (fewer than 10-15 per group), where the resampling process cannot adequately represent the population.
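One common way to set up a bootstrap t-test is sketched below: both groups are shifted to share the combined mean (so the null hypothesis holds in the resampling world), each group is resampled with replacement, and the p-value is the share of resampled t-statistics at least as extreme as the observed one. The function names, resample count, seed, and scores are assumptions for illustration. Other bootstrap schemes exist.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch-form t-statistic (unpooled standard error)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def bootstrap_t_test(a, b, n_resamples=2000, seed=0):
    """Return (observed t, two-sided bootstrap p-value)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    t_obs = welch_t(a, b)

    # Impose the null hypothesis: recenter both groups on the combined mean.
    grand_mean = statistics.mean(list(a) + list(b))
    a_null = [x - statistics.mean(a) + grand_mean for x in a]
    b_null = [x - statistics.mean(b) + grand_mean for x in b]

    extreme = usable = 0
    for _ in range(n_resamples):
        resample_a = rng.choices(a_null, k=len(a_null))
        resample_b = rng.choices(b_null, k=len(b_null))
        try:
            t_star = welch_t(resample_a, resample_b)
        except ZeroDivisionError:
            continue  # degenerate resample with zero variance in both groups
        usable += 1
        if abs(t_star) >= abs(t_obs):
            extreme += 1
    return t_obs, extreme / usable

# Hypothetical scores, 15 per group.
group_a = [78, 85, 90, 72, 88, 95, 81, 86, 79, 92, 84, 76, 89, 83, 87]
group_b = [70, 75, 80, 68, 74, 83, 71, 77, 73, 79, 66, 81, 72, 76, 69]

t_obs, p_boot = bootstrap_t_test(group_a, group_b)
print(f"observed t = {t_obs:.2f}, bootstrap p = {p_boot:.4f}")
```

The resolution of the p-value is limited by the number of resamples, so a result of 0 should be read as "smaller than about 1 / n_resamples" rather than as exactly zero.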
The t-test compares two group means by calculating how many standard errors apart they fall. Choose the independent or paired version based on your design, verify the normality and independence assumptions (and equal variances, unless you use Welch's version), and always report Cohen's d alongside the p-value to communicate both statistical and practical significance.