How to Do Regression Analysis: From Simple Linear to Multiple Regression
Simple linear regression models the relationship between one predictor and one outcome using a straight line. Multiple regression extends this to two or more predictors. Both follow the same fundamental logic and interpretation framework.
Step 1: Prepare and Explore Your Data
Before fitting a model, examine your data visually. Create a scatter plot of the outcome (Y axis) against the predictor (X axis). Look for a roughly linear pattern, which indicates that a straight line is an appropriate model. If the relationship is curved, consider transformations (log, square root) or polynomial terms. Identify any extreme outliers that might unduly influence the fitted line, and check whether both variables have adequate range and variability.
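As a numeric companion to the scatter plot, the effect of a transformation can be sketched by comparing the correlation before and after it. The data below are made up (an exact exponential curve), and `pearson_r` is a hypothetical helper written for this illustration, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up curved data: y grows exponentially with x.
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(0.5 * xi) for xi in x]

r_raw = pearson_r(x, y)                             # straight line fits imperfectly
r_log = pearson_r(x, [math.log(yi) for yi in y])    # log(y) is linear in x

print(f"correlation with raw y: {r_raw:.3f}")
print(f"correlation with log y: {r_log:.3f}")
```

A higher correlation after the transformation suggests the transformed outcome is closer to a straight-line relationship, but it does not replace looking at the plot.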
Verify that your data meets the basic requirements: the outcome should be measured on an interval or ratio scale (categorical predictors can be included once they are coded as indicator variables, but a categorical outcome needs a different model), observations should be independent of each other (no repeated measures or clustered data without appropriate modeling), and the sample should be large enough to estimate coefficients reliably (a common guideline is at least 10-20 observations per predictor variable).
Step 2: Fit the Regression Model
Linear regression finds the line that minimizes the sum of squared residuals, where a residual is the vertical distance between an observed data point and the fitted line. This method, called ordinary least squares (OLS), produces the best linear unbiased estimates of the slope and intercept when the errors have mean zero, constant variance, and no correlation with one another (the Gauss-Markov conditions).
The regression equation takes the form: Y = b0 + b1*X + error, where b0 is the intercept (the predicted Y value when X = 0), b1 is the slope (the expected change in Y for each one-unit increase in X), and the error term represents unexplained variation. Software computes these coefficients instantly, along with standard errors, t-statistics, and p-values for each coefficient.
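What the software does for a single predictor can be sketched with the closed-form OLS formulas. The study-hours data below are made up for illustration, and the function names are hypothetical. In practice a statistics package would also report p-values, which need the t distribution and are omitted here.

```python
import math

def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def slope_standard_error(xs, ys, intercept, slope):
    """Standard error of the slope, using n - 2 residual degrees of freedom."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    residual_variance = ssr / (n - 2)
    return math.sqrt(residual_variance / sxx)

# Made-up data: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [57, 61, 64, 69, 74]

b0, b1 = fit_simple_ols(hours, scores)
se_b1 = slope_standard_error(hours, scores, b0, b1)
t_b1 = b1 / se_b1  # t-statistic for the null hypothesis that the slope is 0

print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
print(f"SE(b1) = {se_b1:.3f}, t = {t_b1:.2f}")
```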
Step 3: Interpret Coefficients
The slope (b1) is usually the output of greatest interest. It tells you the expected change in the dependent variable for each one-unit increase in the independent variable. If regressing exam scores on study hours yields a slope of 4.2, each additional hour of study is associated with a 4.2-point increase in exam scores on average. The sign indicates direction: a positive slope means Y tends to increase as X increases, and a negative slope means Y tends to decrease as X increases.
The intercept (b0) is the predicted value of Y when all predictors equal zero. Sometimes this is meaningful (predicted salary at zero years of experience), sometimes it is not (predicted weight at zero height). When zero is outside the range of observed X values, the intercept serves a mathematical purpose but should not be interpreted substantively.
In multiple regression, each coefficient represents the expected change in Y for a one-unit increase in that predictor, holding all other predictors in the model constant. This "controlling for" interpretation is what makes multiple regression so valuable: it estimates each variable's association with the outcome net of the other included predictors. It cannot adjust for variables that were left out of the model.
Step 4: Evaluate Model Fit
R-squared (the coefficient of determination) measures the proportion of variance in Y that the model explains. It ranges from 0 (the model explains nothing) to 1 (the model explains all variation perfectly). An R-squared of 0.65 means the predictors collectively explain 65% of the variation in the outcome, with 35% remaining unexplained. What counts as a good value depends on the field: in social sciences, R-squared values of 0.20-0.40 are common and can still be useful, while in physics or engineering, values above 0.95 are often expected.
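The definition can be written out directly as one minus the ratio of residual to total sum of squares. This is a minimal sketch: the observed scores are made up, and the fitted values are assumed to come from a hypothetical fitted line 52.4 + 4.2 * hours for hours 1 through 5.

```python
def r_squared(observed, fitted):
    """Proportion of variance in the outcome explained by the fitted values."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))  # unexplained
    ss_tot = sum((y - mean_y) ** 2 for y in observed)             # total
    return 1 - ss_res / ss_tot

# Made-up exam scores and the values predicted by the assumed fitted line.
scores = [57, 61, 64, 69, 74]
fitted = [56.6, 60.8, 65.0, 69.2, 73.4]

r2 = r_squared(scores, fitted)
print(f"R-squared = {r2:.3f}")
```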
Adjusted R-squared penalizes for the number of predictors, preventing artificially inflated R-squared from adding irrelevant variables. It can decrease when a predictor adds no useful information. Use adjusted R-squared when comparing models with different numbers of predictors.
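The penalty can be seen in the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. The sketch below uses illustrative numbers (R² of 0.65, 50 observations), and the function name is hypothetical.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors slopes."""
    df_resid = n_obs - n_predictors - 1
    if df_resid <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1 - (1 - r2) * (n_obs - 1) / df_resid

# Same raw R-squared, different numbers of predictors (illustrative values).
with_3 = adjusted_r_squared(0.65, n_obs=50, n_predictors=3)
with_10 = adjusted_r_squared(0.65, n_obs=50, n_predictors=10)

print(f"3 predictors:  adjusted R-squared = {with_3:.3f}")
print(f"10 predictors: adjusted R-squared = {with_10:.3f}")
```

With the raw R-squared held fixed, the model that uses more predictors receives the lower adjusted value.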
The F-test evaluates whether the model as a whole explains significantly more variance than a model with no predictors (just the mean). A significant F-test (conventionally p < 0.05) is evidence that at least one predictor has a nonzero coefficient, though it does not tell you which ones. Statistical significance alone does not show that the relationship is large enough to matter in practice.
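The overall F statistic can be computed from R-squared, the sample size, and the number of predictors. The sketch below stops at the statistic and its degrees of freedom; the p-value comes from the F distribution, which the Python standard library does not provide, so a statistics package would be needed for that step. The input values are illustrative.

```python
def f_statistic(r2, n_obs, n_predictors):
    """Overall F statistic and its (model, residual) degrees of freedom."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    explained_per_df = r2 / df_model
    unexplained_per_df = (1 - r2) / df_resid
    return explained_per_df / unexplained_per_df, df_model, df_resid

# Illustrative values: R-squared of 0.65 from 50 observations and 3 predictors.
f_value, df1, df2 = f_statistic(0.65, n_obs=50, n_predictors=3)
print(f"F({df1}, {df2}) = {f_value:.2f}")
```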
Step 5: Check Assumptions and Diagnostics
Regression results are trustworthy only when four key assumptions hold at least approximately. Linearity: the relationship between predictors and outcome is linear (check by plotting residuals against fitted values and looking for curvature). Independence: residuals are not correlated with each other (often violated in time series or clustered data). Homoscedasticity: the variance of residuals is constant across all levels of the predictors (check the same residual plot for a fan or funnel shape). Normality: residuals are approximately normally distributed (check with a Q-Q plot or histogram of residuals); this matters mainly for p-values and confidence intervals in small samples.
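A rough numeric stand-in for the residuals-versus-fitted plot is to compare the spread of residuals at low and high fitted values. The sketch below is an informal heuristic written for this illustration, not a formal test, and the data are made up so that the noise grows with X.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) from ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_square(values):
    return sum(v * v for v in values) / len(values)

def residual_spread_ratio(xs, ys):
    """Mean squared residual in the upper half of fitted values over the lower half."""
    intercept, slope = fit_simple_ols(xs, ys)
    # Pair each fitted value with its residual, then sort by fitted value.
    pairs = sorted(
        (intercept + slope * x, y - (intercept + slope * x))
        for x, y in zip(xs, ys)
    )
    half = len(pairs) // 2
    low = [resid for _, resid in pairs[:half]]
    high = [resid for _, resid in pairs[half:]]
    return mean_square(high) / mean_square(low)

# Made-up data: roughly y = 2x, with noise that grows as x grows.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 11.0, 11.0, 16.0, 14.0]

ratio = residual_spread_ratio(x, y)
print(f"spread ratio (high fitted / low fitted) = {ratio:.1f}")
```

A ratio far from 1 hints at non-constant variance and is a reason to inspect the plot itself.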
When assumptions are violated, remedies include: data transformations to achieve linearity, robust standard errors for heteroscedasticity, generalized least squares for correlated errors, and nonparametric methods when normality fails severely. Influential observations (outliers that substantially change the fitted line) should be investigated, possibly flagged, and reported. Cook's distance and leverage values identify such points.
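Leverage and Cook's distance have simple closed forms when there is a single predictor. The sketch below uses made-up data in which the last point sits far from the others in X and off the trend. The function name is hypothetical, and cutoffs for "large" values (such as Cook's distance above 1 or above 4/n) are rules of thumb rather than fixed standards.

```python
def leverage_and_cooks(xs, ys):
    """Return (leverages, cooks_distances) for a simple linear regression."""
    n = len(xs)
    p = 2  # estimated coefficients: intercept and slope
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    residual_variance = sum(e * e for e in residuals) / (n - p)

    # Leverage: how unusual each X value is relative to the others.
    leverages = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    # Cook's distance: combines residual size and leverage.
    cooks = [
        (e * e / (p * residual_variance)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]
    return leverages, cooks

# Made-up data: nine points on y = 2x + 1, plus one distant point off the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [3, 5, 7, 9, 11, 13, 15, 17, 19, 20]

leverages, cooks = leverage_and_cooks(x, y)
most_influential = cooks.index(max(cooks))
print(f"most influential observation: index {most_influential}")
print(f"its leverage: {leverages[most_influential]:.2f}")
```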
Multiple Regression
When you add more predictors, the model becomes Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk + error. Each coefficient is adjusted for all other variables in the model. This allows you to answer questions like "what is the relationship between exercise and weight, controlling for age, gender, and diet?" The adjusted coefficient for exercise gives its unique contribution after accounting for the effects of the other variables.
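For several predictors, the least-squares coefficients solve the normal equations (X'X)b = X'y. The sketch below solves them with a small hand-written Gaussian elimination so that it runs without third-party packages. In practice a numerical library would normally be used, and all function names here are hypothetical. The data are made up to follow Y = 1 + 2*X1 + 3*X2 exactly, so the fit should recover those coefficients.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_ols(predictor_rows, y):
    """Return [b0, b1, ..., bk] for rows of predictor values and outcomes y."""
    design = [[1.0] + list(row) for row in predictor_rows]  # add intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Made-up data generated from Y = 1 + 2*X1 + 3*X2 with no noise.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [9, 8, 19, 18, 29]

coefficients = fit_ols(rows, y)
print([round(c, 6) for c in coefficients])
```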
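Multicollinearity occurs when predictors are highly correlated with each other (for example, including both height and arm span, or both income and spending). It inflates standard errors, making individual coefficients unstable and hard to interpret, even though the model's overall predictions can remain accurate. In the extreme case of perfectly redundant predictors, such as height in inches and height in centimeters, the coefficients cannot be estimated separately at all. Diagnose multicollinearity with variance inflation factors (VIF), where values above roughly 5 to 10 are commonly treated as a warning sign, and address it by dropping redundant predictors or combining correlated ones.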
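The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. With exactly two predictors, that R² is just their squared Pearson correlation, which keeps the sketch below short. The data are made up and the function name is hypothetical.

```python
def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a model with exactly two predictors."""
    n = len(x1)
    mean_1 = sum(x1) / n
    mean_2 = sum(x2) / n
    s11 = sum((a - mean_1) ** 2 for a in x1)
    s22 = sum((b - mean_2) ** 2 for b in x2)
    s12 = sum((a - mean_1) * (b - mean_2) for a, b in zip(x1, x2))
    r_squared = s12 ** 2 / (s11 * s22)  # squared correlation between predictors
    return 1 / (1 - r_squared)

# Made-up predictors: the second pair is nearly redundant, the first is not.
x1 = [1, 2, 3, 4, 5]
unrelated = [2, 1, 3, 1, 2]
nearly_redundant = [1, 2, 3, 4, 6]

vif_low = vif_two_predictors(x1, unrelated)
vif_high = vif_two_predictors(x1, nearly_redundant)
print(f"VIF with unrelated predictor: {vif_low:.1f}")
print(f"VIF with nearly redundant predictor: {vif_high:.1f}")
```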
Common Regression Pitfalls
Overfitting occurs when a model has too many predictors relative to sample size. The model fits the noise in the training data rather than the underlying signal, producing excellent fit statistics but poor predictions on new data. A useful guideline is to have at least 10 to 20 observations per predictor variable, with more being safer. Cross-validation, where you repeatedly fit the model on subsets of data and test predictions on the held-out portion, helps detect overfitting.
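One simple form of cross-validation is leave-one-out: each observation is held out in turn, the model is refit on the rest, and the held-out point is predicted. The sketch below applies it to a single-predictor model with made-up data; the function names are hypothetical. A held-out error much larger than the in-sample error is the pattern that signals overfitting.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) from ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def training_mse(xs, ys):
    """Mean squared error on the same data used to fit the model."""
    b0, b1 = fit_simple_ols(xs, ys)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def leave_one_out_mse(xs, ys):
    """Mean squared error when each point is predicted by a model fit without it."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        b0, b1 = fit_simple_ols(train_x, train_y)
        squared_errors.append((ys[i] - (b0 + b1 * xs[i])) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Made-up data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [57, 61, 64, 69, 74]

train_error = training_mse(hours, scores)
cv_error = leave_one_out_mse(hours, scores)
print(f"in-sample MSE: {train_error:.3f}")
print(f"leave-one-out MSE: {cv_error:.3f}")
```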
Extrapolation means using the model to predict outcomes for predictor values outside the range of the observed data. A regression of test scores on study hours between 1 and 10 hours cannot reliably predict what happens at 50 hours, because the linear relationship may not hold in that range. Always restrict predictions to the range of your data unless you have strong theoretical reasons to expect the relationship continues.
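One way to make this restriction hard to forget is to build it into the prediction step. The sketch below is a hypothetical helper that refuses to predict outside the observed range; the coefficients and range are illustrative values, not results from real data.

```python
def predict_within_range(x_new, intercept, slope, x_min, x_max):
    """Predict Y from a fitted line, refusing to extrapolate beyond observed X."""
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"x = {x_new} is outside the observed range [{x_min}, {x_max}]"
        )
    return intercept + slope * x_new

# Illustrative fitted line for study hours observed between 1 and 10.
inside = predict_within_range(5, intercept=52.4, slope=4.2, x_min=1, x_max=10)
print(f"predicted score at 5 hours: {inside:.1f}")

try:
    predict_within_range(50, intercept=52.4, slope=4.2, x_min=1, x_max=10)
except ValueError as err:
    print(f"refused: {err}")
```

Raising an error is a strict choice. Where theory supports the relationship continuing, a warning may be more appropriate.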
Logistic Regression
When the outcome is binary (yes/no, survived/died, purchased/did not purchase), linear regression is inappropriate because it can predict values outside the 0-1 range. Logistic regression instead models the log-odds of the event. Its coefficients are on the log-odds scale, so they are exponentiated to obtain odds ratios. An odds ratio of 1.5 for a predictor means each one-unit increase multiplies the odds of the event by 1.5. Logistic regression is widely used in medicine (predicting disease outcomes), marketing (predicting purchases), and other fields with binary outcomes.
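The link between a log-odds coefficient, an odds ratio, and a predicted probability can be checked numerically. In the sketch below the intercept and coefficient are illustrative values chosen so that the odds ratio is 1.5; no model is fitted, and the function names are hypothetical.

```python
import math

def predicted_probability(intercept, coef, x):
    """Convert the linear predictor (log-odds) into a probability."""
    log_odds = intercept + coef * x
    return 1 / (1 + math.exp(-log_odds))

def odds(probability):
    """Odds of the event: probability that it happens over probability that it does not."""
    return probability / (1 - probability)

# Illustrative coefficient on the log-odds scale; exp(coef) is the odds ratio.
intercept = -1.0
coef = math.log(1.5)

p_at_1 = predicted_probability(intercept, coef, 1)
p_at_2 = predicted_probability(intercept, coef, 2)
odds_ratio = odds(p_at_2) / odds(p_at_1)

print(f"exp(coef) = {math.exp(coef):.2f}")
print(f"odds ratio for a one-unit increase = {odds_ratio:.2f}")
print(f"probabilities: {p_at_1:.3f} -> {p_at_2:.3f}")
```

The odds are multiplied by the same factor for every one-unit increase, while the change in probability depends on the starting point.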
Summary
Regression analysis models relationships between variables by fitting lines (or curves) to data. The slope tells you how much Y changes per unit change in X, R-squared tells you how much variance the model explains, and diagnostic checks verify that results are trustworthy. Always check assumptions before trusting coefficients.