Types of Validity in Experiments: Internal, External, Construct, and More
Internal Validity
Internal validity is the extent to which an experiment establishes a trustworthy cause-and-effect relationship between the independent and dependent variables. High internal validity means that changes in the outcome can be confidently attributed to the treatment rather than to confounding factors, methodological flaws, or chance.
Threats to internal validity include history (external events occurring during the experiment that affect the outcome), maturation (participants changing naturally over time), testing effects (participants performing differently on a posttest because of exposure to the pretest), instrumentation (changes in the measurement instrument or observers over time), regression to the mean (extreme scores tending to move toward the average on retest), selection bias (pre-existing differences between groups), attrition (differential dropout between groups), and diffusion of treatment (control participants learning about or accessing the treatment).
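Regression to the mean is the least intuitive threat in this list, so a small simulation may help. The sketch below is illustrative only: the population mean of 100, the standard deviations of 10, the sample size, and the top-10% selection rule are all assumed values, not figures from any study. It gives every simulated participant a stable true score plus independent measurement noise on two occasions, selects the highest pretest scorers, and compares their pretest and posttest means with no treatment applied.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

POPULATION_MEAN = 100.0  # assumed value, for illustration only
N = 5000                 # assumed number of simulated participants

# Each observed score = stable true score + independent measurement noise.
true_scores = [rng.gauss(POPULATION_MEAN, 10) for _ in range(N)]
pretest = [t + rng.gauss(0, 10) for t in true_scores]
posttest = [t + rng.gauss(0, 10) for t in true_scores]

# Select the top 10% of pretest scorers; no treatment is applied to anyone.
cutoff = sorted(pretest)[int(0.9 * N)]
selected = [i for i in range(N) if pretest[i] >= cutoff]

pre_mean = statistics.mean(pretest[i] for i in selected)
post_mean = statistics.mean(posttest[i] for i in selected)

print(f"Selected group pretest mean:  {pre_mean:.1f}")
print(f"Selected group posttest mean: {post_mean:.1f}")
```

The selected group's posttest mean falls back toward the population mean even though nothing was done to them, which is why a study that enrolls only extreme scorers and lacks a control group can mistake this drift for a treatment effect.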
Random assignment is the strongest defense against most threats to internal validity because it distributes both known and unknown confounding variables approximately equally across groups, at least on average and increasingly so as sample size grows. Blinding prevents expectation effects from masquerading as treatment effects. Standardized procedures reduce instrumentation drift. Measuring attrition rates and comparing completers to dropouts helps assess whether differential attrition has biased the results.
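The balancing property of random assignment can be checked with a short simulation. This is a minimal sketch under assumed conditions: the participant pool of 2,000, the age range of 18 to 80, and the helper name `randomly_assign` are all hypothetical. Age stands in for any confounder, measured or not.

```python
import random
import statistics


def randomly_assign(participants, rng):
    """Shuffle a copy of the participants and split it into two equal halves."""
    shuffled = participants[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical confounder: each participant's age in years.
ages = [rng.randint(18, 80) for _ in range(2000)]

treatment, control = randomly_assign(ages, rng)
age_gap = abs(statistics.mean(treatment) - statistics.mean(control))

print(f"Treatment mean age: {statistics.mean(treatment):.2f}")
print(f"Control mean age:   {statistics.mean(control):.2f}")
print(f"Absolute gap:       {age_gap:.2f}")
```

With groups this large the gap in mean age is small relative to the 62-year range. Rerunning with a much smaller pool shows larger chance imbalances, which is why balance on key covariates is still worth checking in small randomized studies.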
Internal validity is typically highest in tightly controlled laboratory experiments, where the researcher can manipulate the independent variable precisely and minimize extraneous influences. Field experiments and quasi-experiments sacrifice some internal validity in exchange for greater realism and practical relevance.
External Validity
External validity, also called generalizability, is the extent to which experimental findings apply to other populations, settings, treatments, and outcomes beyond those studied. A study conducted on American college students in a university laboratory has questionable external validity for elderly Japanese adults in a clinical setting, because the participants, the environment, and the cultural context all differ.
Population validity asks whether results generalize to people who were not in the study. If participants were recruited from a university subject pool, the results may not apply to the general population, whose members differ from students in age, education, socioeconomic status, and motivation. Representative sampling improves population validity, but experiments rarely use random sampling from the population because it is logistically impractical. Instead, researchers replicate findings across diverse samples to build evidence for generalizability.
Ecological validity asks whether results generalize to real-world settings. Laboratory tasks are often simplified, artificial versions of the real-world behaviors they represent. A memory test using lists of unrelated words may not predict how well people remember information encountered in natural contexts. Field experiments improve ecological validity by studying behavior in natural settings, but they sacrifice the control that makes laboratory experiments internally valid.
Temporal validity asks whether results generalize across time. Findings from the 1960s may not hold in the 2020s if social norms, technology, or environmental conditions have changed. Replication over time is the most direct way to establish temporal validity. This is particularly important for social and behavioral research, where cultural shifts can rapidly change the phenomena being studied.
Construct Validity
Construct validity evaluates whether the independent variable manipulation and the dependent variable measurement actually represent the theoretical constructs they are intended to capture. If a study claims to measure "intelligence" using a 10-question vocabulary test, construct validity asks whether a vocabulary test adequately captures the broad, multidimensional concept of intelligence.
Convergent validity, one component of construct validity, is demonstrated when a measure correlates strongly with other measures of the same construct. If your anxiety measure correlates highly with established anxiety scales, it shows convergent validity. Discriminant validity is demonstrated when a measure does not correlate strongly with measures of different constructs. If your anxiety measure correlates about as strongly with depression, happiness, and extraversion scales as it does with established anxiety scales, it lacks discriminant validity and may be measuring general emotional arousal rather than anxiety specifically.
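In practice, convergent and discriminant validity are usually examined by comparing correlations. The following sketch uses made-up scores for eight hypothetical participants; the variable names and every number are invented for illustration, and a real validation study would need far larger samples. It computes Pearson's r by hand so that it runs on any Python 3 version.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented scores for illustration only.
new_anxiety_scale = [10, 12, 15, 18, 20, 25, 27, 30]
established_anxiety_scale = [11, 13, 14, 19, 21, 24, 28, 29]  # same construct
extraversion_scale = [22, 15, 27, 18, 25, 16, 24, 20]         # different construct

r_convergent = pearson_r(new_anxiety_scale, established_anxiety_scale)
r_discriminant = pearson_r(new_anxiety_scale, extraversion_scale)

print(f"Convergent r (vs. established anxiety scale): {r_convergent:.2f}")
print(f"Discriminant r (vs. extraversion scale):      {r_discriminant:.2f}")
```

The pattern that supports construct validity is a high convergent correlation alongside a discriminant correlation near zero. Two correlations of similar size would suggest the new scale is not specific to anxiety.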
Face validity, the simplest form, asks whether a measure looks like it measures what it claims. A math test full of arithmetic problems has high face validity for measuring math ability. However, face validity is the weakest evidence for construct validity because appearances can be deceiving. A test that looks like it measures leadership might actually measure confidence, extraversion, or social desirability.
Content validity asks whether the measure covers all aspects of the construct. A depression scale that asks about sadness and hopelessness but ignores sleep disturbance, appetite changes, and concentration problems has limited content validity because it covers only some symptoms of the construct. Content validity is typically assessed by expert panels, who review the items against the theoretical definition of the construct.
Statistical Conclusion Validity
Statistical conclusion validity concerns whether the statistical analysis correctly identifies the presence or absence of a relationship between variables. It is threatened by low statistical power (a high risk of missing a real effect, often because the sample is small), violated statistical assumptions (for example, using parametric tests on markedly non-normal data without appropriate transformations), inflated Type I error rates (from multiple comparisons without correction), and unreliable measurements (which attenuate effect sizes and reduce power).
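The attenuating effect of unreliable measurement can be made concrete with the standard attenuation formula from classical test theory. The text above does not give this formula, so it is brought in here as an illustration. The symbols are assumptions: r_true is the correlation between the underlying constructs, and r_xx and r_yy are the reliabilities of the two measures. The numbers in the worked line are invented.

```latex
r_{\text{observed}} = r_{\text{true}} \sqrt{r_{xx}\, r_{yy}}

% Worked example with assumed values: r_true = 0.50, r_xx = 0.70, r_yy = 0.80
r_{\text{observed}} = 0.50 \times \sqrt{0.70 \times 0.80}
                    = 0.50 \times \sqrt{0.56}
                    \approx 0.50 \times 0.748
                    \approx 0.37
```

Under these assumed reliabilities, a true correlation of 0.50 would be observed as roughly 0.37, and detecting the smaller observed effect requires a larger sample for the same power.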
Fishing expeditions, where researchers run dozens of statistical tests until one produces a significant result, directly threaten statistical conclusion validity by capitalizing on chance. Pre-registration, where the analysis plan is specified before data collection, is the primary defense against this threat. Sensitivity analyses, which test whether conclusions change under different reasonable analytical choices, provide additional evidence for the robustness of the findings.
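The arithmetic behind the fishing-expedition problem is simple to sketch. Assuming the tests are independent and every null hypothesis is true, the chance of at least one false positive across m tests at level alpha is 1 - (1 - alpha)^m. The function names below are hypothetical, and the Bonferroni correction is shown as one common adjustment, not the only option.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / num_tests


ALPHA = 0.05  # conventional significance level, assumed for illustration

for m in (1, 5, 20, 50):
    uncorrected = familywise_error_rate(ALPHA, m)
    per_test = bonferroni_threshold(ALPHA, m)
    corrected = familywise_error_rate(per_test, m)
    print(f"{m:>2} tests: uncorrected {uncorrected:.3f}, "
          f"Bonferroni per-test level {per_test:.4f}, corrected {corrected:.3f}")
```

With 20 uncorrected tests the chance of at least one spurious "significant" result is about 64%, which is why an unplanned search through many outcomes almost always finds something.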
Validity in Practice: Balancing Trade-offs
In real research, maximizing one type of validity often comes at the expense of another. Laboratory experiments maximize internal validity through tight control over variables, random assignment, and standardized procedures. But the artificial laboratory setting may limit external validity: participants behaving in a controlled room under observation may not behave the same way in their natural environment. Field experiments sacrifice some internal control for greater ecological validity, studying behavior in natural settings where confounds are harder to eliminate but the results are more directly applicable to real-world situations.
Construct validity is threatened whenever the operational definition of a variable does not fully capture the theoretical construct. Measuring intelligence with a single IQ test, measuring depression with a brief screening questionnaire, or measuring classroom engagement with attendance records all involve some gap between the construct as theorized and the construct as measured. Multiple operationalizations, where the same construct is measured in several different ways, strengthen construct validity because the convergence of different measures provides stronger evidence that the underlying construct, rather than the specific measurement method, is responsible for the observed results.
Statistical conclusion validity concerns whether the statistical analyses support the conclusions drawn from the data. Violations of statistical assumptions (non-normality, heteroscedasticity, non-independence of observations), low statistical power, inflated Type I error from multiple testing, and unreliable measurements all threaten statistical conclusion validity. These threats can lead to either false positives (concluding an effect exists when it does not) or false negatives (concluding no effect exists when one does). Careful attention to statistical assumptions, adequate sample sizes, and appropriate correction for multiple comparisons protects statistical conclusion validity.
The practical solution is to acknowledge the trade-offs explicitly and design studies that prioritize the types of validity most relevant to the research question. A treatment efficacy study needs strong internal validity to establish that the treatment causes the observed improvement. A subsequent effectiveness study needs strong external validity to demonstrate that the treatment works in routine clinical practice with diverse patients and real-world conditions. Together, the two types of studies provide complementary evidence that neither could provide alone.
No single study maximizes all types of validity simultaneously. Tight laboratory control improves internal validity but may limit external validity. The goal is to understand which types of validity are strongest and weakest in your design and to interpret results accordingly.