Reproducibility in Experiments: Why Replication Matters
Reproducibility vs. Replicability
These terms are often used interchangeably, but they refer to distinct concepts. Reproducibility means that re-running the same analysis on the same data yields the same results, whether the re-run is done by the original team or by someone else. This is a minimum standard, essentially asking whether the computational and analytical steps are documented well enough to be repeated. Replicability means that a different team, collecting new data using the same methods, obtains consistent findings. This is a stronger standard because it tests whether the original finding reflects a genuine phenomenon rather than a one-time occurrence.
A study can be reproducible but not replicable. If the original analysis code and data are shared and anyone can re-run the analysis to get the same numbers, the study is reproducible. But if a new team collects fresh data and finds no effect, the original finding was not replicable, even though the analysis was reproducible. Both dimensions matter for scientific confidence.
The Replication Crisis
In 2015, the Open Science Collaboration published the results of an ambitious project that attempted to replicate 100 psychology studies published in top journals. Only 36 percent of the replications produced statistically significant results, compared with 97 percent of the original studies. The average effect size in the replications was roughly half the size reported in the original studies. This finding sent shockwaves through the scientific community and triggered serious examination of research practices in many other disciplines.
Similar replication problems have been documented in cancer biology (where the Reproducibility Project: Cancer Biology found that only about 50 percent of high-profile findings could be replicated), economics (where replication rates of influential experimental studies were around 60 percent), and pharmaceutical research (where Amgen scientists reported that only 6 of 53 landmark oncology studies could be confirmed).
Multiple factors contribute to failed replications. Publication bias favors significant results, creating a literature skewed toward positive findings, some of which are false positives. Small sample sizes produce unstable effect estimates that are unlikely to replicate. Flexible analysis practices (researcher degrees of freedom) allow researchers, often unconsciously, to select methods that produce significant results. Institutional incentives reward novel, significant findings over careful replication. Together, these factors create a system where published results are systematically more impressive than the underlying reality.
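The interaction between small samples and a significance filter can be illustrated with a short simulation. The sketch below is not drawn from any of the studies cited here. It assumes a hypothetical true effect of 0.3 standard deviations, 20 participants per group, unit-variance normal outcomes, and a z-test with known variance, all chosen only to keep the example small and free of dependencies.

```python
import random
import statistics


def simulate_published_effects(true_effect=0.3, n_per_group=20,
                               n_studies=2000, seed=42):
    """Simulate many small two-group studies and 'publish' only significant ones.

    Assumptions (illustrative only): outcomes are normal with unit variance,
    and significance is judged with a z-test that treats the variance as known.
    """
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5  # standard error of the difference in means
    all_estimates, published = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        all_estimates.append(diff)
        # Keep only positive results that pass the two-sided 0.05 threshold.
        if diff / se > 1.96:
            published.append(diff)
    return (statistics.fmean(all_estimates),
            statistics.fmean(published),
            len(published) / n_studies)


mean_all, mean_published, share_published = simulate_published_effects()
print(f"mean estimate, all studies:       {mean_all:.2f}")
print(f"mean estimate, published studies: {mean_published:.2f}")
print(f"share of studies published:       {share_published:.1%}")
```

With these settings, a study can only reach significance if its observed difference exceeds 1.96 times the standard error, which is about 0.62. Every "published" estimate is therefore more than double the true effect of 0.3, even though the average across all simulated studies stays close to the truth.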
Making Your Experiments Reproducible
Pre-register your study protocol and analysis plan before collecting data. Pre-registration separates confirmatory analysis (testing pre-specified hypotheses) from exploratory analysis (discovering unexpected patterns). Both are valuable, but they must be labeled honestly. The pre-registration record, stored on platforms like OSF, AsPredicted, or ClinicalTrials.gov, provides public evidence of what was planned versus what was discovered.
Document every procedural detail. The Methods section of a paper should contain enough information for another researcher to replicate the study without contacting the original authors. This includes exact instrument specifications, software versions, stimulus parameters, instructions given to participants (verbatim, ideally), inclusion and exclusion criteria, and any deviations from the original protocol. Supplementary materials, protocol papers, and online repositories can accommodate details that do not fit in the main text.
Share your data and analysis code. Open data allows other researchers to verify your analyses, test alternative analytical approaches, and conduct meta-analyses. Open code eliminates ambiguity about how analyses were conducted. Repositories like GitHub, OSF, Zenodo, and Dryad provide permanent, citable storage for research materials. Data sharing should comply with ethical and privacy requirements, using de-identification or controlled access for sensitive data.
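One small habit that supports verification is publishing a checksum manifest alongside shared data, so that anyone who downloads the files can confirm they are analyzing exactly the same bytes. The following is a minimal sketch under stated assumptions: the function names `sha256_of` and `build_manifest` are hypothetical, and nothing here is required by the repositories named above.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file under data_dir (as a relative POSIX path) to its digest."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

The resulting dictionary can be written out with `json.dump` and deposited next to the data. A reader then rebuilds the manifest locally and compares the two.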
Use adequate sample sizes based on formal power analysis. Underpowered studies produce inflated effect sizes (because only the largest observed effects reach significance) and are unlikely to replicate. A study powered at 0.80 for the expected effect size has an 80 percent chance of detecting an effect of that size if it exists, and its significant results are less prone to overestimating the effect's magnitude.
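To give a feel for what a power analysis produces, here is a rough sketch for the simplest case: a two-sided comparison of two group means. It uses the textbook normal approximation, n per group = 2(z_{1-alpha/2} + z_{power})^2 / d^2, where d is the standardized effect size. The function name `n_per_group` is hypothetical, and exact t-based calculations from a dedicated statistics package give slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample comparison.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    with d the standardized effect size (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Under this approximation, a medium effect (d = 0.5) needs about 63 participants per group and a small effect (d = 0.2) needs about 393. Halving the expected effect size roughly quadruples the required sample.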
Report results completely, including null findings, effect sizes, confidence intervals, and exact p-values. Selective reporting of only significant results creates a misleading impression of the strength and consistency of evidence. Journals that accept registered reports commit to publishing results regardless of significance, removing the incentive for selective reporting.
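As an illustration of complete reporting for a two-group comparison, the sketch below returns the mean difference, a standardized effect size, a confidence interval, and an exact p-value together, so that none of them depends on whether the result is significant. The function name and the simulated data are hypothetical. The interval and p-value use a normal approximation to stay within the standard library, so a t-based method would be more appropriate for small samples.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def summarize_two_groups(treated, control, confidence=0.95):
    """Return mean difference, Cohen's d, a confidence interval, and a p-value.

    Assumption: the interval and p-value rely on a normal approximation,
    which is reasonable only when both groups are fairly large.
    """
    n1, n2 = len(treated), len(control)
    s1, s2 = stdev(treated), stdev(control)
    diff = fmean(treated) - fmean(control)
    # Pooled standard deviation, used to standardize the effect size.
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    # Standard error of the difference, without assuming equal variances.
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    z = NormalDist()
    crit = z.inv_cdf(1 - (1 - confidence) / 2)
    p_value = 2 * (1 - z.cdf(abs(diff / se)))
    return {
        "mean_difference": diff,
        "cohens_d": diff / pooled_sd,
        "ci": (diff - crit * se, diff + crit * se),
        "p_value": p_value,
    }


# Simulated (not real) data, used only to exercise the function.
rng = random.Random(1)
control = [rng.gauss(50, 10) for _ in range(200)]
treated = [rng.gauss(52, 10) for _ in range(200)]
report = summarize_two_groups(treated, control)
print(f"difference = {report['mean_difference']:.2f}, "
      f"d = {report['cohens_d']:.2f}, "
      f"95% CI = [{report['ci'][0]:.2f}, {report['ci'][1]:.2f}], "
      f"p = {report['p_value']:.4f}")
```

The same summary line is printed whether or not the p-value falls below 0.05, which is the point: a null result is reported with the same detail as a significant one.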
Tools and Platforms for Reproducibility
The Open Science Framework (OSF) provides a centralized platform for managing research projects, storing data, hosting pre-registrations, and sharing materials. Researchers can create project pages that link together pre-registrations, data files, analysis scripts, and the resulting publications, giving readers a complete audit trail from hypothesis to conclusion. OSF integrations with GitHub, Dropbox, and Google Drive allow researchers to connect existing workflows rather than rebuilding from scratch.
Containerization tools like Docker and Singularity capture the entire computational environment, including the operating system, software libraries, and specific package versions, needed to reproduce an analysis. A Dockerfile bundled with a research project, with its base image and package versions pinned, lets others rebuild a computing environment that closely matches the one the original team used, which greatly reduces the common problem of analyses breaking because a software package was updated. The Whole Tale platform extends this concept by combining data, code, and the computational environment into a single publishable package.
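Even without a container, an analysis script can record the environment it ran in. The sketch below is a lighter-weight complement to the tools above, not a substitute for them: it collects the Python version, the platform string, and the installed package versions into a dictionary that could be saved with the results. The function name `capture_environment` is hypothetical.

```python
import platform
from importlib import metadata


def capture_environment():
    """Collect interpreter, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing or broken metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


env = capture_environment()
print(f"Python {env['python']} on {env['platform']}, "
      f"{len(env['packages'])} packages recorded")
```

Writing this dictionary to a JSON file next to each set of outputs gives later readers a record of which versions produced which numbers, although it does not capture system libraries the way a container image does.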
Electronic lab notebooks (ELNs) replace traditional paper notebooks with timestamped digital records of experimental observations, instrument readings, and protocol modifications. Platforms like Benchling, LabArchives, and RSpace create searchable, shareable records that make it straightforward for other researchers to review exactly what was done and when. Unlike paper notebooks, ELNs can embed images, attach raw data files, and maintain version histories that make after-the-fact modification visible.
Version control systems like Git track every committed change to analysis code, including who made the change, when it was made, and, through the commit message, why. A well-maintained Git repository provides a complete history of analytical decisions, including approaches that were tried and abandoned. When combined with platforms like GitHub or GitLab, version-controlled analysis code can be made publicly accessible and citable through services like Zenodo, which assigns persistent digital object identifiers (DOIs) to specific code versions.
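One way to tie a set of results to a specific code version is to record the current Git commit whenever an analysis runs. The following sketch assumes the `git` command-line tool is installed and that the script runs inside a repository. The function names are hypothetical, and both functions return `None` when Git or a repository is unavailable.

```python
import subprocess


def _git(args, repo_dir):
    """Run a git command and return its stdout, or None if it cannot run."""
    try:
        result = subprocess.run(
            ["git", *args], cwd=repo_dir,
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return None  # not a repository, no commits yet, or git not installed
    return result.stdout


def current_commit(repo_dir="."):
    """Return the full hash of the checked-out commit, or None."""
    out = _git(["rev-parse", "HEAD"], repo_dir)
    return out.strip() if out is not None else None


def has_uncommitted_changes(repo_dir="."):
    """Return True if the working tree differs from the last commit, else False or None."""
    out = _git(["status", "--porcelain"], repo_dir)
    return bool(out.strip()) if out is not None else None


print("commit:", current_commit())
print("uncommitted changes:", has_uncommitted_changes())
```

Saving both values with each output file makes it clear which commit produced a result, and whether the code had been modified since that commit.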
Institutional and Journal-Level Reforms
Many journals now offer Registered Reports, a publication format where peer review occurs before data collection. Researchers submit their introduction, methods, and analysis plan for review. If the protocol passes peer review, the journal provides in-principle acceptance, committing to publish the results regardless of whether they are significant. This format eliminates publication bias at the journal level and gives researchers freedom to report null results without fear of rejection. Over 300 journals across disciplines now accept Registered Reports.
Funding agencies increasingly require data management and sharing plans as part of grant applications. The National Institutes of Health (NIH) implemented a Data Management and Sharing Policy in 2023 that requires applicants whose research will generate scientific data to submit a data management and sharing plan, and that expects the data to be shared no later than the time of publication or the end of the performance period, whichever comes first. The National Science Foundation (NSF) has required data management plans since 2011. These policies normalize data sharing as a standard part of the research process rather than an optional extra.
Replication studies, once dismissed as unoriginal, are gaining recognition as essential contributions to science. Journals like PLOS ONE explicitly welcome replication studies. The Psychological Science Accelerator coordinates large-scale, multi-site replications across dozens of laboratories worldwide, producing replication estimates that are far more reliable than any single laboratory could achieve. These collective efforts are building a more accurate picture of which findings are robust and which were artifacts of small samples or specific conditions.
Universities are beginning to reform hiring and promotion criteria to value open science practices alongside traditional metrics like publication count and journal impact factor. The San Francisco Declaration on Research Assessment (DORA) encourages institutions to evaluate research on its own merits rather than relying on journal-level metrics. Tenure committees that reward pre-registration, data sharing, and replication studies create incentives that align individual career advancement with the collective goal of reproducible science.
Reproducibility is built into experiments through pre-registration, transparent reporting, adequate power, and open sharing of data and methods. These practices protect against the biases and shortcuts that contributed to the replication crisis.