Research Data Management
Why Data Management Matters
Poor data management is one of the most common and preventable sources of error in research. Files with cryptic names stored in disorganized folders, datasets without documentation of variable definitions, analyses run on outdated versions of data, and backups that do not exist or have never been tested all waste time and introduce errors. In the worst cases, they make it impossible to verify or reproduce published findings.
Major funding agencies including the National Institutes of Health, the National Science Foundation, and the European Research Council now require data management plans as part of grant applications. Many journals require that data supporting published findings be made available to other researchers. These requirements reflect growing recognition that research data are valuable assets that should be managed with the same care applied to other aspects of research methodology.
Step 1: Create a Data Management Plan
A data management plan (DMP) describes how data will be handled throughout the research lifecycle, from collection through analysis to archiving or disposal. The plan should specify what types of data will be generated, what formats and standards will be used, how data will be organized and documented, where data will be stored during and after the project, who will have access, how participant privacy will be protected, and how data will be shared or preserved for future use.
Write the DMP before data collection begins and treat it as a living document that is updated as the project evolves. Tools like DMPTool and DMPonline provide templates aligned with major funders' requirements. Even when a formal DMP is not required, creating one improves the organization and quality of your data management practices.
Step 2: Organize and Name Files Consistently
Establish naming conventions before the first file is created. Good file names are descriptive, consistent, and machine-readable. Include relevant information such as the project name, data type, date, and version number. Avoid spaces and special characters in file names, as these can cause problems across different operating systems and software. Use leading zeros in sequential numbering (01, 02, 03 rather than 1, 2, 3) so that files sort correctly.
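The naming rules above can be enforced in code rather than by memory. The following is a minimal sketch, assuming a hypothetical convention of project_datatype_YYYYMMDD_vNN.ext; the function names and the pattern are illustrative choices, not a standard.

```python
import re
from datetime import date

# Hypothetical convention: project_datatype_YYYYMMDD_vNN.ext
# Only lowercase letters, digits, hyphens, and underscores are allowed.
SAFE_NAME = re.compile(r"^[a-z0-9]+(?:[_-][a-z0-9]+)*\.[a-z0-9]+$")


def build_filename(project: str, data_type: str, collected: date,
                   version: int, ext: str = "csv") -> str:
    """Assemble a file name with a sortable date and a zero-padded version."""
    parts = [project, data_type, collected.strftime("%Y%m%d"), f"v{version:02d}"]
    # Lowercase each part and replace runs of unsafe characters with a hyphen.
    cleaned = [re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-") for part in parts]
    return "_".join(cleaned) + "." + ext.lower().lstrip(".")


def is_safe_filename(name: str) -> bool:
    """Return True if the name contains no spaces or special characters."""
    return SAFE_NAME.fullmatch(name) is not None


print(build_filename("Sleep Study", "survey", date(2024, 3, 5), 2))
print(is_safe_filename("final data (2).csv"))
```

Writing the date as YYYYMMDD and padding the version to two digits means that an alphabetical sort of the folder is also a chronological one.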
Organize files into a logical folder hierarchy that separates raw data from processed data, analysis scripts from output, and documentation from everything else. A raw data folder should be treated as read-only after collection is complete, ensuring that the original data are always available as a reference. All derived datasets should be reproducible from the raw data using documented procedures.
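A folder skeleton like the one described above can be created by a short script so that every project starts the same way. This is a sketch under assumptions: the folder names are hypothetical, and removing write permission behaves differently across operating systems, so treat the locking step as a safeguard against accidents rather than a security control.

```python
import stat
from pathlib import Path

# Hypothetical layout; adapt the folder names to your project.
FOLDERS = ["data/raw", "data/processed", "scripts", "output", "docs"]


def create_project(root: Path) -> None:
    """Create the folder skeleton under root if it does not already exist."""
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)


def lock_raw_data(root: Path) -> int:
    """Remove write permission from every file under data/raw; return the count."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    locked = 0
    for path in (root / "data" / "raw").rglob("*"):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~write_bits)
            locked += 1
    return locked
```

Calling create_project(Path("my_project")) at the start of a project and lock_raw_data(Path("my_project")) once collection is complete keeps the raw files from being edited by mistake.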
Step 3: Document Everything
Documentation is what makes data usable by anyone other than the person who collected them, including your future self. A codebook or data dictionary defines every variable in the dataset, including its name, definition, measurement units, allowable values, and coding scheme. Procedural documentation describes how data were collected, what instruments or equipment were used, and how quality checks were performed. A README file in each project folder provides an overview of the project and explains the folder structure.
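A codebook is most useful when it is machine-readable, because the same definitions can then be used to check the data. The sketch below assumes a hypothetical three-variable codebook stored as a Python dictionary; the variable names, ranges, and codes are invented for illustration.

```python
import csv
import io

# Hypothetical codebook: variable name -> definition, units, and allowable values.
CODEBOOK = {
    "participant_id": {"definition": "Study-assigned ID", "type": int},
    "age": {"definition": "Age at enrollment", "units": "years",
            "type": int, "min": 18, "max": 99},
    "smoker": {"definition": "Current smoking status", "type": int,
               "codes": {0: "no", 1: "yes"}},
}


def validate_rows(rows, codebook):
    """Return a list of human-readable problems found in rows."""
    problems = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for name in row:
            if name not in codebook:
                problems.append(f"line {line_no}: undocumented variable {name!r}")
        for name, spec in codebook.items():
            raw = row.get(name)
            if raw is None or raw == "":
                problems.append(f"line {line_no}: {name} is missing")
                continue
            try:
                value = spec["type"](raw)
            except ValueError:
                problems.append(f"line {line_no}: {name}={raw!r} has the wrong type")
                continue
            if "codes" in spec and value not in spec["codes"]:
                problems.append(f"line {line_no}: {name}={value} is not a defined code")
            too_low = "min" in spec and value < spec["min"]
            too_high = "max" in spec and value > spec["max"]
            if too_low or too_high:
                problems.append(f"line {line_no}: {name}={value} is out of range")
    return problems


sample = "participant_id,age,smoker\n1,34,0\n2,17,1\n3,40,2\n"
for problem in validate_rows(csv.DictReader(io.StringIO(sample)), CODEBOOK):
    print(problem)
```

Running a check like this whenever the dataset changes catches undocumented variables and out-of-range values before they reach an analysis.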
Use standardized metadata formats when available. General-purpose standards such as Dublin Core apply across fields, and discipline-specific standards exist for many areas, such as DDI (Data Documentation Initiative) for social science survey data and EML (Ecological Metadata Language) for ecological data. Standardized metadata improve discoverability and interoperability, making it easier for other researchers to find and use your data.
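As an illustration of what a standardized record looks like, the sketch below builds a small description using the fifteen element names of the Dublin Core Metadata Element Set. The helper function, the "dc:" key prefix in the output, and all field values are assumptions made for this example; repositories usually define their own required fields and serialization.

```python
import json

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def make_dc_record(**fields):
    """Return a dict of Dublin Core fields, rejecting unknown element names."""
    unknown = set(fields) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return {f"dc:{name}": value for name, value in sorted(fields.items())}


# Placeholder values for a hypothetical dataset.
record = make_dc_record(
    title="Example sleep survey, wave 1",
    creator="Example Lab",
    date="2024-03-05",
    format="text/csv",
    rights="CC-BY 4.0",
)
print(json.dumps(record, indent=2))
```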
Step 4: Implement Version Control and Backups
Version control systems like Git track every change to files, recording who made each change and when, along with a commit message explaining why. For analysis code and scripts, version control is essential because it allows you to reproduce any previous version of your analysis and trace the evolution of your analytical decisions. For datasets, version control or systematic versioning practices (such as appending date stamps to modified files) ensure that you always know which version of the data was used for any particular analysis.
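For datasets kept outside a version control system, date-stamped copies can be paired with a checksum log so that each version is identifiable by content as well as by name. The following is a minimal sketch; the VERSIONS.tsv log file and the function names are hypothetical conventions, not part of any tool.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def save_version(source: Path, versions_dir: Path, on: date) -> Path:
    """Copy source under a date-stamped name and log its checksum."""
    versions_dir.mkdir(parents=True, exist_ok=True)
    target = versions_dir / f"{source.stem}_{on:%Y%m%d}{source.suffix}"
    if target.exists():
        # Never overwrite an existing version silently.
        raise FileExistsError(f"{target} already exists")
    shutil.copy2(source, target)
    # Hypothetical log file: one line per version, name and checksum.
    with (versions_dir / "VERSIONS.tsv").open("a", encoding="utf-8") as log:
        log.write(f"{target.name}\t{sha256_of(target)}\n")
    return target
```

Recording the checksum alongside the name means that an analysis can cite the exact file it used, and a later reader can confirm the file has not changed since.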
Back up data regularly to at least two separate locations, ideally including one off-site or cloud-based backup. Test your backups periodically by verifying that you can actually restore files from them. Automated backup systems are preferable to manual procedures because they do not depend on someone remembering to run the backup.
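Testing a backup can be as simple as restoring it to a scratch folder and comparing checksums against the originals. The sketch below assumes both copies are reachable as local folders; the function names are illustrative, and very large files would be better hashed in chunks than read whole.

```python
import hashlib
from pathlib import Path


def checksums(root: Path) -> dict:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_backup(original: Path, restored: Path) -> dict:
    """Report files that are missing from, or differ in, the restored copy."""
    source, copy = checksums(original), checksums(restored)
    shared = source.keys() & copy.keys()
    return {
        "missing": sorted(set(source) - set(copy)),
        "changed": sorted(name for name in shared if source[name] != copy[name]),
    }
```

An empty "missing" list and an empty "changed" list indicate that every original file was restored with identical content.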
Step 5: Share and Preserve Data
Sharing research data enables verification of published findings, facilitates new analyses and discoveries, reduces wasteful duplication of data collection, and fulfills obligations to funders and the public. Before sharing, de-identify data by removing or masking personal identifiers, and assess whether any residual re-identification risk exists. Choose a trusted data repository appropriate to your discipline, such as Dryad, Figshare, ICPSR, or a domain-specific repository.
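The de-identification step can be sketched in code. The example below drops direct identifiers, replaces the participant ID with a keyed hash, and reports the size of the rarest combination of quasi-identifiers, which is one common way (k-anonymity) to gauge residual re-identification risk. The column names are hypothetical, and a small group size is only a warning sign, not a complete risk assessment.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical column names for this example.
DIRECT_IDENTIFIERS = {"name", "email"}
QUASI_IDENTIFIERS = ("age_band", "postcode_area")


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an ID with a keyed hash so records stay linkable but unreadable."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def deidentify(rows, secret_key: bytes):
    """Drop direct identifiers and pseudonymize participant_id in each row."""
    cleaned = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        kept["participant_id"] = pseudonymize(str(row["participant_id"]), secret_key)
        cleaned.append(kept)
    return cleaned


def smallest_group(rows, quasi_identifiers=QUASI_IDENTIFIERS) -> int:
    """Return the size of the rarest quasi-identifier combination (0 if no rows)."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values(), default=0)
```

A result of 1 from smallest_group means at least one participant is unique on the listed attributes, which suggests coarsening those variables or restricting access before deposit. The secret key should be stored separately from the shared data, since anyone holding it can recompute the pseudonyms.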
Assign a persistent identifier (such as a DOI) to deposited datasets so that they can be cited and tracked. Apply a clear license (such as Creative Commons CC-BY) that specifies how others may use the data. Include comprehensive documentation with the deposit so that users can understand and work with the data without needing to contact you.
Data Governance and Compliance
Data governance encompasses the policies, procedures, and responsibilities that ensure research data are managed consistently and in compliance with legal, ethical, and institutional requirements. Institutions typically have data governance committees or offices that establish standards for data classification, access control, retention, and disposal. Researchers must understand where their data fall on the sensitivity spectrum, from publicly available survey responses that pose minimal risk to identifiable health records protected by laws like HIPAA in the United States or the GDPR in the European Union.
Compliance with data protection regulations requires attention from the earliest stages of research design. Data protection impact assessments, required under GDPR when processing personal data is likely to result in a high risk to individuals' rights and freedoms, evaluate the risks of data processing activities and identify measures to mitigate those risks. Consent forms must accurately describe how data will be used, stored, shared, and eventually destroyed. International collaborations face additional complexity because data transferred across borders may be subject to the regulations of multiple jurisdictions simultaneously.
Data archiving and sharing obligations are becoming standard requirements for funded research. Beyond requiring data management plans with grant applications, many funders now require that research data be deposited in approved repositories upon study completion. Preparing data for archiving requires thorough documentation, de-identification of sensitive information, application of appropriate access restrictions, and selection of repositories that provide long-term preservation and discoverability. Researchers who plan for archiving from the beginning save substantial effort compared to those who attempt to prepare data for sharing after the project has ended.
Version control for data and analysis scripts is an often-overlooked aspect of research data management. Just as software developers use version control systems like Git to track changes to code, researchers benefit from tracking changes to datasets and analysis scripts so that any result can be reproduced from the specific version of data and code that generated it. This practice is especially important when data cleaning decisions are made iteratively or when multiple team members are working with the same dataset.
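One way to tie a result to the data and code that produced it is to write a small provenance file next to each output. The sketch below is an assumption about how this might be done: the ".provenance.json" sidecar name and the record fields are hypothetical, and the helper that asks Git for the current commit falls back to "unknown" when Git is unavailable or the folder is not a repository.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def current_commit() -> str:
    """Return the current Git commit hash, or 'unknown' outside a repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def provenance_record(data_file: Path, commit: str) -> dict:
    """Describe the exact data and code version behind a result."""
    return {
        "data_file": data_file.name,
        "data_sha256": hashlib.sha256(data_file.read_bytes()).hexdigest(),
        "code_commit": commit,
    }


def write_provenance(output_file: Path, data_file: Path, commit: str) -> Path:
    """Write the record to a sidecar file named after the output."""
    sidecar = output_file.with_name(output_file.name + ".provenance.json")
    record = provenance_record(data_file, commit)
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

An analysis script could call write_provenance(Path("output/table1.csv"), Path("data/processed/survey.csv"), current_commit()) as its last step, so that every table or figure carries a record of its inputs.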
Research data management is not an afterthought but an integral part of the research process. Planning your data management before collection begins, maintaining consistent organization and documentation throughout, and preparing data for sharing and preservation at the end of the project all contribute to the credibility, reproducibility, and impact of your research.