
Your Dataset Deserves Good Documentation

Datasets cannot speak for themselves. What matters is how you describe your research data and methodology. Clear documentation ensures that the data can be interpreted correctly, both by you in the future and by others. What’s more, it serves as evidence of the quality and reproducibility of your data. Perhaps the easiest way to get started with data documentation is a README file.

README files are nothing new: you may have already seen many of them in the top-level directory of software packages. Such a file generally includes installation prerequisites and instructions, known issues, and licenses, among many other things that the creator wants users to pay attention to. Because the information in a README file is so important, GitHub automatically picks up a repository's README file and presents it prominently on the main project web page. README files are not only for computer programs; they can be created for datasets and research projects in all disciplines for similar purposes.

README files are usually written in plain text (.txt) or in Markdown (.md), which allows for some light formatting. The file should be placed at the top level of your dataset folder so that readers see it immediately. A README file should cover the following documentation areas:

  • General information, such as the title and creator of the dataset, date and geographic location of data collection, acknowledgements, contact information, and a brief description of the data it contains.
  • Sharing and access information, such as licenses and restrictions, links to publications, projects, and other datasets that use or cite the dataset, and a recommended citation for the data.
  • Data and file overview, such as a list of files in the dataset with brief descriptions and their relationships, a changelog, the file naming convention, and data/file formats.
  • Methodological information, such as a description of the methods used for data collection and processing, software requirements, experimental conditions, and quality control criteria for the data.
  • Data-specific information, such as the number of variables, a list of variables, abbreviations and symbols, units of measurement, and notes on missing data.
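
As one possible starting point, the five areas above can be turned into a plain-text README skeleton with a short script. The sketch below is only an illustration: the function names (build_readme, write_readme), the section headings, and the individual prompts are assumptions made for this example, not a prescribed standard, so adapt them to your discipline and dataset.

```python
from pathlib import Path

# Hypothetical template: headings mirror the five documentation areas
# listed above; each prompt is a placeholder to fill in by hand.
SECTIONS = {
    "GENERAL INFORMATION": [
        "Title of dataset:",
        "Creator(s):",
        "Date of data collection:",
        "Geographic location of data collection:",
        "Acknowledgements:",
        "Contact information:",
        "Brief description of the data:",
    ],
    "SHARING AND ACCESS INFORMATION": [
        "Licenses and restrictions:",
        "Links to related publications, projects, and datasets:",
        "Recommended citation:",
    ],
    "DATA AND FILE OVERVIEW": [
        "File list (name, description, relationships):",
        "Changelog:",
        "File naming convention:",
        "Data/file formats:",
    ],
    "METHODOLOGICAL INFORMATION": [
        "Methods for data collection and processing:",
        "Software requirements:",
        "Experimental conditions:",
        "Quality control criteria:",
    ],
    "DATA-SPECIFIC INFORMATION": [
        "Number of variables:",
        "Variable list:",
        "Abbreviations and symbols:",
        "Units of measurement:",
        "Notes on missing data:",
    ],
}


def build_readme(sections=SECTIONS):
    """Return the README skeleton as a single plain-text string."""
    lines = []
    for heading, prompts in sections.items():
        lines.append(heading)
        lines.append("-" * len(heading))  # underline the heading
        lines.extend(prompts)
        lines.append("")  # blank line between sections
    return "\n".join(lines)


def write_readme(dataset_dir, filename="README.txt"):
    """Write the skeleton to the top level of dataset_dir.

    Refuses to overwrite an existing README so that documentation
    already written is never lost.
    """
    target = Path(dataset_dir) / filename
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    target.write_text(build_readme(), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Preview the skeleton on screen; call write_readme("path/to/dataset")
    # to save it into a dataset folder instead.
    print(build_readme())
```

Writing the file at the top level of the dataset folder follows the placement advice above, and passing filename="README.md" would produce a Markdown-named file with the same content.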

You may explore the following sites for more examples:

— By Jennifer Gu, Library


published March 31, 2021
last modified March 11, 2022