Research data and datasets are becoming increasingly important in the scientific process. As more researchers make their data open and reusable, proper citation is crucial to give credit to the authors and acknowledge the data origin.
Key Elements in a Dataset Citation
Although there is currently no universal standard for citing datasets, a dataset citation should in principle include the following elements so that it is both human-readable and machine-actionable:
- Author(s)
- Title
- Year
- Data repository (or distributor)
- Version (if any)
- Globally unique identifier
Major data repositories usually standardize dataset citations to make it easier for researchers to publish their data and get credit. For example, DataSpace@HKUST, our university’s data repository, automatically generates a citation when a new dataset or version is published. The citation includes a Digital Object Identifier (DOI) URL, which serves as the dataset homepage. Because a DOI is designed to persist, it will not change even if the webpage URL changes in the future.
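As a rough illustration of how the elements listed above fit together, here is a minimal Python sketch that assembles them into a single citation string. The class name `DatasetCitation`, the method `to_text`, the APA-like layout, and all sample values (including the DOI) are assumptions made for this example. They are not the format that DataSpace@HKUST or any other repository actually generates. The only fixed convention used is that a DOI becomes a URL when prefixed with `https://doi.org/`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    """Holds the key elements of a dataset citation (hypothetical helper)."""
    authors: List[str]
    title: str
    year: int
    repository: str
    doi: str                       # globally unique identifier
    version: Optional[str] = None  # only included when the dataset has one

    def to_text(self) -> str:
        """Join the elements into one APA-like string (illustrative layout only)."""
        author_part = "; ".join(self.authors)
        version_part = f" (Version {self.version})" if self.version else ""
        return (
            f"{author_part} ({self.year}). {self.title}{version_part} [Data set]. "
            f"{self.repository}. https://doi.org/{self.doi}"
        )


# Placeholder values, not a real dataset or a real DOI.
citation = DatasetCitation(
    authors=["Chan, T. M.", "Lee, K."],
    title="Example Air Quality Measurements",
    year=2023,
    repository="Example Data Repository",
    doi="10.1234/example.5678",
    version="2",
)
print(citation.to_text())
```

In practice, it is usually safer to copy the citation that the repository generates than to build one by hand, since the repository knows its own preferred format.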
When to Cite a Dataset?
One question we often receive is when to cite a dataset and when to cite the publication that presents the data. The American Psychological Association Publication Manual, also known as APA Style, offers useful guidance on this. Generally, you are expected to cite a dataset when you have either conducted secondary analyses of publicly archived data or archived your own data that are being presented for the first time in your current work (for example, these authors cite their own dataset in their paper). However, if you are simply citing existing data or statistics, it’s best to cite the publication in which the data were originally published (e.g. a journal article or report, if available) rather than the data itself.
In addition to dataset citation, including a “data availability statement” in your publication is another way to ensure that your data sources are appropriately acknowledged. A data availability statement is a brief statement that describes how others can access the data underlying your study. See our previous blog post for more information on this topic.
Dataset DOI Versioning and Its Citations
Datasets are often revised or updated, which can affect their citation. When citing a dataset with multiple versions, it’s important to give the publication year of the version you used and to include the version number in your citation. This tells others exactly which version you used, which supports data validation and reproducibility.
Regarding DOIs, most data repositories keep the same DOI for all versions of a dataset. However, the Zenodo data repository supports DOI versioning. This means that it generates one DOI per version of the dataset and also assigns a Concept DOI that represents all versions of the record. All per-version DOIs are semantically linked to this Concept DOI.
This DOI versioning feature is particularly useful for those who need to cite a specific version of a dataset or all versions as a concept. If you find this feature useful for your work, consider exploring the Zenodo data repository for your datasets.
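The relationship between a Concept DOI and its per-version DOIs can be sketched as a small data model. This is only an illustration of the idea described above, not Zenodo's API: the class `VersionedRecord`, its methods, and every DOI shown are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VersionedRecord:
    """One dataset record: a Concept DOI plus one DOI per version (hypothetical model)."""
    concept_doi: str
    version_dois: Dict[str, str] = field(default_factory=dict)

    def add_version(self, version: str, doi: str) -> None:
        """Register the DOI minted for a specific version."""
        self.version_dois[version] = doi

    def doi_to_cite(self, version: Optional[str] = None) -> str:
        """Return the per-version DOI, or the Concept DOI when no version is given."""
        if version is None:
            return self.concept_doi
        return self.version_dois[version]


# Placeholder identifiers, not real DOIs.
record = VersionedRecord(concept_doi="10.1234/example.100")
record.add_version("1.0", "10.1234/example.101")
record.add_version("1.1", "10.1234/example.102")

print(record.doi_to_cite("1.0"))  # cite exactly the version that was analysed
print(record.doi_to_cite())       # cite the dataset as a whole, across versions
```

For reproducibility, the per-version DOI is the one to cite for the data you actually analysed. The Concept DOI suits references to the dataset in general.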
– By Jennifer Gu, Library
Category: Research Data Management Tips
Tags: Data Repository, dataset citation, dataset versioning, DOI, research data, Zenodo
published March 24, 2023