
Dataset Citation – When and How

Research data and datasets are becoming increasingly important in the scientific process. As more researchers make their data open and reusable, proper citation is crucial to give credit to the authors and acknowledge the data origin.

Key Elements in a Dataset Citation

While currently there are no universal standards for citing datasets, in principle a dataset citation should include the following elements to provide sufficient specifications that are both human understandable and machine-actionable:

  • Author(s)
  • Title
  • Year
  • Data repository (or distributor)
  • Version (if any)
  • Globally unique identifier
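As a rough illustration, the elements above can be held in a small record and joined into a citation string. This is a minimal sketch, not an official style: the class name `DatasetCitation`, the function `format_citation`, the element order, and all values in the example record are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    """The six citation elements listed above (hypothetical structure)."""
    authors: List[str]
    title: str
    year: int
    repository: str
    identifier: str                # globally unique identifier, e.g. a DOI URL
    version: Optional[str] = None  # left as None when the dataset has no version


def format_citation(c: DatasetCitation) -> str:
    """Join the elements in one illustrative order; not an official style."""
    authors = "; ".join(c.authors)
    version = f" (Version {c.version})" if c.version else ""
    return (
        f"{authors} ({c.year}). {c.title}{version} [Data set]. "
        f"{c.repository}. {c.identifier}"
    )


# Placeholder record; none of these values refer to a real dataset.
example = DatasetCitation(
    authors=["Chan, T. M.", "Lee, K."],
    title="Example air quality measurements",
    year=2023,
    repository="Example Data Repository",
    identifier="https://doi.org/10.1234/example",
    version="2",
)
print(format_citation(example))
```

In practice, follow the format recommended by the repository or by the style guide your publisher requires; the point of the sketch is only that each element has a defined place and the version is included when one exists.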

Major data repositories usually standardize the citation of datasets to make it easier for researchers to publish their data and get credit. For example, DataSpace@HKUST, our university’s data repository, automatically generates a citation when a new dataset or a new version is published. The citation includes a Digital Object Identifier (DOI) URL, which resolves to the dataset homepage. Since a DOI is designed to persist, it won’t change even if the webpage URL changes in future.
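A DOI is usually written in a citation as a URL under the `https://doi.org/` resolver. The sketch below shows one way to normalise the common ways a DOI is written into that URL form. The function name `doi_to_url` and the DOI `10.1234/example` are placeholders for illustration, and the list of accepted prefixes is an assumption rather than a complete specification.

```python
def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ URL for a DOI given in a few common forms."""
    doi = doi.strip()
    # Prefixes people commonly write in front of the bare DOI (assumed list).
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return "https://doi.org/" + doi


# Placeholder DOI, not a real dataset.
print(doi_to_url("doi:10.1234/example"))
```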

DataSpace@HKUST citation record

Recommended citation format will be generated automatically in DataSpace@HKUST. (Example)

When to Cite a Dataset?

One question we often receive is when to cite a dataset versus when to cite the publication that presents the data. The American Psychological Association Publication Manual, also known as APA Style, has useful guidance on this. Generally, you are expected to cite a dataset when you have either conducted secondary analyses of publicly archived data or archived your own data that are being presented for the first time in your current work (for example, these authors cite their own dataset in their paper). However, if you are simply citing existing data or statistics, it’s best to cite the publication in which the data were originally published (e.g. a journal article or report, if available) rather than the data itself.

In addition to dataset citation, including a “data availability statement” in your publication is another way to ensure that your data sources are appropriately acknowledged. A data availability statement is a brief statement that describes how others can access the data underlying your study. You may go to our previous blogpost for more information on this topic.

Dataset DOI Versioning and Its Citations

Datasets are often revised or updated, which can affect their citation. When citing a dataset with multiple versions, give the publication year of the version you used and include its version number in your citation. This tells others exactly which version you used, which supports data validation and reproducibility.

Regarding DOIs, many data repositories keep the same DOI for all versions of a dataset. The Zenodo data repository, however, supports DOI versioning: it generates one DOI per version of the dataset and also assigns a Concept DOI that represents all versions of the record. All per-version DOIs are semantically linked to this Concept DOI.
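The relationship between a Concept DOI and its per-version DOIs can be pictured as a simple lookup. The sketch below is a toy model only: it does not call Zenodo or reproduce its API, and the names `VersionedRecord` and `doi_to_cite`, along with every DOI shown, are placeholders invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VersionedRecord:
    """One Concept DOI plus one DOI per version (hypothetical structure)."""
    concept_doi: str
    version_dois: Dict[str, str] = field(default_factory=dict)


def doi_to_cite(record: VersionedRecord, version: Optional[str] = None) -> str:
    """Return the version DOI for an exact version, else the Concept DOI."""
    if version is None:
        # Citing the dataset as a whole, across all versions.
        return record.concept_doi
    if version not in record.version_dois:
        raise ValueError(f"unknown version: {version!r}")
    return record.version_dois[version]


# Placeholder DOIs; they do not point to real records.
record = VersionedRecord(
    concept_doi="10.1234/example.100",
    version_dois={"1.0": "10.1234/example.101", "1.1": "10.1234/example.102"},
)
print(doi_to_cite(record, "1.1"))  # the exact version used in an analysis
print(doi_to_cite(record))         # the dataset as a whole
```

For reproducibility, the per-version DOI is the safer choice whenever results depend on a specific release of the data; the Concept DOI suits general references to the dataset.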

Zenodo DOI versioning

Zenodo supports DOI versioning, which lets you cite the DOI of an exact version or of all versions as a concept. (Example)

This DOI versioning feature is particularly useful for those who need to cite a specific version of a dataset or all versions as a concept. If you find this feature useful for your work, consider exploring the Zenodo data repository for your datasets.

– By Jennifer Gu, Library


published March 24, 2023