Nature Research Academies: Sharing Sensitive Data (Nov 2020)

Research Bridge

Researchers' Series

Last week, Dr. Rebecca Grant from Nature Research Academies held a very informative webinar for HKUST researchers on managing sensitive research data. Here we revisit a few important tips.

In this 2-hour webinar, held on November 11, 2020, Rebecca discussed what makes research data sensitive, things you should consider when sharing sensitive data, and some practical techniques for ensuring that sensitive data can be shared.

1. Why we should know how to share sensitive data

Researchers should increasingly expect to receive data sharing requests from journal editors, reviewers and readers. It is therefore important that we can recognize whether our data is sensitive, and that we know how to prepare it for sharing with risks minimized. Quoting Rebecca:

"You can share your data safely if you understand the risks, the rules and the legislation that applies to you, and if you know how to prepare your data to protect research participants and other sensitive elements"

Sharing research data contributes to the public good, increases the reproducibility of your research, and can bring in more citations. These benefits also apply to sensitive data. In addition, some sensitive data are difficult to collect; for example, data about vulnerable cohorts or rare diseases. In these cases, sharing data for reuse is even more valuable.

2. What to consider when sharing sensitive data

Rebecca's talk focused on data about human subjects. She showed us how to recognize direct identifiers and indirect identifiers. A direct identifier is information that clearly identifies an individual on its own (e.g. a fingerprint or a name). An indirect identifier is information that identifies an individual only when combined with other available information (e.g. an uncommon job title, a workplace and an ethnicity, taken together, may easily point to one particular person).

When thinking about data sharing, you should consider:

can a person be identified by combining the indirect identifiers, or combining your dataset with another dataset?
are there ethical or legal reasons that could lead to participants facing harm or discrimination?
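
The first question can be checked roughly by counting how many participants share each combination of indirect identifiers. The following is a minimal sketch, not something shown in the webinar: the function name `risky_combinations`, the `min_group_size` threshold and the sample records are all made up for illustration, and a combination that looks safe here may still be identifiable once your dataset is linked with another one.

```python
from collections import Counter


def risky_combinations(records, indirect_identifiers, min_group_size=2):
    """Return the combinations of indirect-identifier values that are shared
    by fewer than `min_group_size` participants, with their counts."""
    counts = Counter(
        tuple(record[field] for field in indirect_identifiers)
        for record in records
    )
    return {combo: n for combo, n in counts.items() if n < min_group_size}


# Made-up records, for illustration only.
participants = [
    {"job_title": "Lecturer", "workplace": "Campus A", "ethnicity": "X"},
    {"job_title": "Lecturer", "workplace": "Campus A", "ethnicity": "X"},
    {"job_title": "Glassblower", "workplace": "Campus A", "ethnicity": "Y"},
]

risky = risky_combinations(
    participants, ["job_title", "workplace", "ethnicity"]
)
print(risky)
```

Any combination that is returned belongs to a group smaller than the threshold, so those records are candidates for the preparation techniques described in the next section.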

3. How to protect sensitive data

You should start thinking about sharing your sensitive data before you begin your research. You need to obtain permission, prepare your dataset, and consider proper access control.

Getting permission

you need ethical clearance from your institutional review board. At HKUST, you can consult the review process for research practices
you must get permission from participants; make sure they understand how you will process the data and how you will share them. It is not ethical to share without permission, even if you think that they are not individually identifiable
do not plan to collect sensitive data if it is not required for your research
keep a record of the consent you obtain. Consult the consent forms regularly to ensure your purposes have not changed

Preparing the data for sharing

There are various ways to reduce the risks of individuals being identifiable:

aggregate your data to the point where individuals are not recognizable
remove personal identifiers from the data using anonymization or de-identification

Anonymization irreversibly prevents the possibility of future re-identification, even by the original researchers. De-identification can be reversed, allowing re-identification. Techniques to de-identify include:

Removal: eliminate the identifier from the dataset
Pseudonymisation: replace identifiers with other values
Generalization: make the data less precise, e.g. by using age ranges, salary bands, regions or city names instead of addresses
Character masking: e.g. display an email address as ****@ust.hk
Swapping: mix values relating to a variable among participants, so that one individual cannot be identified through the data
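
To make these techniques concrete, here is a minimal Python sketch of each one applied to simple values. It is an illustration under stated assumptions rather than a recommended implementation: every function name, the `P001`-style codes and the ten-year age bands are hypothetical choices, and real datasets usually need more careful handling.

```python
import random


def remove_fields(record, fields):
    """Removal: drop the named identifier fields from one record."""
    return {key: value for key, value in record.items() if key not in fields}


def pseudonymise(values):
    """Pseudonymisation: replace each distinct identifier with a code.
    The returned key table allows re-identification, so it must be
    stored separately and securely (or destroyed)."""
    key = {}
    codes = []
    for value in values:
        if value not in key:
            key[value] = f"P{len(key) + 1:03d}"
        codes.append(key[value])
    return codes, key


def generalise_age(age, width=10):
    """Generalization: report an age band instead of an exact age."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def mask_email(email):
    """Character masking: hide the part of the address before the @."""
    _, _, domain = email.partition("@")
    return "****@" + domain


def swap_values(values, seed=None):
    """Swapping: shuffle one variable's values among participants.
    The set of values is unchanged, but a value no longer reliably
    belongs to its original participant."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


record = {"name": "Chan Tai Man", "age": 37, "email": "ctm@ust.hk"}
print(remove_fields(record, {"name"}))
print(generalise_age(record["age"]))
print(mask_email(record["email"]))
```

Note that a random shuffle can leave some values in their original position, so swapping on a small dataset gives weaker protection than it may appear to.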

Controlling access

Some data are intrinsically difficult or even impossible to anonymize; e.g., images, audio recordings, interviews of specific people. In such cases you may need to control how the data are shared. You can:

use controlled access repositories; they make datasets findable but not openly accessible. Users have to apply for access
apply embargoes; e.g., data is not disclosed until X years later, or after the death of participants

4. Making decisions

You need to make a lot of choices when managing sensitive research data. Rebecca raised some key issues throughout the talk:

Choosing how to de-identify data: any level of de-identification unavoidably reduces the usefulness of the data. You should balance the usefulness of the shared dataset and the protection for your participants. It is also very important to document your decisions and process.
Considering whether the shared dataset is still good for research validation or reproduction: removing identifiers or aggregating variables inevitably affects the usefulness of the data, so you need to consider the effect of the techniques you choose. For example, if you use swapping, your whole dataset may still be validated by other researchers while participants' identities are protected.
Protecting participants is of utmost importance if researchers are to maintain trust. If you have concerns, perhaps you should not share the data.

Managing and sharing sensitive research data is a delicate balancing act. We all learned a lot from Rebecca through this well-paced talk.

Published

18 Nov 2020