Managing Risks Associated with Linking De-identified Datasets

An article by Brian Rasquinha, Associate Director, Solution Architecture, Privacy Analytics

When linking datasets from different sources to build out a rich and detailed data lake, some less obvious privacy and compliance impacts can arise, even if the datasets being linked are de-identified.

As explored in a previous article, when two or more de-identified datasets are combined, the risk that an anticipated recipient can identify an individual in the resulting dataset, alone or in combination with other reasonably available information, can increase. Here, we explore some approaches that data custodians can use to mitigate the re-identification risk and ensure the privacy of individuals in the dataset.

Using a Data Sharing Agreement to Mitigate Unanticipated Linking

In some situations, a data custodian is sharing a de-identified dataset with a recipient and needs to ensure that the recipient doesn’t compromise the privacy of the individuals in the dataset. If the recipient were to perform further linking in a way the original custodian did not anticipate or account for, re-identification risks could be introduced. To preserve as much utility as possible in the de-identified dataset, a data custodian should consider contractually limiting, in a data sharing agreement, the scenarios under which the recipient can link the de-identified dataset with other data.

In the case of a public release, data custodians must recognize that linking against public or non-public datasets cannot be mitigated by contractual controls, because no agreement binds the unknown recipients of publicly released data.

Linking Datasets and Assessing Re-identification Risk

In other situations, a data custodian may want to link two or more de-identified datasets. As a first step, the custodian must establish the appropriate permissions and authority to perform this linking. Assuming the appropriate authority is in place, special policy and technical considerations are required: because re-identification risk can increase when two de-identified datasets are linked, the result after linking may no longer be considered de-identified. If the process of linking pushes re-identification risk above the permissible threshold, the linking could be in breach of privacy regulations and policy.
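To make this concrete, the following minimal sketch (in Python with pandas) shows how joining two de-identified tables can shrink the smallest equivalence class, using k-anonymity as an illustrative risk measure. The tables, column names, and token values are hypothetical and do not reflect any particular methodology.

    # Hypothetical illustration: linking can reduce k-anonymity.
    import pandas as pd

    def min_k(df, quasi_identifiers):
        # Smallest equivalence-class size over the given quasi-identifiers.
        return int(df.groupby(quasi_identifiers).size().min())

    # Two de-identified datasets sharing a tokenized individual ID.
    claims = pd.DataFrame({
        "token": ["a1", "a2", "a3", "a4"],
        "age_band": ["30-39", "30-39", "40-49", "40-49"],
        "region": ["East", "East", "West", "West"],
    })
    labs = pd.DataFrame({
        "token": ["a1", "a2", "a3", "a4"],
        "test": ["HbA1c", "Lipid", "HbA1c", "Lipid"],
    })

    print(min_k(claims, ["age_band", "region"]))          # k = 2 before linking
    linked = claims.merge(labs, on="token")
    print(min_k(linked, ["age_band", "region", "test"]))  # k = 1 after linking

Each record in the linked table is now unique on its quasi-identifiers, so a recipient holding overlapping background information has a stronger foothold for re-identification than with either dataset alone.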

A privacy expert can determine the re-identification risks associated with the linked dataset and recommend data transformations or stronger protective measures on the data release context (e.g., security or contractual controls) as needed to manage the re-identification risk of the resulting dataset.
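As a simple illustration of one such transformation, the sketch below generalizes quasi-identifiers: exact ages are coarsened into ten-year bands and five-digit ZIP codes into three-digit prefixes. The column names and band widths are hypothetical examples, not prescribed thresholds.

    # Hypothetical illustration: generalization enlarges equivalence classes.
    import pandas as pd

    df = pd.DataFrame({
        "age": [31, 34, 42, 47],
        "zip": ["30301", "30302", "30310", "30312"],
    })

    df["age_band"] = (df["age"] // 10 * 10).astype(str) + "s"  # 31 -> "30s"
    df["zip3"] = df["zip"].str[:3]                             # "30301" -> "303"

    # Every record now shares its (age_band, zip3) combination with at least
    # one other record, instead of being unique on (age, zip).
    print(df.groupby(["age_band", "zip3"]).size())

Which fields to generalize, and how far, depends on the release context; the point is that linked data can often be brought back under the permissible risk threshold without discarding it.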

Choosing Individual IDs and Tokenization Carefully

Most de-identified datasets incorporate some form of individual ID to uniquely distinguish an individual within the dataset. Increasingly, these de-identified individual IDs are assigned using third-party tokenization engines to enable potential linking with other assets. Careful selection of the engines and secret keys used to assign de-identified IDs gives a custodian more control over whether de-identified data can be linked directly. Our next article will explore definitions and considerations for tokenization in more detail.
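As a rough illustration of why the engine and key matter, the sketch below uses keyed hashing (HMAC-SHA256) as a stand-in for a tokenization engine; real engines add key management, salting, and governance, and the identifiers and keys shown are hypothetical. Tokens assigned under the same key match across datasets and therefore support linking, while tokens assigned under different keys do not.

    # Hypothetical illustration: keyed tokenization controls linkability.
    import hmac, hashlib

    def tokenize(identifier: str, secret_key: bytes) -> str:
        # Deterministic token: same identifier + same key -> same token.
        return hmac.new(secret_key, identifier.encode(), hashlib.sha256).hexdigest()[:16]

    key_a = b"engine-A-secret"
    key_b = b"engine-B-secret"

    # Same engine and key: tokens match, so datasets can be joined on them.
    print(tokenize("patient-123", key_a) == tokenize("patient-123", key_a))  # True

    # Different keys: the same individual receives unlinkable tokens.
    print(tokenize("patient-123", key_a) == tokenize("patient-123", key_b))  # False

Sharing or withholding the key (or using a distinct key per recipient) is therefore one practical lever for deciding which parties can link a custodian's de-identified assets.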

The privacy impacts of linking de-identified datasets can be managed effectively when they are well-understood and carefully handled. Contact the experts at Privacy Analytics to learn more or discuss data linkages your organization is pursuing.
