Understanding Re-identification Risk when Linking Multiple Datasets

An article by Brian Rasquinha, Associate Director, Solution Architecture, Privacy Analytics

Many healthcare organizations pool datasets from different sources to build a more complete picture of a patient’s health and treatment. The resulting datasets are invaluable for gathering insights on patients, creating digital twins, creating AI models, testing hypotheses, and other analytics applications.

While organizations are often meticulous about acquiring only de-identified or anonymous data, they are sometimes less aware of the impacts of linking de-identified data.

When two or more de-identified datasets are combined, the risk that an anticipated recipient can identify an individual in the resulting dataset, alone or in combination with other reasonably available information, can increase. This potential increase in identifiability creates a privacy concern that data custodians must manage carefully, given that regulations (and, increasingly, individuals) expect data to be properly de-identified.

Consider, for example, the following scenario involving two datasets, each containing some fields that an adversary could reasonably use to attempt re-identification.

Dataset A: Gender; Year of birth; Race; Marital Status — De-identified (very small risk of re-identification)

Dataset B: Gender; Year of birth; Post/ZIP Code; Ethnicity — De-identified (very small risk of re-identification)

Datasets A and B combined: Gender; Year of birth; Post/ZIP Code; Race; Ethnicity; Marital Status — What is the re-identification risk of the combined dataset?

When datasets A and B are combined, more information becomes available about each individual. As the number of identifiers associated with each individual grows, so does the risk that an anticipated recipient can identify someone in the combined dataset. This is because, for most individuals, the number of people sharing their combination of identifying values (the group size) is likely to decrease. Belonging to a smaller group means a higher chance that the individuals in that group can be singled out, and therefore a higher re-identification risk.
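The group-size effect described above can be illustrated with a small sketch. The records, field values, and link IDs below are entirely synthetic and hypothetical; the sketch simply counts how many records share each combination of quasi-identifier values, before and after linking, under the assumption that the two datasets share a pseudonymous link key.

```python
from collections import Counter

# Hypothetical, synthetic records keyed by a pseudonymous link ID.
# Dataset A quasi-identifiers: gender, year of birth, race, marital status
dataset_a = {
    1: ("F", 1980, "White", "Married"),
    2: ("F", 1980, "White", "Married"),
    3: ("F", 1980, "Black", "Single"),
    4: ("M", 1975, "White", "Single"),
    5: ("M", 1975, "White", "Single"),
    6: ("F", 1980, "Black", "Single"),
}

# Dataset B quasi-identifiers: gender, year of birth, ZIP code, ethnicity
dataset_b = {
    1: ("F", 1980, "90210", "Hispanic"),
    2: ("F", 1980, "10001", "Non-Hispanic"),
    3: ("F", 1980, "90210", "Hispanic"),
    4: ("M", 1975, "10001", "Non-Hispanic"),
    5: ("M", 1975, "10001", "Non-Hispanic"),
    6: ("F", 1980, "10001", "Non-Hispanic"),
}

def group_sizes(records):
    """For each record, the size of its equivalence class: the number
    of records sharing the same combination of quasi-identifier values."""
    counts = Counter(records.values())
    return {rid: counts[vals] for rid, vals in records.items()}

# Linking joins each individual's quasi-identifiers from both sources
# (gender and year of birth appear in both, so B contributes its last two fields).
combined = {rid: dataset_a[rid] + dataset_b[rid][2:] for rid in dataset_a}

for name, data in [("A", dataset_a), ("B", dataset_b), ("A+B", combined)]:
    print(name, "smallest group size:", min(group_sizes(data).values()))
# Each dataset alone has a smallest group of 2; after linking,
# some individuals become unique (smallest group of 1).
```

In this toy example, every record in dataset A and in dataset B shares its values with at least one other record, but after linking, records 1, 2, 3, and 6 each carry a unique combination of values, so their group size drops to 1. Real risk assessments use far more sophisticated measures, but the direction of the effect is the same.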

Linking de-identified datasets can have unanticipated privacy impacts and requires special consideration to ensure the linked result remains appropriately protected. Being aware of the potential impacts is an important first step. Contact the experts at Privacy Analytics to learn more about assessing re-identification risk after data linkage or about potential mitigation techniques.
