Many healthcare organizations pool datasets from different sources to build a more complete picture of a patient’s health and treatment. The resulting datasets are invaluable for gathering insights on patients, building digital twins, training AI models, testing hypotheses, and supporting other analytics applications.
While organizations are often meticulous about acquiring only de-identified or anonymous data, they are sometimes less aware of the impacts of linking de-identified data.
When two or more de-identified datasets are combined, the risk that an anticipated recipient can identify an individual in the resulting dataset, alone or in combination with other reasonably available information, can increase. This potential increase in identifiability creates a privacy concern that data custodians must manage carefully, given that regulations (and, increasingly, individuals) expect data to be properly de-identified.
Consider, for example, the following scenario involving two datasets, each containing some fields that an adversary could reasonably use to attempt re-identification.
Dataset A | Dataset B | Datasets A and B Combined |
---|---|---|
Gender, Year of birth, Race, Marital Status | Gender, Year of birth, Post/ZIP Code, Ethnicity | Gender, Year of birth, Post/ZIP Code, Race, Ethnicity, Marital Status |
De-identified (very small risk of re-identification) | De-identified (very small risk of re-identification) | What is the re-identification risk of the combined dataset? |
When datasets A and B are combined, more information becomes available about each individual. As the number of identifiers associated with each individual increases, so does the risk that an anticipated recipient can identify an individual in the combined dataset. This is because, for most individuals, the number of people who share the same set of identifying values (the group size) is likely to decrease. Belonging to a smaller group means a higher chance that individuals in that group can be identified, resulting in a higher re-identification risk.
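To make the group-size effect concrete, here is a minimal Python sketch. It rests on some assumptions: the eight records are made-up toy values, not real data; the names `RECORDS`, `group_sizes`, and `smallest_group` are hypothetical; and it uses the common simplification that a record's risk is roughly 1 divided by its group size. A real risk assessment would consider much more than this.

```python
from collections import Counter

# Field order for every toy record below (all values are invented).
FIELDS = ("gender", "year_of_birth", "race", "marital_status", "zip_code", "ethnicity")

RECORDS = [
    ("F", 1980, "White", "Married", "10001", "Non-Hispanic"),
    ("F", 1980, "White", "Married", "10002", "Non-Hispanic"),
    ("F", 1980, "White", "Married", "10001", "Non-Hispanic"),
    ("F", 1980, "White", "Married", "10002", "Non-Hispanic"),
    ("F", 1980, "Black", "Single", "10001", "Non-Hispanic"),
    ("F", 1980, "Black", "Single", "10002", "Non-Hispanic"),
    ("F", 1980, "Black", "Single", "10001", "Non-Hispanic"),
    ("F", 1980, "Black", "Single", "10002", "Non-Hispanic"),
]

# Fields visible in each dataset, following the table above.
DATASET_A_FIELDS = ("gender", "year_of_birth", "race", "marital_status")
DATASET_B_FIELDS = ("gender", "year_of_birth", "zip_code", "ethnicity")
COMBINED_FIELDS = ("gender", "year_of_birth", "zip_code", "race", "ethnicity", "marital_status")


def group_sizes(records, fields):
    """For each record, count how many records share its values on `fields`."""
    positions = [FIELDS.index(name) for name in fields]
    keys = [tuple(record[i] for i in positions) for record in records]
    counts = Counter(keys)
    return [counts[key] for key in keys]


def smallest_group(records, fields):
    """Size of the smallest group; 1 / this value is the worst-case risk (simplified)."""
    return min(group_sizes(records, fields))


for label, fields in [
    ("Dataset A", DATASET_A_FIELDS),
    ("Dataset B", DATASET_B_FIELDS),
    ("Combined", COMBINED_FIELDS),
]:
    size = smallest_group(RECORDS, fields)
    print(f"{label}: smallest group = {size}, worst-case risk = 1/{size}")
```

In this toy example, every individual shares their values with three others in Dataset A and in Dataset B taken separately (groups of four), but with only one other person once the fields are combined (groups of two). Under the simplified measure, the worst-case risk doubles from 1/4 to 1/2 even though no new individuals were added.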
Linking de-identified datasets can have unanticipated privacy impacts, and it needs special consideration to ensure that the linked data remains appropriately de-identified. Being aware of the potential impacts is an important first step. Contact the experts at Privacy Analytics to learn more about assessing the risks after data linkage or about potential mitigation techniques.