When linking datasets from different sources to build out a rich and detailed data lake, some less obvious privacy and compliance impacts can arise, even if the datasets being linked are de-identified.
As explored in a previous article, when two or more de-identified datasets are combined, the risk that an anticipated recipient can identify an individual in the resulting dataset, alone or in combination with other reasonably available information, can increase. Here, we explore some approaches that data custodians can use to mitigate the re-identification risk and ensure the privacy of individuals in the dataset.
Using a Data Sharing Agreement to Mitigate Unanticipated Linking
In some situations, a data custodian shares a de-identified dataset with a recipient and needs to ensure that the recipient does not compromise the privacy of the individuals in the dataset. If the recipient links the dataset with other data in a way the custodian did not anticipate or account for, re-identification risk can be introduced. To preserve as much utility in the de-identified dataset as possible, the custodian should consider contractually limiting, for example in a data sharing agreement, the scenarios under which the recipient can link the de-identified dataset with other data.
For a public release, data custodians must account for the fact that contractual controls cannot prevent recipients from linking the data against other public or non-public datasets.
Linking Datasets and Assessing Re-identification Risk
In other situations, a data custodian may want to link two or more de-identified datasets. As a first step, the custodian must establish the appropriate permissions and authority to perform this linking. Assuming the appropriate authority is in place, special policy and technical considerations are required: given that re-identification risk can increase when linking two de-identified datasets, the result after linking may no longer be considered de-identified. If the process of linking renders the data more identifiable than the permissible threshold, the linking could be in breach of privacy regulations and policy.
A privacy expert can determine the re-identification risk associated with the linked dataset and, where necessary, recommend data transformations or stronger protective measures in the data release context (e.g., security or contractual controls) to manage that risk.
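To make the effect of linking concrete, here is a minimal sketch in Python. It assumes a simple risk measure that the article itself does not prescribe: the maximum risk is taken as one divided by the size of the smallest group of records sharing the same quasi-identifier values. The datasets, field names, and the `THRESHOLD` value are all hypothetical, and a real assessment would also account for the anticipated recipient and the release context.

```python
from collections import Counter


def max_risk(records, quasi_identifiers):
    """Return 1 / (size of the smallest group of records that share
    the same values for the given quasi-identifiers)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1 / min(groups.values())


# Two hypothetical de-identified datasets that share a token ID.
claims = [
    {"token": "a1", "age_band": "30-39", "region": "East"},
    {"token": "b2", "age_band": "30-39", "region": "East"},
    {"token": "c3", "age_band": "40-49", "region": "West"},
    {"token": "d4", "age_band": "40-49", "region": "West"},
]
labs = [
    {"token": "a1", "sex": "F"},
    {"token": "b2", "sex": "M"},
    {"token": "c3", "sex": "F"},
    {"token": "d4", "sex": "M"},
]

# Link the two datasets on the shared token.
labs_by_token = {r["token"]: r for r in labs}
linked = [
    {**c, **labs_by_token[c["token"]]}
    for c in claims
    if c["token"] in labs_by_token
]

THRESHOLD = 0.5  # hypothetical permissible threshold, for illustration only

risk_claims = max_risk(claims, ["age_band", "region"])
risk_labs = max_risk(labs, ["sex"])
risk_linked = max_risk(linked, ["age_band", "region", "sex"])

for name, risk in [("claims", risk_claims), ("labs", risk_labs), ("linked", risk_linked)]:
    status = "exceeds" if risk > THRESHOLD else "within"
    print(f"{name}: max risk {risk:.2f} ({status} threshold)")
```

In this toy data each source dataset has groups of two records, but after linking every record is unique on the combined quasi-identifiers, so the linked result would need further transformation or stronger controls before release.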
Choosing Individual IDs and Tokenization Carefully
Most de-identified datasets incorporate some form of individual ID to uniquely distinguish an individual within the dataset. Increasingly, these de-identified individual IDs are assigned using third-party tokenization engines to enable potential linking with other assets. Careful choice of the engines and secret keys used to assign de-identified IDs gives custodians more control over whether de-identified datasets can be straightforwardly linked. Our next article will explore definitions and considerations for tokenization in more detail.
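As a rough illustration of why the choice of secret key matters, the sketch below assumes a keyed hash (HMAC-SHA256 from the Python standard library) as the tokenization method. This is only one possible approach, not a description of how any particular third-party engine works; the `tokenize` helper, the keys, and the identifier are hypothetical, and a real deployment would also need identifier normalization and secure key management.

```python
import hashlib
import hmac


def tokenize(identifier: str, secret_key: bytes) -> str:
    """Derive a de-identified ID from an identifier using a keyed hash."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical keys: one per project or data release.
key_project_a = b"hypothetical-key-for-project-a"
key_project_b = b"hypothetical-key-for-project-b"

token_a = tokenize("patient-12345", key_project_a)
token_a_again = tokenize("patient-12345", key_project_a)
token_b = tokenize("patient-12345", key_project_b)

# The same key yields the same token, so datasets tokenized with it can be joined.
same_key_links = token_a == token_a_again

# A different key yields an unrelated token, so a direct join on the ID fails.
different_key_links = token_a == token_b

print(same_key_links, different_key_links)
```

Under these assumptions, datasets tokenized with the same key can be joined directly on the de-identified ID, while datasets tokenized with different keys cannot, which is one way a custodian can decide in advance which linkages are straightforward.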
The privacy impacts of linking de-identified datasets can be managed effectively when they are well-understood and carefully handled. Contact the experts at Privacy Analytics to learn more or discuss data linkages your organization is pursuing.