Deep-Diving into Re-identification: Perspectives On An Article In Nature Communications
by Dr. Khaled El-Emam – General Manager of Privacy Analytics
Recently, an article in Nature Communications caught our attention, primarily because it discussed re-identification risk. It also caught the attention of the media.
Misconceptions can get amplified
Unfortunately, media interpretation of the article resulted in misconceptions about re-identification risk being amplified.
In fairness to the Nature authors and their article, the media headlines were inconsistent with the paper itself. They perhaps stemmed from reporting on the authors’ verbal comments and on press releases from their institutions, from which headline-worthy points were drawn.
So what exactly was said in the article? Let’s get into that now.
First off, the article is titled “Estimating the success of re-identifications in incomplete data using generative models”, by Luc Rocher, Julien M. Hendrickx and Yves-Alexandre de Montjoye. It was published in Nature Communications on July 23, 2019.
What did Rocher et al. focus on?
The Nature authors’ focus in the article is a new method they developed for estimating population uniqueness from sample uniqueness.
Population uniqueness is a classic measure of re-identification risk, with multiple decades of research on estimators. The authors present several empirical studies that show their estimator to be very accurate. They then apply their estimator to various datasets and scenarios to make more general commentary on re-identification risk (as measured by population uniqueness).
Academics often treat uniqueness and re-identification synonymously
Before going further, it is important to note that estimating uniqueness is not the same as an actual re-identification. When real-world data errors and other practical factors are considered, the likelihood of a correct re-identification is often an order of magnitude lower. Academic researchers frequently treat uniqueness as synonymous with re-identification, but there is a nontrivial difference in practice. Nevertheless, as part of a de-identification exercise, one would want to ensure that uniqueness is low.
“Release-and-forget” is not a current go-to model
The underlying assumption in the analysis in this paper was that of a “release-and-forget” model. This is suitable when the data are released publicly with no controls. But the current state of practice for non-public data uses and disclosures is not release-and-forget.
The need to stay in control
In practice, for non-public releases, data consumers/recipients are required to implement controls to manage re-identification risk. It is the totality of data risk and the residual risk after implementing these controls that matters. The authors did not consider the role of controls to manage risk, and therefore their risk estimates are inflated. Again, controls can reduce re-identification risk by an order of magnitude in practice.
Thoughts on sub-sampling
The authors make the general point that sub-sampling may not be a very effective approach to managing re-identification risk, though they note that it provides plausible deniability. (The paper describes how an adversary can’t be sure they have re-identified the right person even if they found a match.)
This limitation to sub-sampling is a well-known phenomenon and the reason why most sub-samples from national statistical agencies tend to be very small (1% to 2%). Even with small sub-samples, the beneficial impact can be limited if you can determine which records in that small sub-sample are also unique in the population.
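The distinction between sample uniques and population uniques can be illustrated with a minimal Python sketch. The population, the sub-sample and the single-letter quasi-identifier values below are all hypothetical and chosen only to keep the arithmetic visible; this is not the estimator from the paper.

```python
from collections import Counter

def uniques(rows):
    """Return the quasi-identifier values that occur exactly once in rows."""
    counts = Counter(rows)
    return {value for value, n in counts.items() if n == 1}

# Hypothetical population: each letter stands for one combination of
# quasi-identifier values (for example age band, sex and region).
population = ["A", "A", "A", "B", "B", "C", "D", "D", "D", "D"]

# Hypothetical 40% sub-sample that happens to contain one record per combination.
sample = ["A", "B", "C", "D"]

sample_uniques = uniques(sample)
population_uniques = uniques(population)

# Sample uniques that are also unique in the population.
truly_unique = sample_uniques & population_uniques
share_truly_unique = len(truly_unique) / len(sample_uniques)

print(sorted(sample_uniques), sorted(truly_unique), share_truly_unique)
```

In this toy case every sampled record is a sample unique, yet only one of the four is also unique in the population. An adversary who cannot tell which one gains little from the match, while one who can identify the population unique loses nothing from the sub-sampling.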
Observations about the authors’ analysis
The authors make several noteworthy points:
- A sample unique is at a high risk of re-identification. This is true in that sample uniques will have a higher risk of re-identification than non-uniques. A sample unique is more likely to be a population unique, and population uniques are more highly re-identifiable than non-uniques.
- Even if sample uniqueness is low, the likelihood of a match can still be high for records that are not sample uniques. This is because doubles (records that share their quasi-identifier values with exactly one other record) have a 0.5 probability of being re-identified in a release-and-forget model.
- The more quasi-identifiers included in the dataset, the higher the estimated population uniqueness.
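The relationship between group size and match probability in the points above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the toy records and their three quasi-identifiers are hypothetical, the dataset is treated as the whole population, and the adversary is assumed to pick one record at random from the group that matches the target.

```python
from collections import Counter

# Hypothetical records: (year of birth, sex, postal code prefix).
records = [
    (1980, "F", "K1A"),
    (1980, "F", "K1A"),
    (1975, "M", "K2B"),
    (1990, "F", "K1A"),
    (1990, "F", "K1A"),
    (1990, "F", "K1A"),
]

def match_probability(rows):
    """Per-record probability of a correct match: 1 / size of the record's group."""
    sizes = Counter(rows)
    return [1 / sizes[row] for row in rows]

def uniqueness(rows):
    """Proportion of records whose quasi-identifier values occur exactly once."""
    sizes = Counter(rows)
    return sum(1 for row in rows if sizes[row] == 1) / len(rows)

probabilities = match_probability(records)
print(uniqueness(records))   # one unique record out of six
print(probabilities[0])      # a double: 1 / 2
print(probabilities[2])      # a unique: 1 / 1
```

Only one record in six is unique here, but the two doubles each carry a match probability of 0.5, which is why low uniqueness alone does not imply low risk. Adding a fourth quasi-identifier to the tuples could only split groups further, never merge them, which is consistent with the third point.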
Moving from correct to problematic observations
These preceding observations are correct and generally understood in the disclosure control community.
However, there are some additional points about the paper that are problematic:
- All the analysis that the authors performed was on pseudonymous data. That is why the uniqueness values that they were getting tended to be high. No actual de-identification was performed on the data. It is not a surprise that the probability of re-identification, however measured, will be high on raw/original data. This is an important detail missed in the overall narrative on this paper.
- Based on our own data and experience, there are, on average, seven demographic and socio-economic quasi-identifiers in a health dataset. The 15 demographic and socio-economic quasi-identifiers assumed in the paper are therefore exaggerated and not consistent with what we see in real data.
How does this article relate to the Privacy Analytics methodology?
This paper has little impact on the Privacy Analytics methodology and instead supports the methodology in many regards. The paper either does not account for important elements of the methodology or emphasizes a point that our methodology already addresses.
- The Privacy Analytics methodology does not use sub-sampling as a method to manage the risk of re-identification due to known limitations of sub-sampling as a risk control measure.
- We do not generally support the release-and-forget model for data uses and disclosures. Our methodology has always incorporated a set of controls that are considered when assessing the overall risk. The only exception is public data releases. (More details on this below.)
- By default, the Privacy Analytics methodology uses the strict average risk metric, and we incorporate uniqueness in our risk measurement for demographic variables (including all the ones mentioned in the paper). Our uniqueness estimates account for sub-sampling effects.
- As part of the Privacy Analytics methodology, if the risk is found to be high then additional transformations are applied to the data or additional controls are put into place. Therefore, even in situations where the raw data has a high risk of re-identification, the second step modifies the conditions to reduce the risk.
- For public data releases, we use the maximum risk methodology. For public releases, we apply generally recommended population group sizes of 11. Therefore, there should be no population uniques in the dataset.
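To make the difference between average and maximum risk concrete, here is a minimal Python sketch. It uses common textbook definitions (average risk as the mean of 1 / group size over records, maximum risk as 1 / smallest group size), which are assumptions here and may differ from the strict average variant used in the Privacy Analytics methodology. The function names and the toy data are hypothetical, and the rows are treated as the full population.

```python
from collections import Counter

def risk_metrics(rows):
    """Average and maximum risk from group sizes over the quasi-identifiers."""
    sizes = Counter(rows)
    smallest = min(sizes.values())
    return {
        # The mean of 1 / group size over all records simplifies to
        # (number of groups) / (number of records).
        "average_risk": len(sizes) / len(rows),
        "maximum_risk": 1 / smallest,
        "smallest_group": smallest,
    }

def meets_min_group_size(rows, k=11):
    """True if every combination of quasi-identifier values covers at least k records."""
    return min(Counter(rows).values()) >= k

# Hypothetical release with two groups of 11 and 22 records.
released = ["X"] * 11 + ["Y"] * 22
metrics = risk_metrics(released)
print(metrics)
print(meets_min_group_size(released))          # every group has at least 11 records
print(meets_min_group_size(released + ["Z"]))  # a single unique record breaks the rule
```

Under these definitions, a minimum group size of 11 caps the maximum risk at 1 / 11 and rules out uniques by construction, whereas the average risk can stay low even when a few small groups remain.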
A few additional problematic statements
The paper makes some incorrect characterizations of the literature. For example, the Australian data release incident was referred to as a re-identification of patients; in actuality, it was the doctor IDs that were reversed. (No patients were re-identified.) The paper also refers to the Weld example, which was pre-HIPAA and does not reflect existing regulations. A German re-identification example provided in the paper was performed on pseudonymized data.
The bottom line?
The authors did develop an accurate method for estimating population uniqueness from a sample. This estimator could potentially be an important contribution to the field of risk estimation (pending further validation, replication and assessment of whether the assumptions are practical).
Beyond that, however, the authors’ other analysis makes a limited contribution to our understanding of re-identification risk or how to manage it.