Guess again: De-identification does work
As I scan the media, I tend to notice a lot of arguments around the notion that de-identification does not work or that you can’t de-identify health data. This blog post serves as my standard rebuttal to those very arguments.
The naysayers' central argument tends to rest on re-identification attacks against data that were not properly de-identified in the first place. When presented with this type of argument, the first questions should be: what type of data was it, and was it de-identified properly? If the response does not mention a specific standard or include a discussion of risk, then I know the logic is flawed, since those two elements are fundamental to a properly de-identified dataset.
There have been a small number of examples, through real and commissioned attacks, where the data had been de-identified properly (using a particular standard or methodology). In those cases the success rate was very low, ranging from 0 to a very small figure (0.013% in one case). So the “de-identification doesn’t work” narrative is faulty: the story is being retold in the absence of actual evidence.
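To make the "discussion around risk" concrete, here is a minimal sketch of one common way to put a number on re-identification risk: group records by their quasi-identifiers and take one over the size of each group. This is an illustration under stated assumptions, not the method used in the attacks cited above or by any particular standard. The field names (`age_band`, `zip3`, `sex`), the sample records, and the function name are all hypothetical.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Estimate risk from equivalence class sizes (illustrative only).

    Records that share the same quasi-identifier values form one class.
    A record in a class of size k is assigned a risk of 1/k.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    per_record = [1 / class_sizes[k] for k in keys]
    return {
        "max_risk": max(per_record),                    # worst-case record
        "avg_risk": sum(per_record) / len(per_record),  # average across records
        "smallest_class": min(class_sizes.values()),    # the "k" in k-anonymity
    }

# Hypothetical, already-generalized records (age bands, 3-digit ZIP).
records = [
    {"age_band": "30-39", "zip3": "981", "sex": "F"},
    {"age_band": "30-39", "zip3": "981", "sex": "F"},
    {"age_band": "30-39", "zip3": "981", "sex": "F"},
    {"age_band": "40-49", "zip3": "982", "sex": "M"},
    {"age_band": "40-49", "zip3": "982", "sex": "M"},
]

result = reidentification_risk(records, ["age_band", "zip3", "sex"])
print(result)
```

A properly de-identified dataset is one where figures like these have been measured and brought below a threshold appropriate to the release context, which is exactly the detail missing from most "de-identification doesn't work" stories.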
Why is the argument so prevalent?
I think part of the problem can be attributed to the concept of “confirmation bias” found in behavioral economics. Confirmation bias is the tendency to search for, interpret, or recall information in a way that confirms or reinforces one’s beliefs. The problem here is that the naysayer argument becomes a red herring in a conversation that should be centered on increasing the adoption of good practices. Instead, it distracts from that conversation and inhibits the many benefits of sharing de-identified health data for secondary purposes. This is best exemplified by the Washington State re-identification attack in 2013. The impact was immediate, as the State of Washington reduced its willingness to share data. That reaction was unfortunate because it stifled access to valuable datasets that were important for public health and other types of analytics. Thankfully, it was temporary (even though temporary was still a long period of time). This example shows the negative impact red herrings have on the ability of researchers to gain access to very valuable data. The reality is that limiting access to health data for secondary purposes stifles research and innovation that could lead to the betterment of all of us.
Additional reading on the subject can be found here:
Big Data and Innovation, Setting the Record Straight: De-identification Does Work
Do we have to worry about re-identification attacks upon our health data?
A Systematic Review of Re-Identification Attacks on Health Data