Real Anonymization vs Data Masking
After reading Kalev Leetaru’s article, The Big Data Era of Mosaicked Deidentification: Can We Anonymize Data Anymore?, there are a few things that we can agree on.
Leetaru’s article discusses how “anonymized” data sets are increasingly common and increasingly re-identified with ease. He cites Sweeney’s study from 2000 and notes several famously “anonymized” datasets that led to re-identification: Netflix, AOL, and the NYC taxi debacle. These are persuasive examples, and they can prompt people to assume all anonymization is terrible and easily reversed.
He writes, “As more and more organizations begin to release sensitive datasets to the public, the data science community must spend more time thinking about how to safely and responsibly manage this flow of anonymized data that is the lifeblood of the big data era.” Privacy and data use are key ingredients when considering how anonymization can be incorporated into a data sharing workflow.
Real Anonymization vs Data Masking: Not the Same
But there is one point of disagreement. Throughout the article, he talks about “anonymization”. Anonymization is the process of turning data into a form that does not identify individuals and where identification is not likely to take place. None of the examples in his article are examples of anonymization. They are examples of data masking, and poorly done data masking at that. This distinction is key, because there are people and organizations that anonymize data effectively every day; they just don’t make the news like these sensationalized stories.
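The distinction can be made concrete with a toy example. In this minimal sketch (the record, field names, and generalization choices are all hypothetical, not from the article), masking replaces only the direct identifier, while anonymization also generalizes the quasi-identifiers that Sweeney showed can re-identify people:

```python
import hashlib

# A hypothetical patient record for illustration.
record = {"name": "Jane Doe", "zip": "02138", "birth_date": "1965-07-12",
          "sex": "F", "diagnosis": "hypertension"}

# Data masking: the direct identifier (name) is hashed, but the
# quasi-identifiers (ZIP, birth date, sex) survive untouched. Sweeney's
# work showed that this combination alone identifies a large share of
# the US population, so this record is masked, not anonymized.
masked = dict(record,
              name=hashlib.sha256(record["name"].encode()).hexdigest()[:8])

# Anonymization: the direct identifier is dropped entirely and the
# quasi-identifiers are generalized so the record blends into a crowd.
anonymized = {"zip": record["zip"][:3] + "**",        # 3-digit ZIP prefix
              "birth_year_band": "1960-1969",          # 10-year band
              "sex": record["sex"],
              "diagnosis": record["diagnosis"]}
```

The masked record still pinpoints one person; the anonymized record could describe thousands.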
In Sweeney’s case, the de-identification performed wasn’t even compliant with HIPAA’s Safe Harbor method (the minimum standard for de-identifying PHI for secondary use). In the AOL example, the scheme used to anonymize users failed to address the most identifying information of all: their search queries. That data was immediately identifying; 56% of internet users have looked for themselves online.
When you incorporate a risk-based de-identification process, you can be confident that PHI in the data has truly been anonymized. That’s why so many standards and industry guidelines are advocating for this approach, including HITRUST, the Institute of Medicine, and the European Medicines Agency.
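The core idea behind a risk-based process can be sketched with one common risk measure, k-anonymity: the worst-case re-identification risk is 1/k, where k is the size of the smallest group of records sharing the same quasi-identifier values. This is a minimal illustrative sketch, not any vendor's or standard's actual methodology; the data and function name are hypothetical:

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Worst-case re-identification risk, estimated as 1/k where k is
    the smallest equivalence class over the quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    k = min(groups.values())
    return 1.0 / k

# Hypothetical records, already generalized: ages in 10-year bands,
# ZIP codes truncated to their 3-digit prefix.
records = [
    {"age_band": "30-39", "zip3": "021", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "021", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "diabetes"},
]

risk = reidentification_risk(records, ["age_band", "zip3"])
# Each (age_band, zip3) group holds 2 records, so k = 2 and risk = 0.5.
```

In a real workflow, fields would be generalized or suppressed iteratively until the measured risk falls below a threshold appropriate to the data-sharing context.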
Not all regulators and industry groups are ready to dismiss anonymization. To learn more about new and emerging standards around health data de-identification, don’t miss our webinar: De-identification 201.