Learning at Scale: Anonymizing Unstructured Data using AI/ML
by Rachel Li, Ph.D. – Senior Machine Learning Engineer
Unstructured data, such as medical notes, poses unique challenges for anonymization. Without a traditional database, you cannot simply point a software solution at the data and transform variables in a straightforward way. Unless you have an army of medically trained people at your disposal to read documents, find personal data, and anonymize it, you are more likely to train an AI/ML system to scan text and discover sensitive personal data for anonymization. However, training an AI/ML system on medical jargon and the structure of medical notes requires significant effort.
These challenges are being felt by drug manufacturers that need to make their anonymized clinical study reports (CSRs) publicly available after a regulatory decision: under the EMA's Policy 0070 and, starting in 2019, under Health Canada's requirements (which motivate industry leaders to "proactively anonymize clinical studies and data at scale"). Prominent academic researchers are calling for the FDA to follow suit. Both the EMA and Health Canada recommend quantitative, risk-based anonymization approaches. As these transparency initiatives gather momentum, the number of CSRs that sponsors need to anonymize is growing rapidly, and a typical CSR runs to 5,000 pages of narratives and summary statistics. This is driving the push to scale risk-based anonymization in a cost-effective manner.
When the Student Becomes the Master
One of the major caveats of applying machine learning to real-world data is its continuous demand for high-quality annotated training data and, consequently, the amount of human effort involved in manually tagging documents to create those annotations. To reduce the overall cost and support a more scalable anonymization pipeline, we proposed and demonstrated an active-learning-based framework in the paper. Instead of iteratively selecting data at random from a dataset for training (known as passive learning), the new pipeline allows the machine learning system to intelligently select the data to be annotated by humans and then to learn from this selectively annotated data. The system improves its accuracy throughout the process and eventually requires less human effort, thereby reducing the cost of anonymization.
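The selection loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration of pool-based active learning with uncertainty sampling, not the pipeline from the paper: the toy token-frequency "classifier", the snippets, and the simulated annotator are all stand-ins for a real PHI detection model and human labelers.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# The model repeatedly asks a (simulated) human to label the snippet it is
# least sure about, instead of labeling the whole pool up front.
from collections import Counter

def train(labeled):
    """Build per-class token counts from (tokens, label) pairs."""
    counts = {0: Counter(), 1: Counter()}
    for tokens, label in labeled:
        counts[label].update(tokens)
    return counts

def prob_phi(counts, tokens):
    """Crude smoothed probability that a snippet contains PHI (class 1)."""
    s1 = sum(counts[1][t] for t in tokens) + 1
    s0 = sum(counts[0][t] for t in tokens) + 1
    return s1 / (s0 + s1)

def most_uncertain(counts, pool):
    """Pick the pool item whose predicted probability is closest to 0.5."""
    return min(pool, key=lambda tokens: abs(prob_phi(counts, tokens) - 0.5))

# Toy corpus: snippets labeled 1 if they contain personal data.
oracle = {  # plays the role of the human annotator
    ("patient", "john", "dob"): 1,
    ("dose", "mg", "daily"): 0,
    ("subject", "address", "ottawa"): 1,
    ("placebo", "arm", "results"): 0,
}
labeled = [(("patient", "john", "dob"), 1), (("dose", "mg", "daily"), 0)]
pool = [t for t in oracle if t not in dict(labeled)]

while pool:
    model = train(labeled)
    pick = most_uncertain(model, pool)     # system chooses what to annotate
    labeled.append((pick, oracle[pick]))   # "human" supplies the label
    pool.remove(pick)

model = train(labeled)
print(round(prob_phi(model, ("subject", "address", "ottawa")), 2))
```

In a real pipeline the scoring function would come from the trained PHI detector, and the loop would stop once accuracy plateaus rather than when the pool is empty; the point is that the query strategy, not random sampling, decides where human effort goes.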
We conducted a series of controlled and systematic simulations on real-world CSRs and a publicly available dataset to evaluate the performance of the active learning pipeline. The results showed that active learning can yield comparable and even better performance with up to 50% less training data than passive learning.
A follow-up internal user study comparing the active learning pipeline with a rule-based anonymization system previously in use further showed that the active learning pipeline has the potential to reduce the overall time to complete PHI detection for 100 pages of CSRs by 80%.
Teaching to Inspire Efficiency Wins
Advances in AI/ML such as these demonstrate Privacy Analytics' ability to scale the anonymization of unstructured data and allow our in-house experts to handle more studies and focus on higher-level tasks. Knowledgeable "teachers" will still be needed to annotate for and guide AI/ML systems, but growing volumes of unstructured medical data will be met with increasingly effective automation, especially with the aid of more advanced technologies (e.g., deep neural networks). This will also mean the ability to leverage more sensitive data, be it for transparency or competitive advantage.