An article by Rachel Li, Senior Machine Learning Engineer, Privacy Analytics
Unstructured data, such as medical notes, pose unique challenges with regard to anonymization.
Without a traditional database, you can’t simply point a software solution at the data and transform variables in a straightforward manner. Unless you have an army of medically trained people at your disposal to read documents, find personal data, and anonymize it, you’re more likely to train an AI/ML system to scan the text and discover sensitive personal data for anonymization. However, training an AI/ML system on medical jargon and the structure of medical notes requires significant effort.
These challenges are being felt by drug manufacturers that need to make their anonymized clinical study reports (CSRs) publicly available after a regulatory decision is made by the EMA under Policy 0070 and, starting in 2019, by Health Canada (which motivates industry leaders to “proactively anonymize clinical studies and data at scale”). Prominent academic researchers are calling for the FDA to follow. Both the EMA and Health Canada recommend the use of quantitative risk-based anonymization approaches. As such transparency initiatives gather momentum, the number of CSRs that sponsors need to anonymize is growing rapidly, and a typical CSR has 5,000 pages, including narratives and summary statistics. This creates pressure to scale risk-based anonymization in a cost-effective manner.
In a paper published jointly by Privacy Analytics and Vanderbilt University researchers (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6568071/), presented at the AMIA 2019 Informatics Summit, where it received the Best Data Science Paper Award, we focus specifically on streamlining the anonymization of such unstructured data with machine learning and AI.
When the student becomes the master
One of the major challenges in using machine learning on real-world data is its continuous demand for high-quality annotated training data and, consequently, the amount of human effort needed to tag documents manually to create those annotations. To reduce the overall cost and support a more scalable anonymization pipeline, we proposed and demonstrated an active-learning-based framework in the paper. Instead of iteratively selecting data at random from a dataset for training (known as passive learning), the new pipeline lets the machine learning system intelligently select the data to be annotated by a human, and then learn from this selectively annotated data. The system learns and improves its accuracy throughout the process and eventually requires less human effort, thereby reducing the cost of the anonymization process.
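To make the contrast between active and passive learning concrete, here is a minimal Python sketch of a single selection step. It assumes uncertainty sampling, one common active learning strategy, which is not necessarily the strategy used in the paper. The function names, the toy sentences, and the confidence scores are all hypothetical and stand in for a real PHI detector.

```python
import random
from typing import Callable, List, Sequence


def uncertainty(p_sensitive: float) -> float:
    """Score in [0, 1]; highest when the model is least sure (p = 0.5)."""
    return 1.0 - abs(p_sensitive - 0.5) * 2.0


def select_active(pool: Sequence[str],
                  predict_proba: Callable[[str], float],
                  batch_size: int) -> List[int]:
    """Active step: pick the pool items the model is least sure about."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: uncertainty(predict_proba(pool[i])),
                    reverse=True)
    return ranked[:batch_size]


def select_passive(pool: Sequence[str], batch_size: int,
                   rng: random.Random) -> List[int]:
    """Passive step: pick pool items uniformly at random."""
    return rng.sample(range(len(pool)), min(batch_size, len(pool)))


# Hypothetical model confidence that each sentence contains personal data.
toy_scores = {
    "Patient was seen on 12 March.": 0.95,
    "Dose was increased to 20 mg.": 0.05,
    "Referred by Dr. Green at the clinic.": 0.55,
    "Subject reported mild nausea in May.": 0.40,
}
pool = list(toy_scores)

# The two sentences with scores nearest 0.5 go to the human annotator first.
picked = select_active(pool, toy_scores.__getitem__, batch_size=2)
```

In a full loop, the human would annotate the selected sentences, the model would be retrained on them, and the selection would be repeated on the remaining pool. The passive variant differs only in how the batch is chosen.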
We conducted a series of controlled and systematic simulations on real-world CSRs and a publicly available dataset to evaluate the performance of the active learning pipeline. The results showed that active learning can yield comparable or even better performance with up to 50% less training data than passive learning.
A follow-up internal user study comparing the active learning pipeline with a rule-based anonymization system previously in use further showed that the active learning pipeline has the potential to reduce the overall time needed to finish PHI detection for 100 pages of CSRs by 80%.
Teaching to inspire efficiency wins
Advances in AI/ML such as these demonstrate Privacy Analytics’ ability to scale the anonymization of unstructured data and allow our in-house experts to handle more studies and focus on higher-level tasks. Knowledgeable “teachers” will still be needed to annotate data and guide AI/ML systems, but the growing volumes of unstructured medical data will be met with increasingly effective automation, especially with the aid of more advanced technologies (e.g., deep neural networks). This will also mean the ability to leverage more sensitive data, be it for transparency or competitive advantage.