Learning at Scale: Anonymizing Unstructured Data using AI/ML

An article by Rachel Li, Senior Machine Learning Engineer, Privacy Analytics

Unstructured data, such as medical notes, poses unique challenges with regard to anonymization.

Without a traditional database, there is no straightforward way to point a software solution at the data and transform variables. Unless you have an army of medically trained people at your disposal to read documents, find personal data, and anonymize it, you are more likely to train an AI/ML system to scan the text and discover sensitive personal data for anonymization. However, training an AI/ML system on medical jargon and the structure of medical notes requires significant effort.
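
To make the contrast concrete, here is a minimal rule-based sketch of PHI detection of the kind an AI/ML system would replace. The patterns and sample note are purely illustrative assumptions, not the system described in the article; simple regexes catch well-formatted dates or phone numbers, but names, medical jargon, and context-dependent identifiers are exactly what requires a trained model.

```python
import re

# Illustrative patterns only; a real pipeline needs ML/NER for names,
# medical jargon, and context-dependent identifiers.
PATTERNS = {
    "DATE":  re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def redact(text):
    # Replace each matched span with a placeholder tag for its category
    for label, pat in PATTERNS.items():
        text = pat.sub(f"[{label}]", text)
    return text

note = "Patient seen on 2018-03-14; callback 613-555-0142."
print(redact(note))  # Patient seen on [DATE]; callback [PHONE].
```

A rule set like this is brittle: "March 14, 2018" or a patient surname slips straight through, which is why the article turns to machine learning.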

These challenges are being felt by drug manufacturers, which must make their anonymized clinical study reports (CSRs) publicly available after a regulatory decision: under the EMA's Policy 0070 and, since 2019, under Health Canada's equivalent requirement (which motivates industry leaders to “proactively anonymize clinical studies and data at scale”). Prominent academic researchers are calling for the FDA to follow suit. Both the EMA and Health Canada recommend quantitative, risk-based anonymization approaches. As these transparency initiatives gather momentum, the number of CSRs that sponsors need to anonymize is growing rapidly, and a typical CSR runs to 5,000 pages of narratives and summary statistics. This is driving the push to scale risk-based anonymization in a cost-effective manner.

In a paper published jointly by Privacy Analytics and Vanderbilt University researchers (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6568071/) at the AMIA 2019 Informatics Summit, which received the Best Data Science Paper Award, we focus specifically on streamlining the anonymization of such unstructured data with machine learning and AI.

When the student becomes the master

One of the major caveats of using machine learning on real-world data is its continuous demand for high-quality annotated training data and, consequently, the amount of human effort involved in manually tagging documents to create annotations. To alleviate the overall cost and to support a more scalable anonymization pipeline, we proposed and demonstrated an active-learning-based framework in the paper. Instead of iteratively selecting data at random from a dataset for training (known as passive learning), the new pipeline lets the machine learning system intelligently select the data to be annotated by humans, and then learns from this selectively annotated data. The system improves its accuracy throughout the process and eventually requires less human effort, thereby reducing the cost of anonymization.

We conducted a series of controlled, systematic simulations on real-world CSRs and a publicly available dataset to evaluate the performance of the active learning pipeline. The results showed that active learning can yield comparable or even better performance with up to 50% less training data than passive learning.

A follow-up internal user study comparing the active learning pipeline with the rule-based anonymization system previously in use further showed that the active learning pipeline has the potential to reduce the overall time to complete PHI detection for 100 pages of CSRs by 80%.

Teaching to inspire efficiency wins

Advances in AI/ML such as these demonstrate Privacy Analytics’ ability to scale the anonymization of unstructured data and allow our in-house experts to handle more studies and focus on higher-level tasks. Knowledgeable “teachers” will still be needed to annotate data and guide AI/ML systems, but the growing volumes of unstructured medical data will be met with increasingly effective automation through efficiency gains, especially with the aid of more advanced technologies (e.g., deep neural networks). This will also mean the ability to leverage more sensitive data, be it for transparency or competitive advantage.
