Learning at Scale: Anonymizing Unstructured Data using AI/ML

An article by Rachel Li, Senior Machine Learning Engineer, Privacy Analytics

Unstructured data, such as medical notes, pose unique challenges for anonymization.

Without a traditional database, you can’t simply point a software solution at the data and transform variables in a straightforward manner. Unless you have an army of medically trained people at your disposal to read documents, find personal data, and anonymize it, you’re more likely to train an AI/ML system to scan text and discover sensitive personal data for anonymization. However, training an AI/ML system on medical jargon and the structure of medical notes requires significant effort.

These challenges are being felt by drug manufacturers, which need to make their anonymized clinical study reports (CSRs) publicly available after a regulatory decision is made by the EMA under Policy 0070 and, starting in 2019, by Health Canada (which motivates industry leaders to “proactively anonymize clinical studies and data at scale”). Prominent academic researchers are calling for the FDA to follow. Both the EMA and Health Canada recommend quantitative risk-based anonymization approaches. As such transparency initiatives gather momentum, the number of CSRs that sponsors need to anonymize is growing rapidly, and a typical CSR has 5,000 pages, including narratives and summary statistics. This growth is driving the need to scale risk-based anonymization in a cost-effective manner.

In a paper published jointly by Privacy Analytics and Vanderbilt University researchers (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6568071/) at the AMIA 2019 Informatics Summit, which received the Best Data Science Paper Award, we focus specifically on streamlining the anonymization of such unstructured data with machine learning and AI.

When the student becomes the master

One of the major challenges in using machine learning on real-world data is its continuous demand for high-quality annotated training data and, consequently, the amount of human effort involved in manually tagging documents to create annotations. To reduce the overall cost and support a more scalable anonymization pipeline, we proposed and demonstrated an active-learning-based framework in the paper. Instead of iteratively selecting data at random from a dataset for training (known as passive learning), the new pipeline allows the machine learning system to intelligently select the data to be annotated by a human, and then learns from this selectively annotated data. The system learns and improves its accuracy throughout the process and eventually requires less human effort, thereby reducing the cost of the anonymization process.
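The difference between passive and active selection can be illustrated with a minimal Python sketch. It is not the pipeline from the paper: the query strategy shown here (least-confidence uncertainty sampling) is an assumption chosen for illustration, and every name in it (`select_passive`, `select_active`, `active_learning_loop`, `toy_probs`, and the `annotate`, `train`, and `predict_proba_for` callables) is hypothetical.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

Doc = str  # hypothetical stand-in for a document or sentence awaiting annotation


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty score: 1 minus the probability of the most likely label."""
    return 1.0 - max(probs)


def select_passive(pool: List[Doc], batch_size: int, rng: random.Random) -> List[Doc]:
    """Passive learning: choose the next items to annotate uniformly at random."""
    return rng.sample(pool, min(batch_size, len(pool)))


def select_active(
    pool: List[Doc],
    batch_size: int,
    predict_proba: Callable[[Doc], Sequence[float]],
) -> List[Doc]:
    """Active learning (assumed strategy): choose the items the model is least sure about."""
    ranked = sorted(pool, key=lambda doc: least_confidence(predict_proba(doc)), reverse=True)
    return ranked[:batch_size]


def active_learning_loop(
    pool: List[Doc],
    annotate: Callable[[Doc], Any],                       # the human annotator
    train: Callable[[List[Tuple[Doc, Any]]], Any],        # fits a model on labeled data
    predict_proba_for: Callable[[Any, Doc], Sequence[float]],
    batch_size: int,
    rounds: int,
) -> Tuple[Any, List[Tuple[Doc, Any]]]:
    """Alternate between model-driven selection, human annotation, and retraining."""
    remaining = list(pool)
    labeled: List[Tuple[Doc, Any]] = []
    model = None
    for _ in range(rounds):
        if not remaining:
            break
        if not labeled:
            # No model exists yet, so seed with the first items in the pool.
            batch = remaining[:batch_size]
        else:
            batch = select_active(
                remaining, batch_size, lambda doc: predict_proba_for(model, doc)
            )
        for doc in batch:
            labeled.append((doc, annotate(doc)))
            remaining.remove(doc)
        model = train(labeled)  # retrain on everything annotated so far
    return model, labeled


# Toy class probabilities a model might assign to four notes (made-up values).
toy_probs = {
    "note_a": [0.98, 0.02],
    "note_b": [0.55, 0.45],
    "note_c": [0.70, 0.30],
    "note_d": [0.51, 0.49],
}
picked = select_active(list(toy_probs), 2, toy_probs.__getitem__)
print(picked)  # ['note_d', 'note_b']: the two notes closest to a coin flip
```

In this sketch the human only annotates the items where the current model is least certain, which is the intuition behind needing fewer annotations overall. A real PHI detector would score token sequences rather than whole notes, and other query strategies are possible.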

We conducted a series of controlled, systematic simulations on real-world CSRs and a publicly available dataset to evaluate the performance of the active learning pipeline. The results showed that active learning can yield comparable or even better performance than passive learning with up to 50% less training data.

A follow-up internal user study compared the active learning pipeline with a rule-based anonymization system previously in use. It showed that the active learning pipeline has the potential to reduce the overall time needed to finish PHI (protected health information) detection for 100 pages of CSRs by 80%.

Teaching to inspire efficiency wins
