How to Work Safely with Unstructured Text Data

An article by Brian Rasquinha, Associate Director, Solution Architecture, Privacy Analytics

Unstructured data sources—including text, images, audio, and video—are rich sources of information. They can drive technological innovation and power diverse use cases such as synthetic data generation, software testing, LLMs, and other generative AI.

Safe Secondary Use of Data

Many organizations focus on the safe use of unstructured text beyond its original intended purpose. This text data can come from a variety of sources, including medical records, conversation transcripts, and customer feedback forms. Healthcare companies often use unstructured text to power next-gen analytics for diagnostics and treatment or to develop generative AI to assist clinicians, improving patient outcomes.

Organizations increasingly enable innovations and findings by de-identifying or anonymizing their unstructured data. This may require:

  • De-identifying data under HIPAA to enable secondary uses of health data in the US
  • Anonymizing data under the expectations of the GDPR
  • Aligning to best practices for identifying and mitigating re-identification risks detailed in the ISO/IEC 27559 standard
  • Otherwise transforming data to meet a relevant regulatory requirement

Assessing Your De-Identification Mechanism

Regardless of their scenario, organizations must assess any unstructured de-identification tool they want to use to determine its efficacy. This assessment should consider:

  • The data and the identifiers it contains, such as:
    • Direct identifiers like name, home address, phone number, and email.
    • Indirect identifiers (or quasi-identifiers) like age, gender, race, ethnicity, or postal/ZIP code.
  • The data release context, or the contractual and security protections and access controls applied to the data that reduce the likelihood of a re-identification attempt.

De-identifying Structured vs. Unstructured Data

For structured data, identifiers can be flagged at the level of structure – for example, we’d expect that a patient’s name would be in fields labeled ‘name’ (and not in fields labeled ‘date of birth’!).

Name          Date of Birth   Gender   Visit Date
Mae Funke     09/22/1990      F        05/18/2024
Byron Bluth   11/03/1972      M        05/19/2024
...           ...             ...      ...
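
As a minimal sketch of what flagging "at the level of structure" can look like, the snippet below classifies each column by its header alone. The mapping, the category labels, and the function name `flag_columns` are assumptions made for illustration; treating the date columns as indirect identifiers is likewise an assumption and not a rule stated in this article.

```python
# Hypothetical mapping from column names to identifier categories.
# The direct / indirect labels follow the distinction drawn earlier in
# the article; which column gets which label is an illustrative assumption.
COLUMN_CATEGORIES = {
    "Name": "direct",
    "Date of Birth": "indirect",
    "Gender": "indirect",
    "Visit Date": "indirect",
}


def flag_columns(header):
    """Return {column: category}, marking unknown columns as unclassified."""
    return {col: COLUMN_CATEGORIES.get(col, "unclassified") for col in header}


header = ["Name", "Date of Birth", "Gender", "Visit Date"]
flags = flag_columns(header)
for column, category in flags.items():
    print(f"{column}: {category}")
```

Because the classification depends only on the header, every value in a column inherits its category without the values themselves being inspected.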

By contrast, the first step in de-identifying unstructured text is detecting the identifiers embedded in the text. The detection process comprises flagging individual instances of identifiers in the text data, which can then be measured or transformed.

Call notes 05/25: Mae Funke called to follow up on her visit last Saturday, asking for any updates on her file.
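
The detect-then-transform flow described above can be sketched on this call note with a few regular expressions. This is a toy illustration under stated assumptions, not the method of any particular tool: the `PATTERNS` table, the hard-coded name list, and the helper names `detect_identifiers` and `redact` are all hypothetical, and a production detector would not rely on a fixed list of names.

```python
import re

# Illustrative patterns only. The name pattern is a hard-coded list taken
# from the example table, which a real detector would not depend on.
PATTERNS = {
    "DATE": re.compile(r"\b\d{2}/\d{2}(?:/\d{4})?\b"),
    "NAME": re.compile(r"\b(?:Mae Funke|Byron Bluth)\b"),
}


def detect_identifiers(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)


def redact(text, spans):
    """Replace each detected span with its label, working right to left
    so that earlier offsets stay valid while the text changes length."""
    for start, end, label, _ in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


note = ("Call notes 05/25: Mae Funke called to follow up on her visit "
        "last Saturday, asking for any updates on her file.")
spans = detect_identifiers(note)
redacted = redact(note, spans)
print(redacted)
```

These patterns flag "05/25" and "Mae Funke" but leave the relative date "last Saturday" untouched. Misses of this kind are why the detection step has to be measured before its output is relied on.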

The Benefits of a Re-identification Risk Determination

Privacy Analytics has robust methods to account for detection performance and the impact of identifiers as part of an overall Re-identification Risk Determination (RRD), which can be performed in alignment with the expectations of HIPAA, the GDPR, or other regulations. Our RRD for Unstructured Text tailors recommendations to the text de-identification in place in your organization’s workflow rather than imposing an inflexible off-the-shelf standard.

The analysis considers

  • the protections in place for de-identified data in the context where it will be used,
  • the data flows and recipient teams or organizations,
  • the detection performance of the text de-identification tool in flagging identifiers, and
  • the nature (and degree) of the data transformations applied to detected identifiers while considering the utility needs of the de-id output.

The result is an accelerated, pragmatic assessment of the data, the text de-identification tool being used, and the existing data-sharing scenario with recommendations on how best to unblock data sharing. This assessment is supported by auditable documentation that shows how the overall workflow is defensible and data privacy compliant.
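
To make the detection-performance consideration concrete, the sketch below compares the spans a tool flagged with the spans a human annotator marked. The article does not say which metrics an RRD uses; exact-match recall and precision are assumed here because they are common choices for this kind of evaluation. The function name `detection_metrics` and the span offsets are hypothetical.

```python
def detection_metrics(gold, detected):
    """Compute exact-match recall and precision over identifier spans.

    gold and detected are sets of (start, end, label) tuples.
    Empty inputs are treated as a perfect score to avoid division by zero.
    """
    true_positives = len(gold & detected)
    recall = true_positives / len(gold) if gold else 1.0
    precision = true_positives / len(detected) if detected else 1.0
    return recall, precision


# Hypothetical annotations for a single note: three identifiers marked by
# an annotator, of which the tool flagged two. Offsets are illustrative.
gold = {(11, 16, "DATE"), (18, 27, "NAME"), (61, 74, "DATE")}
detected = {(11, 16, "DATE"), (18, 27, "NAME")}

recall, precision = detection_metrics(gold, detected)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this example the tool flags nothing incorrectly but misses one of three identifiers, so precision is 1.0 and recall is about 0.67. For de-identification, recall is usually the more closely watched figure, since every missed identifier remains in the released text.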

Why Choose Privacy Analytics

Our approach builds on the technology, methodology, knowledge, and experience we’ve gained enabling 135+ de-identified document submissions to the European Medicines Agency (EMA), Health Canada, and other regulators since 2018.

In that time, we’ve helped organizations unlock the value of their unstructured text data for diverse purposes and data-sharing scenarios. This includes healthcare data analytics, automatic transcription, powering AI virtual assistants, and linking structured data to unstructured text.

Contact the experts at Privacy Analytics to learn more about how we can help your organization improve its text de-identification workflows with our assessment and RRD services.
