Unstructured data sources—including text, images, audio, and video—are rich sources of information. They can drive technological innovation and power diverse use cases such as synthetic data generation, software testing, LLMs, and other generative AI.
Safe Secondary Use of Data
Many organizations focus on the safe use of unstructured text beyond its original intended purpose. This text data can come from a variety of sources, including medical records, conversation transcripts, and customer feedback forms. Healthcare companies often use unstructured text to power next-gen analytics for diagnostics and treatment or to develop generative AI to assist clinicians, improving patient outcomes.
Organizations increasingly de-identify or anonymize their unstructured data to enable these innovations and findings. This may require:
- De-identifying data under HIPAA to enable secondary uses of health data in the US
- Anonymizing data under the expectations of the GDPR
- Aligning to best practices for identifying and mitigating re-identification risks detailed in the ISO/IEC 27559 standard
- Otherwise transforming data to meet a relevant regulatory requirement
Assessing Your De-Identification Mechanism
Regardless of their scenario, organizations must assess any unstructured de-identification tool they want to use to determine its efficacy. This assessment should consider:
- The data and the identifiers it contains, such as:
  - Direct identifiers like name, home address, phone number, and email.
  - Indirect identifiers (or quasi-identifiers) like age, gender, race, ethnicity, or postal/ZIP code.
- The data release context, or the contractual and security protections and access controls applied to the data that reduce the likelihood of a re-identification attempt.
De-identifying Structured vs. Unstructured Data
For structured data, identifiers can be flagged at the level of structure – for example, we’d expect that a patient’s name would be in fields labeled ‘name’ (and not in fields labeled ‘date of birth’!).
| Name | Date of Birth | Gender | Visit Date |
|---|---|---|---|
| Mae Funke | 09/22/1990 | F | 05/18/2024 |
| Byron Bluth | 11/03/1972 | M | 05/19/2024 |
| ... | ... | ... | ... |
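Flagging at the level of structure can be sketched as a simple lookup on column labels. The example below is a minimal illustration, not a description of any particular tool: the `COLUMN_CATEGORIES` mapping, the function name, and the category assigned to each column are assumptions made for this sketch, and a real assessment would classify columns based on the data and its release context.

```python
# Hypothetical mapping from column labels to identifier categories.
# The category assignments are illustrative assumptions only.
COLUMN_CATEGORIES = {
    "Name": "direct",
    "Date of Birth": "indirect",
    "Gender": "indirect",
    "Visit Date": "indirect",
}


def flag_identifier_columns(header):
    """Classify each column by its label; unknown labels are marked for review."""
    return {col: COLUMN_CATEGORIES.get(col, "unclassified") for col in header}


header = ["Name", "Date of Birth", "Gender", "Visit Date"]
flagged = flag_identifier_columns(header)

for column, category in flagged.items():
    print(f"{column}: {category}")
```

Because the label tells us what each field holds, every value in a flagged column can be treated the same way without inspecting the values themselves.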
By contrast, the first step in de-identifying unstructured text is detecting the identifiers embedded in the text. The detection process comprises flagging individual instances of identifiers in the text data, which can then be measured or transformed.
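As a rough illustration of detection followed by transformation, the sketch below flags a few pattern-like identifiers with regular expressions and then masks them. The patterns, labels, and sample note are all assumptions made for this example. Regular expressions only catch identifiers with a predictable shape, so names and other free-form identifiers, which generally need a model-based detector, are deliberately left out.

```python
import re

# Illustrative patterns only; real tools combine many detectors.
PATTERNS = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def detect_identifiers(text):
    """Return (start, end, label, matched_text) spans sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)


def mask(text, spans):
    """Replace each detected span with its label (assumes spans do not overlap)."""
    pieces = []
    cursor = 0
    for start, end, label, _ in spans:
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Patient seen on 05/18/2024. Call 555-010-1234 or email mae@example.com to follow up."
spans = detect_identifiers(note)
masked = mask(note, spans)
print(masked)
```

Keeping detection (the spans) separate from transformation (the masking) means the same detected spans can be counted, reviewed, or transformed in different ways depending on the utility needs of the output.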
The Benefits of a Re-identification Risk Determination
Privacy Analytics has robust methods to account for detection performance and the impact of identifiers as part of an overall Re-identification Risk Determination (RRD), which can be performed in alignment with the expectations of HIPAA, the GDPR, or other regulations. Our RRD for Unstructured Text tailors recommendations to the text de-identification in place in your organization’s workflow rather than imposing an inflexible off-the-shelf standard.
The analysis considers:
- the protections in place for de-identified data in the context where it will be used,
- the data flows and recipient teams or organizations,
- the detection performance of the text de-identification tool in flagging identifiers, and
- the nature (and degree) of the data transformations applied to detected identifiers while considering the utility needs of the de-id output.
The result is an accelerated, pragmatic assessment of the data, the text de-identification tool being used, and the existing data-sharing scenario with recommendations on how best to unblock data sharing. This assessment is supported by auditable documentation that shows how the overall workflow is defensible and data privacy compliant.
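Detection performance, one of the inputs listed above, is commonly summarized as recall and precision against a manually annotated sample. The sketch below shows one generic way to compute those figures and is not a description of the RRD methodology. The span format, the exact-match rule, and the sample annotations are assumptions made for illustration.

```python
def detection_metrics(gold, predicted):
    """Compare predicted identifier spans with gold annotations.

    Spans are (start, end, label) tuples; a prediction counts as correct
    only if it matches a gold span exactly (an assumption of this sketch).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    recall = true_positives / len(gold) if gold else 1.0
    precision = true_positives / len(predicted) if predicted else 1.0
    return {
        "recall": recall,
        "precision": precision,
        "missed": sorted(gold - predicted),
    }


# Hypothetical annotations for a single document.
gold = [(0, 9, "NAME"), (24, 34, "DATE"), (45, 57, "PHONE")]
predicted = [(24, 34, "DATE"), (45, 57, "PHONE"), (60, 64, "DATE")]

metrics = detection_metrics(gold, predicted)
print(metrics)
```

Recall is usually the figure to watch from a privacy standpoint, since each missed span is an identifier that remains in the output, while low precision mainly costs data utility.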
Why Choose Privacy Analytics
Our approach builds on the technology, methodology, knowledge, and experience we’ve gained enabling 135+ de-identified document submissions to the European Medicines Agency (EMA), Health Canada, and other regulators since 2018.
In that time, we’ve helped organizations unlock the value of their unstructured text data for diverse purposes and data-sharing scenarios. These include healthcare data analytics, automatic transcription, AI virtual assistants, and linking structured data to unstructured text.
Contact the experts at Privacy Analytics to learn more about how we can help your organization improve its text de-identification workflows with our assessment and RRD services.