Demand for de-identified free-text data has been steadily increasing as organizations seek to drive research, product development, and organizational insights. De-identified free text is especially valuable for generative AI training or as a source for NLP and extracted insights, where the extraction tooling and pipelines vary and are customizable.
In contrast to tabular or structured data, the key challenge in de-identifying text is detection — specifically, determining which words or phrases are identifiers that should be assessed for re-identification risk and considered for transformation. In large-scale research and product development, detection almost always relies on an automated software tool that flags identifiers in the data, effectively adding structure that enables more familiar de-identification methods.
The detection process is critical to free-text de-identification: If identifiers aren’t detected, they can’t be assessed or transformed. Detection performance is evaluated through an annotation process in which a random, representative sample of documents from the data to be de-identified is thoroughly examined. It’s vital that this annotation sample be independent of any data used to configure or train the detection process.
The annotation sample is marked up (usually by manual reviewers) to create a gold standard for detection, which is then compared with the output of the detection process. The comparison characterizes detection performance, revealing what was:

- Correctly detected (true positives)
- Missed by the detection process (false negatives)
- Incorrectly flagged as an identifier (false positives)
The annotation process quantifies detection performance on the annotation sample and, more importantly, enables statistical modeling of how that performance generalizes across the full set of source documents.
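The comparison between the gold standard and the detector's output can be sketched in a few lines. This is a minimal illustration, not Privacy Analytics' actual tooling: it assumes identifiers are represented as character-offset spans, whereas real evaluations often use token- or entity-level matching with identifier type labels.

```python
def detection_metrics(gold_spans, detected_spans):
    """Compare annotated (gold) spans against detector output.

    Returns recall, precision, and the list of missed identifiers
    (the undetected information that must be modeled separately).
    """
    gold = set(gold_spans)
    detected = set(detected_spans)
    true_positives = gold & detected    # flagged by both annotators and detector
    false_negatives = gold - detected   # identifiers the detector missed
    recall = len(true_positives) / len(gold) if gold else 1.0
    precision = len(true_positives) / len(detected) if detected else 1.0
    return recall, precision, sorted(false_negatives)

# Illustrative document: 4 annotated identifiers; the detector finds
# 3 of them and also flags 1 span that is not an identifier.
gold = [(0, 8), (15, 22), (40, 47), (60, 66)]
found = [(0, 8), (15, 22), (40, 47), (70, 75)]
recall, precision, missed = detection_metrics(gold, found)
# recall = 0.75, precision = 0.75, missed = [(60, 66)]
```

The recall figure is what vendors typically quote as a "detection rate"; the `missed` list is the raw material for modeling residual risk.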
Clearly, an effective detection process is critical to ensuring defensible de-identification, but how effective does it need to be? If a detection and de-identification tool vendor claims a 98% detection rate, how does that translate into the privacy of the output documents? How much evidence is required to characterize detection performance?
Answering these questions can significantly impact a de-identification initiative. Chasing incremental performance improvements in a detection tool is a challenging task that is often unpredictable in terms of time, cost, and the magnitude of output improvements. Detection performance assessment becomes more precise with larger annotation samples, but larger samples also require more effort, time, and cost.
Most de-identification experts use detection rates from annotation to inform their models of re-identification risk, quantify the impact of detection on overall risk, and set an appropriate detection target.
Privacy Analytics’ proprietary text de-identification methods build on this approach, improving the annotation process to capture not only what is detected but also what goes undetected. The same annotation activity that flags where detection succeeds also naturally flags where the process is lacking, cataloging examples and the rates at which misses occur. Both categories of information are used to quantify the re-identification risk of the output data under different transformation approaches. The detected information is measured directly, and the undetected information is modeled based on the findings from the annotation.
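One simplified way to see how a miss rate translates into residual risk is to scale it by identifier density and corpus size. The linear model below is an illustrative assumption, not the statistical modeling the text describes; the rates and counts are hypothetical.

```python
def expected_residual_identifiers(recall, identifiers_per_doc, num_docs):
    """Expected number of identifiers that survive de-identification
    because the detector never flagged them (simplified linear model)."""
    miss_rate = 1.0 - recall
    return miss_rate * identifiers_per_doc * num_docs

# Hypothetical figures: a 98% detection rate sounds strong, but across
# ~12 identifiers per document and 50,000 documents it still leaves
# roughly 12,000 identifiers untransformed in the output.
leaked = expected_residual_identifiers(0.98, 12, 50_000)
```

This is why a headline detection rate alone cannot answer the "how private is the output?" question: the residual count depends on the data's identifier density and volume as well.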
Our advanced workflow also quantifies measurement uncertainty, ensuring the numbers remain defensible. Statistically, the findings of an annotation contain a range of uncertainty, with a wider range for smaller samples and a narrower range for larger samples. Modeling using the poorer end of the detection performance range ensures defensibility. This gives the de-identification expert another parameter to work with when assessing re-identification risks. They might:

- Enlarge the annotation sample to narrow the uncertainty range, at the cost of additional annotation time and effort
- Accept a wider range and model risk from its conservative end, compensating with stronger transformations of the output data
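The sample-size effect on uncertainty can be sketched with a conservative lower bound on an observed detection rate. The Wilson score interval used here is a common choice for proportions, shown as an assumption for illustration; the text does not specify which statistical model is used, and the sample sizes are hypothetical.

```python
import math

def wilson_lower_bound(detected, total, z=1.96):
    """Lower end of the Wilson score interval (~95% confidence) for a
    detection rate of `detected` successes out of `total` annotated
    identifiers. Modeling from this poorer end keeps claims defensible."""
    p = detected / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    )
    return center - margin

# The same observed 98% rate is far more defensible with a larger sample:
small = wilson_lower_bound(98, 100)    # lower bound ~0.930
large = wilson_lower_bound(980, 1000)  # lower bound ~0.969
```

With 100 annotated identifiers, a defensible model must assume detection could be as low as roughly 93%; with 1,000, the conservative bound tightens to about 97%, which is the trade-off between annotation effort and precision described above.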
With this more advanced workflow for re-identification risk assessment and de-identification, we can tailor the approach not only to the initiative’s utility needs but also to the organization’s practical constraints, such as timelines, budgets, and predictability. This workflow accurately and defensibly defines “good enough” and supports a flexible approach that aligns with the organization’s needs.
When integrated into a risk-based de-identification workflow, this advanced approach enables high-fidelity modeling of re-identification risks and quantifies the effects of transformations on text data. Our method helps organizations make evidence-based decisions about investing in improved tooling or confirms that existing measures adequately mitigate risk.
Contact Privacy Analytics to learn more about how our advanced approaches to de-identifying unstructured data can empower your organization.