Dealing with Data Variety in Healthcare

Healthcare does not always rely on discrete pieces of data. While many elements captured in an EHR or used in insurance claim and billing systems are structured data, much of what is generated on a patient’s journey through the healthcare system takes the form of XML documents, narrative text or images, which are semi-structured or unstructured. Unless the patient has consented, all PHI must be de-identified, regardless of its format, before it can be legally used for secondary purposes.

With structured data, locating direct identifiers (e.g., name) or indirect identifiers (e.g., age) in the data is a relatively straightforward process for anyone knowledgeable in de-identification methodologies. Database fields are generally labelled to show the information that they hold, so, if you know what to look for, it is easy enough to find the identifying fields.
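As a rough illustration of why labelled fields make this step manageable, the sketch below classifies columns purely by name. The two lookup sets and the function name classify_fields are hypothetical; a real project would maintain its own lists and would still need an expert to review the result.

```python
# Hypothetical field-name lookups, for illustration only.
DIRECT = {"name", "first_name", "last_name", "email", "phone", "health_card_number"}
INDIRECT = {"age", "date_of_birth", "sex", "postal_code", "zip_code", "admission_date"}


def classify_fields(columns):
    """Label each column as 'direct', 'indirect' or 'other' based on its name."""
    labels = {}
    for col in columns:
        key = col.strip().lower()
        if key in DIRECT:
            labels[col] = "direct"
        elif key in INDIRECT:
            labels[col] = "indirect"
        else:
            labels[col] = "other"
    return labels


print(classify_fields(["Name", "Age", "Diagnosis"]))
```

Nothing comparable is available for free text, where no label announces that a given word is a name.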

Capturing identifiers in unstructured data, however, is more complex. Direct and indirect identifiers are not labelled as such; therefore, locating identifiers in the text is a far more process-intensive task. While the inclusion of unstructured data in Big Data Analytics (BDA) is a dimension that provides a rich source of information for analysis, it raises the bar for privacy.

De-Identifying Unstructured Data

To begin, let’s look briefly at the techniques used to de-identify text data. The aim of text anonymization is to find any personally identifiable information in the text, that is, any direct or indirect identifiers, and then remove or transform it. There are a number of techniques to accomplish this task.

  1. Redaction: The simplest approach to de-identification, it amounts to “blacking out” or masking the data. Identifiers are replaced with a set of characters, like “****”. Redaction leaves no indication of the type of information that was removed.
  2. Tagging: This replaces the element with the type of information that was removed. For example, the name “Amy” would be replaced by <FirstName>. We can go further, indexing instances of the same name so that “Amy” appears as <FirstName:1> and a new name, e.g. “Bob”, would be tagged as <FirstName:2>. This lets us recognize references to the same individual across the document.
  3. Randomization: Similar to its use with structured data, this approach replaces every instance of a direct identifier, e.g. the name “Amy”, with a randomly selected value according to the tag type. In this case “Amy” would be replaced by a value from the names database, such as “Louise”.
  4. Generalization: This approach is used more commonly with indirect identifiers where a generalized value is used to replace the original value. If we have a patient birth date of May 18, 1976, it could be generalized to May 1976 or 1976.
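
The four techniques above can be sketched in a few lines of Python. This is a minimal illustration, not a production method: it assumes the names to replace have already been detected (which is the hard part), and every function name and the surrogate pool are made up for the example.

```python
import random
import re
from datetime import date


def _replace(text, mapping):
    """Replace whole-word occurrences of each key with its value in a single pass."""
    if not mapping:
        return text
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def redact(text, names):
    """Redaction: mask each identifier, leaving no hint of what was removed."""
    return _replace(text, {name: "****" for name in names})


def tag(text, names, label="FirstName"):
    """Tagging: replace each distinct name with an indexed tag, e.g. <FirstName:1>."""
    return _replace(text, {n: f"<{label}:{i}>" for i, n in enumerate(names, start=1)})


def randomize(text, names, pool, seed=None):
    """Randomization: swap each distinct name for a surrogate drawn from a pool.

    Assumes the pool has at least as many entries as there are names.
    """
    rng = random.Random(seed)
    surrogates = rng.sample(pool, len(names))
    return _replace(text, dict(zip(names, surrogates)))


def generalize_date(d, level="month"):
    """Generalization: coarsen a full date to month and year, or to year only."""
    return d.strftime("%B %Y") if level == "month" else str(d.year)


note = "Amy saw Bob. Amy left."
print(redact(note, ["Amy", "Bob"]))
print(tag(note, ["Amy", "Bob"]))
print(randomize(note, ["Amy", "Bob"], pool=["Louise", "Karen", "Omar"], seed=0))
print(generalize_date(date(1976, 5, 18)))
print(generalize_date(date(1976, 5, 18), level="year"))
```

The replacements are done in one pass so that a surrogate value is never itself replaced by a later substitution.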

Precision and Recall

When dealing with free-text data, the threshold for detecting identifiers, in particular direct identifiers, must be extraordinarily high. It is not sufficient to catch 80%, or even 90%, of the names contained in a document. If we have a document containing 10 first names, for example, finding and de-identifying 90% of those names means that we have one name that remains in the document. This single missed name counts as a breach under current privacy legislation.

The proportion of the personal identifiers that are found in the text is referred to as recall. In the above case, the recall was 90%. In healthcare, where the inclusion of even one name is considered a breach, we must have very high recall. In fact, it is necessary to take an “all or nothing” approach to evaluation. Finding nine of 10 names does not give us a 90% success rate; rather, it is a 100% failure rate.
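
To make the arithmetic concrete, here is a small sketch of both views of the same result. The function names and the list of names are invented for illustration.

```python
def recall(true_identifiers, detected):
    """Fraction of the true identifiers that the system found."""
    true_set = set(true_identifiers)
    if not true_set:
        return 1.0  # nothing to find, so nothing was missed
    return len(true_set & set(detected)) / len(true_set)


def passes_all_or_nothing(true_identifiers, detected):
    """A document passes only if every true identifier was found."""
    return set(true_identifiers) <= set(detected)


names_in_document = ["Amy", "Bob", "Charles", "Dana", "Eli",
                     "Fay", "Gus", "Hana", "Ivan", "Joy"]
names_found = names_in_document[:9]  # the system missed "Joy"

print(recall(names_in_document, names_found))                 # 0.9
print(passes_all_or_nothing(names_in_document, names_found))  # False
```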

While it may seem like the solution is to simply cast a wide net to capture all possible identifiers, care must be exercised or else the quality and usefulness of the data will be negatively impacted. Precision is the proportion of the items flagged as identifiers that really are identifiers. In other words, how often is information that is not identifying being redacted or randomized? Low precision means that the system is redacting information that it does not need to. If low recall means that we fail to redact “Charles”, low precision means we have redacted non-identifying words like “cough”, “chronic” or “cancer”, information that could be useful to the analysis.
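
The trade-off can be shown with a toy example in which an over-eager detector flags every capitalized or unfamiliar word. The function name and the word lists are assumptions made for the illustration.

```python
def precision(true_identifiers, flagged):
    """Fraction of the flagged items that really are identifiers."""
    flagged_set = set(flagged)
    if not flagged_set:
        return 1.0  # nothing was flagged, so nothing was flagged wrongly
    return len(flagged_set & set(true_identifiers)) / len(flagged_set)


true_identifiers = ["Charles"]
flagged = ["Charles", "cough", "chronic", "cancer"]  # the wide net

print(precision(true_identifiers, flagged))  # 0.25
```

Here the one real name was caught, but three of the four redactions removed clinical terms that an analyst would have wanted to keep.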

The measures of recall and precision are further challenged by the idiosyncrasies of medical information. Drug and disease names often have many variants. A heart attack can be referred to as a myocardial infarction, a coronary infarction or simply by the abbreviation M.I. Drugs can have multiple brand names as well as their generic name. Abbreviations abound in medicine and some of these may be confused with postal or zip codes, like the S1S2S4 nomenclature denoting heart sounds or C1C2T1 to indicate specific vertebrae of the spine. Add to this the typos that occur with rapid data input and the confounding factor of negations (e.g., “results indicate a lack of evidence for the condition…”), and it is apparent that the de-identification of unstructured medical data requires expert know-how and ability.
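
The postal code confusion is easy to reproduce. The sketch below uses a simplified pattern for the shape of a Canadian postal code together with a hypothetical allowlist of clinical shorthand; both are assumptions for the example, and an allowlist is a crude fix because a genuine postal code could share the same characters.

```python
import re

# Simplified shape of a Canadian postal code:
# letter-digit-letter, optional space, digit-letter-digit.
POSTAL_CODE = re.compile(r"\b[A-Z]\d[A-Z] ?\d[A-Z]\d\b")

# Hypothetical allowlist of clinical abbreviations that share that shape.
CLINICAL_ABBREVIATIONS = frozenset({"S1S2S4", "C1C2T1"})


def find_postal_codes(text, allowlist=frozenset()):
    """Return pattern matches, skipping any whose compact form is allowlisted."""
    return [
        m.group()
        for m in POSTAL_CODE.finditer(text)
        if m.group().replace(" ", "") not in allowlist
    ]


note = "Heart sounds S1S2S4 noted. Patient lives near K1A 0B1."

print(find_postal_codes(note))                          # ['S1S2S4', 'K1A 0B1']
print(find_postal_codes(note, CLINICAL_ABBREVIATIONS))  # ['K1A 0B1']
```

Without the allowlist, the heart-sound notation would be redacted as if it were an address, lowering precision and removing a clinically relevant finding.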

A robust solution that can address the challenges of de-identifying text data, along with structured medical data, is required for health data to be disclosed for secondary uses. A risk-based approach to de-identification is applicable to both structured and unstructured data, allowing re-identification risks to be managed enterprise-wide. It allows an automated and scalable process to be put in place, providing a level of privacy that could not be reliably achieved using manual text de-identification processes.


Dealing with Data Variety in Healthcare is the fourth in the Big Data Analytics Series by Privacy Analytics. Next week: The Future of Big Data Analytics.
