If you’ve been involved in anonymization projects (such as Expert Determinations under HIPAA), you may have been asked to describe the data context in detail: how the data will be used, where it will reside, and who will be accessing it. This article explains a couple of reasons why context matters and why environmental controls play an important role in the analysis and the resulting recommendations from a statistical expert.
Blending into the Crowd
First, context matters because it allows the statistical expert to figure out which fields in the data can be used to identify people.
When individuals have been properly de-identified or anonymized, we can think of them as blending into the crowd of other individuals in their population. Put more simply, we are targeting datasets with very few or no “unicorns” – no individuals with highly unique features that could be isolated and become easier targets for re-identification attempts.
But which features should we consider when evaluating uniqueness? In practice, we want to assess only identifiers, which are fields that an adversary could realistically use in an attack. Identifiers generally must satisfy three conditions. They must be:
- Replicable, or stable over some reasonable amount of time
- Distinguishing, in that they take different values for different individuals in the dataset, and
- Knowable, in that an adversary can learn this piece of information outside of the dataset, to compare to a person in the dataset and match information in a re-identification attempt.
It’s this last point of knowability that can depend strongly on the context of a data release. HIPAA frames the risk of re-identification in terms of the “anticipated recipient,” so the key question is what an anticipated recipient knows, rather than what is knowable by an arbitrary person (or a person selected in a worst-case scenario). If a recipient can’t reasonably know the information in a field, that field would not be considered an identifier: it cannot be used to differentiate an individual in the dataset and make them a unicorn.
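To make the effect of knowability concrete, here is a minimal sketch of how the number of unicorns can change depending on which fields are treated as identifiers. The records, the field names (`age_band`, `region`, `lab_code`), and the helper functions are all hypothetical and invented for illustration; a real assessment would use a proper risk model rather than a simple count of unique records.

```python
from collections import Counter


def group_sizes(records, identifiers):
    """For each record, return how many records share its values
    on the chosen identifier fields (including the record itself)."""
    keys = [tuple(record[field] for field in identifiers) for record in records]
    counts = Counter(keys)
    return [counts[key] for key in keys]


def count_unicorns(records, identifiers):
    """Count records that are unique on the chosen identifier fields."""
    return sum(1 for size in group_sizes(records, identifiers) if size == 1)


# Hypothetical toy dataset; values are made up for illustration.
records = [
    {"age_band": "30-39", "region": "North", "lab_code": "A1"},
    {"age_band": "30-39", "region": "North", "lab_code": "B7"},
    {"age_band": "40-49", "region": "South", "lab_code": "C3"},
    {"age_band": "40-49", "region": "South", "lab_code": "D9"},
]

# Assumption: the anticipated recipient cannot reasonably know lab_code,
# so only age_band and region are treated as identifiers.
unicorns_recipient = count_unicorns(records, ["age_band", "region"])

# Assumption: a different context in which lab_code is also knowable.
unicorns_if_knowable = count_unicorns(records, ["age_band", "region", "lab_code"])

print(unicorns_recipient, unicorns_if_knowable)
```

In this toy example, every record blends into a group of two when `lab_code` is not knowable, but every record becomes unique once it is. The data has not changed; only the assumed context has.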
Chances of a Re-identification Attempt
Understanding context is also important because it allows the statistical expert to gauge the likelihood of a re-identification attempt happening at all.
When individuals in a dataset are re-identified, we can think of this occurring in two sequential steps:
- A re-identification attempt is made
- That attempt is successful
The context of the data release can affect the likelihood of the first step. For example, strong electronic or physical security controls, strong contractual controls, and good policies can all reduce the likelihood of a re-identification attempt, because they make it more difficult for an adversary to access the data they want to re-identify. Likewise, if people with access to the data are not likely to be motivated or able to attempt re-identification, this too reduces the likelihood of an attempt. Used together, multi-layered safeguards, stringent privacy practices, and risk modeling can preserve strong data utility while maintaining robust privacy protections.
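One simple way to express the two sequential steps is to multiply the probability of an attempt by the probability that an attempt succeeds. The sketch below assumes exactly that decomposition; the function name and the probability values are hypothetical and chosen only to show the arithmetic, not taken from any real assessment.

```python
def reidentification_risk(p_attempt, p_success_given_attempt):
    """Overall risk as the product of two sequential steps:
    an attempt is made, and that attempt succeeds."""
    for p in (p_attempt, p_success_given_attempt):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return p_attempt * p_success_given_attempt


# Hypothetical values for illustration only.
p_success = 0.5  # chance an attempt succeeds, driven by the data itself

# Assumption: strong controls make an attempt unlikely.
controlled_release = reidentification_risk(0.25, p_success)

# Assumption: with no controls, an attempt is treated as certain.
open_release = reidentification_risk(1.0, p_success)

print(controlled_release, open_release)
```

Under these assumed numbers, the same data carries a quarter of the risk in the controlled context, which is why the release context can change how much transformation the data itself needs.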
In sum, while the specific data elements and granularity in a dataset affect how identifiable it is, another key component in assessing identifiability is the data release context. By accurately characterizing context, a statistical expert can account for the protective effects of a well-protected data release context (or equally, the increased threat of an under- or unprotected release context), ensuring that the data transformations are fit for purpose and the overall identifiability is well-managed.
Contact the experts at Privacy Analytics to learn more about this or other privacy topics.