Why Context Matters When Anonymizing Data

An article by Brian Rasquinha, Associate Director, Solution Architecture, Privacy Analytics

If you’ve been involved in anonymization projects (such as Expert Determinations under HIPAA), you may have been asked to describe the data context in detail: how the data will be used, where it will reside, and who will be accessing it. This article explains two reasons why context matters and why environmental controls play an important role in a statistical expert’s analysis and resulting recommendations.

Blending into the Crowd

First, context matters because it allows the statistical expert to figure out which fields in the data can be used to identify people.

When individuals have been properly de-identified or anonymized, we can think of them as blending into the crowd of other individuals in their population. Put more simply, we are targeting datasets with very few or no “unicorns” – no individuals with highly unique features that could be isolated and become easier targets for re-identification attempts.

But which features should we consider when evaluating uniqueness? In practice, we want to assess only identifiers, which are fields that are pragmatically attackable by an adversary. Identifiers generally must satisfy three conditions, illustrated in the sketch after this list. They must be

  • Replicable, or stable over some reasonable amount of time;
  • Distinguishing, in that their values differ between individuals in the dataset; and
  • Knowable, in that an adversary can learn this information outside of the dataset and match it against a person in the dataset in a re-identification attempt.
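
As a minimal, hypothetical sketch of how these criteria might be applied (the field names and the yes/no judgments below are invented for illustration; in a real Expert Determination those judgments are made by the statistical expert for the specific release), one could record the three properties per field and keep only the fields that satisfy all three:

```python
# Hypothetical sketch: classify fields as identifiers using the three
# criteria above. Field names and judgments are invented for illustration.

FIELDS = {
    # field name:         (replicable, distinguishing, knowable by recipient)
    "date_of_birth":      (True,  True,  True),   # stable, varies, learnable
    "zip_code":           (True,  True,  True),
    "lab_result_value":   (False, True,  False),  # fluctuates over time
    "internal_record_id": (True,  True,  False),  # recipient can't learn it
}

def identifiers(fields):
    """Return the fields that satisfy all three identifier conditions."""
    return [name for name, props in fields.items() if all(props)]

print(identifiers(FIELDS))  # ['date_of_birth', 'zip_code']
```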

It’s this last point of knowability that can depend strongly on the context of a data release. HIPAA frames the risk of re-identification in terms of the “anticipated recipient,” so the key question is what an anticipated recipient knows, rather than what is knowable by an arbitrary person (or a person selected in a worst-case scenario). If a recipient can’t reasonably know the information in a field, that field is not an identifier: it cannot differentiate an individual in the dataset or make them a unicorn.
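
To make the “unicorn” idea concrete, here is a minimal sketch with fabricated records and an assumed set of identifier fields. It counts how many individuals share each combination of identifier values; anyone in a group of size one does not blend into any crowd:

```python
from collections import Counter

# Minimal sketch with fabricated records: count how many individuals share
# each combination of identifier values. A group of size 1 is a "unicorn".
IDENTIFIERS = ["year_of_birth", "zip3", "sex"]  # assumed for illustration

records = [
    {"year_of_birth": 1980, "zip3": "100", "sex": "F", "diagnosis": "A"},
    {"year_of_birth": 1980, "zip3": "100", "sex": "F", "diagnosis": "B"},
    {"year_of_birth": 1980, "zip3": "100", "sex": "F", "diagnosis": "C"},
    {"year_of_birth": 1975, "zip3": "902", "sex": "M", "diagnosis": "A"},
]

def key(r):
    """The combination of identifier values for one record."""
    return tuple(r[f] for f in IDENTIFIERS)

group_sizes = Counter(key(r) for r in records)

unicorns = [r for r in records if group_sizes[key(r)] == 1]
print(len(unicorns), "record(s) with a unique identifier combination")  # 1
```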

Chances of a Re-identification Attempt

Understanding context is also important because it allows the statistical expert to gauge the likelihood of a re-identification attempt happening at all.

When individuals in a dataset are re-identified, we can think of this occurring in two sequential steps:

  • A re-identification attempt is made, and
  • That attempt is successful.

The context of the data release can affect the likelihood of the first step. For example, strong electronic or physical security controls, strong contractual controls, and good policies can all reduce the likelihood of a re-identification attempt, because they make it more difficult for an adversary to access the data they want to re-identify. Likewise, if the people with access to the data are unlikely to be motivated or able to attempt re-identification, this too reduces the likelihood of an attempt. Multi-layered safeguards, stringent privacy practices, and risk modeling can together allow for strong data utility while maintaining robust privacy protections.
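
This two-step framing implies a simple multiplicative relationship: the overall probability of re-identification is the probability that an attempt is made, times the probability that the attempt succeeds. The numbers below are invented purely to illustrate how controls that lower the first factor lower the overall risk:

```python
# Illustrative arithmetic only; the probabilities are invented.
# Two-step framing: overall risk = P(attempt) * P(success | attempt).

def overall_risk(p_attempt, p_success_given_attempt):
    return p_attempt * p_success_given_attempt

# Same dataset (same chance of success if attacked), different contexts:
weak_controls   = overall_risk(p_attempt=0.30, p_success_given_attempt=0.10)
strong_controls = overall_risk(p_attempt=0.05, p_success_given_attempt=0.10)

print(f"weak controls:   {weak_controls:.3f}")    # 0.030
print(f"strong controls: {strong_controls:.3f}")  # 0.005
```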

In sum, while the specific data elements and their granularity affect how identifiable a dataset is, the data release context is another key component of assessing identifiability. By accurately characterizing that context, a statistical expert can account for the protective effects of a well-controlled release (or, equally, the increased threat of an under-protected or unprotected one), ensuring that the data transformations are fit for purpose and the overall identifiability is well-managed.

Contact the experts at Privacy Analytics to learn more about this or other privacy topics.
