Khaled El Emam: New BMJ Blog Contributor

Khaled El Emam is now a BMJ Blog contributor.

Anonymization and Creepy Analytics | October 22, 2013

When health data is shared for secondary purposes, such as research, there are always concerns about patient privacy. Data custodians want, at a minimum, to meet the requirements of the laws and regulations that apply to them. One option for sharing data is to anonymize it beforehand. But anonymization does not protect against stigmatizing analytics, which are often seen as a form of privacy violation.

Stigmatizing analytics are inferences from the data that may have a negative impact on a data subject or, more often, a group of data subjects. The impact would occur through decisions made on the basis of those inferences, and it may be social, financial, reputational, or psychological. For example, an inference that individuals living close to an industrial site have a higher-than-expected incidence of cancer may make these individuals less employable (because insuring them would push up an employer's group insurance premiums), and publication of such a finding could dramatically reduce their property values.

Sometimes inferences from data are referred to as "creepy". This is essentially the same idea, except that the impact on the data subjects is that they feel violated. For example, if a supermarket can determine that you are pregnant before your family does based on changes in your purchasing behaviour, or a website can determine your sexual orientation from the sites you visit or whom you friend on a social network (and starts serving you advertisements accordingly), that would be creepy.

Stigmatizing analytics can occur even for individuals not in the dataset. For example, consider an anonymized dataset that was analysed to build a regression model that allows one to make inferences about a group of individuals. The model can then be used to make predictions about any individual irrespective of whether their data was in the original dataset or not. As long as the relevant values are available as input into the regression model, a prediction can be made.
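The point about individuals outside the dataset can be made concrete with a small sketch. The data and model below are entirely hypothetical (the variable names, values, and the one-predictor least-squares fit are illustrative, not from the post): a regression fitted on anonymized records can score anyone for whom the input value is known, whether or not they were in the original data.

```python
# Hypothetical sketch: a model fit on an anonymized dataset can be used
# to make a prediction about ANY individual, not just those in the data.

def fit_simple_regression(xs, ys):
    """Ordinary least squares for one predictor: y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical anonymized records: distance from an industrial site (km)
# versus observed cancer incidence per 10,000 residents.
distance_km = [0.5, 1.0, 2.0, 4.0, 8.0]
incidence = [30.0, 26.0, 20.0, 12.0, 6.0]

a, b = fit_simple_regression(distance_km, incidence)

# Predict for a resident who was never in the dataset: only their
# distance to the site is needed to draw an inference about them.
new_resident_km = 1.5
predicted_incidence = a + b * new_resident_km
```

The prediction requires nothing from the new resident except the model's input value, which is exactly why anonymizing the training data does not prevent this kind of inference.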

This regression model can be built on properly anonymized data. Anonymization, as defined in contemporary laws and regulations, does not protect against such inferences. Anonymization is only concerned with ensuring that the identity of the data subjects cannot be determined. As long as that identity has a very small probability of being determined from the data, then the anonymization requirements are met.

There are two general ways to address the risks from stigmatizing analytics. One way is to modify the data to make it difficult to draw inferences. For example, one can add noise to the data or suppress records or values. This, however, will often result in a dataset that is not very analytically useful because the ability to draw inferences would have to be, by definition, curtailed.
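The two data modifications mentioned above, adding noise and suppressing values, can be sketched as follows. This is a minimal illustration with hypothetical parameters (the noise scale, the suppression threshold, and the toy age list are assumptions), not a prescribed anonymization method.

```python
import random

def add_noise(values, scale, seed=0):
    """Perturb each numeric value with zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]

def suppress_rare(values, min_count=3):
    """Suppress (replace with None) values occurring fewer than min_count times."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return [v if counts[v] >= min_count else None for v in values]

# Toy example: a rare, revealing value (age 91) gets suppressed,
# and the remaining values can be perturbed with noise.
ages = [34, 34, 34, 35, 35, 35, 91]
suppressed = suppress_rare(ages)
noisy = add_noise([a for a in suppressed if a is not None], scale=2.0)
```

Both operations blunt exactly the signal an analyst would want, which is the trade-off the paragraph above describes: the harder it is to draw inferences, the less analytically useful the dataset becomes.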

The second approach is to use proper governance mechanisms. In the context of research, an ethics committee would normally review protocols to evaluate the risk of group harm and stigma that may affect study participants. Outside the research context, however, these types of review committees would need to be created. They would evaluate analysis protocols to determine what types of models would be developed on the data and how these models would be used (for example, what kinds of decisions would be made using them). The objective of such a committee is to ensure that model development and use are consistent with prevailing cultural and social norms, and to assess the potential negative impacts on individuals, whether they are in the dataset or not. That is the only practical way to manage the risks from stigmatizing analytics.
