Using Linked Data for Big Data Analytics

As more data is linked from various sources (EHRs, financial data, social media, form data, wearables and apps), ordinary personally identifiable information (PII) quickly becomes protected health information (PHI). The challenge with linked data is that aggregating multiple datasets means the combined data contains a significant amount of PHI.

Linked data will include many direct identifiers, such as Social Security numbers, health insurance numbers or usernames. It will also include numerous indirect identifiers, such as postal code, profession, date of birth and diagnoses. De-identifying data requires masking or removing the direct identifiers so that these fields cannot be used to easily re-identify someone. Indirect identifiers, on the other hand, are useful for analysis and need to be retained: they provide valuable insights into regional variations, socioeconomic impacts or behavioral influences on health. These fields will still be de-identified, but it is desirable to keep as much specificity in them as possible for analytic purposes. However, the more indirect identifiers there are associated with an individual, the easier it will be to re-identify that person.
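To make the distinction concrete, here is a minimal sketch of masking direct identifiers while leaving indirect identifiers untouched. The field names, the example key and the choice of a keyed hash (which keeps records linkable across datasets) are illustrative assumptions, not a description of any particular product's method.

```python
import hashlib
import hmac

# Hypothetical field names for the direct identifiers named in the text.
DIRECT_IDENTIFIERS = {"ssn", "health_insurance_number", "username"}

# Placeholder key for illustration only; a real key would be kept secret.
EXAMPLE_KEY = b"example-key-not-for-production"


def mask_direct_identifiers(record, secret_key, drop=False):
    """Return a copy of `record` with direct identifiers masked or removed.

    Masking uses a keyed hash (HMAC-SHA256) so the same input always maps
    to the same pseudonym. Indirect identifiers are copied unchanged.
    """
    masked = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            if drop:
                continue  # removal: leave the field out entirely
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()[:16]
        else:
            masked[field] = value
    return masked


record = {
    "ssn": "123-45-6789",
    "health_insurance_number": "HIN-0042",
    "username": "jdoe",
    "postal_code": "K1A 0B1",
    "profession": "nurse",
}
masked = mask_direct_identifiers(record, EXAMPLE_KEY)
```

Passing `drop=True` removes the direct identifiers instead of replacing them. Either way, the postal code and profession remain in the output, which is why the indirect identifiers still need their own treatment.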

How to De-identify Linked Data Effectively

De-identifying linked data to a high standard cannot be achieved using the Safe Harbor method, which focuses on removing 18 specified data elements. Responsible data sharing requires a risk-based approach, like Expert Determination, which relies on the expert use of statistical principles to render information non-identifying. Expert Determination quantifies the re-identification risk that the indirect identifiers present in the data create.
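One simple way to see how indirect identifiers drive risk is to group records that share the same combination of indirect identifiers and treat each record's risk as one divided by the size of its group. The sketch below is a deliberately simplified illustration under that assumption; it is not the Expert Determination methodology itself, and the function name and sample records are hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Estimate risk from the sizes of groups sharing the same indirect identifiers.

    Each record's risk is 1 / (number of records with the same values for
    the chosen fields). A group of size 1 means that record is unique.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    per_record = [1.0 / class_sizes[k] for k in keys]
    return {
        "max_risk": max(per_record),
        "avg_risk": sum(per_record) / len(per_record),
        "smallest_class": min(class_sizes.values()),
    }


records = [
    {"postal_code": "K1A 0B1", "birth_year": 1980, "diagnosis": "asthma"},
    {"postal_code": "K1A 0B1", "birth_year": 1980, "diagnosis": "diabetes"},
    {"postal_code": "K2P 1L4", "birth_year": 1975, "diagnosis": "asthma"},
    {"postal_code": "K2P 1L4", "birth_year": 1975, "diagnosis": "asthma"},
]

two_fields = reidentification_risk(records, ["postal_code", "birth_year"])
three_fields = reidentification_risk(records, ["postal_code", "birth_year", "diagnosis"])
```

In this toy dataset, adding the diagnosis as a third indirect identifier raises the maximum risk from 0.5 to 1.0, because two records become unique. That is the effect linking more datasets tends to have.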

De-identification techniques (generalization, aggregation, shuffling and randomization) can then be applied to reduce the data’s identifiability to a degree consistent with precedents set out by reputable data organizations, like the Centers for Disease Control. The ability to reliably anonymize patient data while retaining high data quality is the reason that leading data organizations from around the world, like the Institute of Medicine, HITRUST, PhUSE and the Council of Canadian Academies, have all recommended a risk-based approach, like Expert Determination, to de-identify data. Determining how to address privacy issues is one of the major barriers to organizations moving ahead with Big Data Analytics (BDA) initiatives. Companies that want to pursue opportunities with BDA need to establish strong privacy practices in addition to implementing risk-based data de-identification.
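Of the techniques listed above, generalization is the easiest to illustrate. The sketch below coarsens a postal code to its first three characters and a date of birth to its year; the field names, the helper functions and the chosen level of coarsening are assumptions for illustration, and an appropriate level in practice depends on the measured risk.

```python
def generalize_postal_code(postal_code, keep=3):
    """Keep only the first `keep` characters of a postal code."""
    return postal_code.replace(" ", "")[:keep]


def generalize_date_of_birth(iso_date):
    """Reduce an ISO date string (YYYY-MM-DD) to its year."""
    return iso_date[:4]


def generalize_record(record):
    """Return a copy of `record` with coarsened indirect identifiers."""
    generalized = dict(record)
    generalized["postal_code"] = generalize_postal_code(record["postal_code"])
    generalized["date_of_birth"] = generalize_date_of_birth(record["date_of_birth"])
    return generalized


patients = [
    {"postal_code": "K1A 0B1", "date_of_birth": "1980-03-14", "diagnosis": "asthma"},
    {"postal_code": "K1A 0C9", "date_of_birth": "1980-11-02", "diagnosis": "asthma"},
]
generalized = [generalize_record(p) for p in patients]
```

Before generalization the two patients are distinguishable; afterwards they share the same postal region and birth year, so neither stands out, while region and age remain usable for analysis.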

Compliant, robust software solutions are available for organizations ready to harness the power of their data for BDA – watch the Privacy Analytics Eclipse overview today to learn how your organization can become a BDA machine.
