Using Linked Data for Big Data Analytics
As more and more data is linked from various sources – EHRs, financial data, social media, form data, wearables and apps – ordinary PII quickly becomes PHI. The challenge with linked data is that aggregating multiple datasets concentrates a significant amount of PHI in one place.
Linked data will include many direct identifiers, like Social Security number, health insurance number or username. It will also include numerous indirect identifiers, like postal code, profession, date of birth and diagnoses. De-identification requires masking or removing the direct identifiers so they cannot be used to easily re-identify someone. Indirect identifiers, on the other hand, are useful for analysis and need to be retained: they provide valuable insight into regional variations, socioeconomic impacts and behavioral influences on health. While these fields will also be de-identified, it is desirable to keep as much specificity as possible in them for analytic purposes. However, the more indirect identifiers associated with an individual, the easier it is to re-identify that person.
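To make the distinction concrete, here is a minimal sketch of masking direct identifiers while leaving indirect identifiers untouched. The field names, the salted-hash pseudonymization scheme, and the sample record are illustrative assumptions, not part of any specific product or standard:

```python
import hashlib

# Hypothetical direct identifiers: removed or replaced with irreversible pseudonyms.
# Indirect identifiers (postal code, date of birth, diagnosis) are kept for analysis.
DIRECT_IDENTIFIERS = {"ssn", "health_insurance_number", "username"}

def mask_direct_identifiers(record, salt="example-salt"):
    """Return a copy of the record with direct identifiers pseudonymized."""
    masked = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            # Salted hash: links records belonging to the same person across
            # datasets without exposing the original value.
            masked[field] = hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]
        else:
            masked[field] = value
    return masked

patient = {
    "ssn": "123-45-6789",
    "username": "jdoe42",
    "postal_code": "K1A 0B1",
    "date_of_birth": "1984-06-02",
    "diagnosis": "type 2 diabetes",
}
print(mask_direct_identifiers(patient))
```

Note that pseudonymizing direct identifiers alone is not de-identification: the retained indirect identifiers are exactly what a risk-based approach must then measure and treat.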
How to De-identify Linked Data Effectively
De-identifying linked data to a high standard cannot be achieved with the Safe Harbor method, which focuses on removing 18 specified data elements. Responsible data sharing requires a risk-based approach, like Expert Determination, which relies on expert application of statistical principles to render information non-identifying. Expert Determination scientifically measures the re-identification risk posed by the indirect identifiers in the data.
De-identification techniques (generalization, aggregation, shuffling and randomization) can then be applied to reduce the data’s identifiability to a degree consistent with precedents set by reputable data organizations, like the Centers for Disease Control and Prevention. The ability to reliably anonymize patient data while retaining high data quality is the reason that leading data organizations from around the world – the Institute of Medicine, HITRUST, PhUSE and the Canadian Council of Academies among them – have all recommended a risk-based approach, like Expert Determination, to de-identify data.

Determining how to address privacy issues is one of the major barriers to organizations moving ahead with Big Data Analytics (BDA) initiatives. Companies that want to pursue opportunities with BDA need to establish strong privacy practices in addition to implementing risk-based data de-identification.
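As a rough illustration of how generalization and risk measurement fit together, the sketch below coarsens two indirect identifiers (postal code to its first three characters, date of birth to year) and then estimates worst-case re-identification risk as one over the size of the smallest group of indistinguishable records – a simplified, k-anonymity-style measure, not the full methodology an Expert Determination would apply. Field names and sample records are assumptions for the example:

```python
from collections import Counter

def generalize(record):
    # Generalization: coarsen indirect identifiers to reduce identifiability.
    return (
        record["postal_code"][:3],    # region prefix instead of full postal code
        record["date_of_birth"][:4],  # year of birth instead of full date
        record["diagnosis"],
    )

def max_reidentification_risk(records):
    """Risk for the most exposed record: 1 / size of its smallest
    equivalence class under the generalized indirect identifiers."""
    groups = Counter(generalize(r) for r in records)
    return 1 / min(groups.values())

records = [
    {"postal_code": "K1A 0B1", "date_of_birth": "1984-06-02", "diagnosis": "asthma"},
    {"postal_code": "K1A 2C3", "date_of_birth": "1984-11-30", "diagnosis": "asthma"},
    {"postal_code": "M5V 1A1", "date_of_birth": "1990-01-15", "diagnosis": "diabetes"},
    {"postal_code": "M5V 9Z9", "date_of_birth": "1990-07-07", "diagnosis": "diabetes"},
]
print(max_reidentification_risk(records))  # → 0.5: every record shares its
                                           # generalized profile with one other
```

In practice, an expert iterates: if the measured risk exceeds the threshold set by precedent, identifiers are generalized further (or records suppressed) until the risk is acceptably low while the data remains analytically useful.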
Compliant, robust software solutions are available for organizations ready to harness the power of their data for BDA – watch the Privacy Analytics Eclipse overview today to learn how your organization can become a BDA machine.