Anonymization at Scale: From Dataset to Pipeline

An article by Luk Arbuckle, Chief Methodologist, Privacy Analytics

In the evolving landscape of data enablement, the need to leverage data for driving efficiency, spurring innovation, and fostering collaboration has never been greater. This pursuit should be balanced with the escalating demands of information governance and privacy. As organizations delve into the intricacies of data utilization, the need for a robust and practical approach to anonymizing data emerges as a focal point. We explore the journey from evaluating a single dataset, to managing a pool of data, culminating in a system flow designed to produce consistently anonymized data, all underpinned by best practices aligned with international standard ISO/IEC 27559 De-identification Framework.

Readiness for international best practices

At the heart of responsible data management is the ISO/IEC 27559 standard, which provides a comprehensive framework for de-identifying personal data, including threat analysis, adversary testing, and governance of practices. The standard helps organizations identify and mitigate the risks associated with anonymized or de-identified data, and it establishes global best practices for data reuse and sharing. In a world where data about people is valued and responsibly treated as a protected asset, adhering to this standard helps organizations leverage this data while maintaining public trust and upholding regulatory norms.

Understanding the breadth of international best practices is critical, as is differentiating between the options and design parameters that will achieve alignment with key stakeholders and meet regulatory expectations. The approach is not confined to a single dataset: it extends to pools of data and to systems that generate a flow of anonymized data (which we can call a data pipeline for short). This adaptability makes it a cornerstone for organizations aiming to meet international benchmarks in data security and privacy.

As shown in this diagram, as an organization collects more anonymized or de-identified datasets, the expectations outlined in international best practices will also increase. Evaluating a single dataset typically involves a third-party threat assessment (which we call re-identification risk determination) and adversary testing (also known as motivated intruder testing); evaluating a data pool, such as a data warehouse or data lake, will add the need for policy development and formal governance practices (specific to anonymized or de-identified data); evaluating or designing a data pipeline that produces such datasets will encompass all of the previous and more, since it could also involve more complex data flows and the creation of algorithms.
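The cumulative nature of these expectations can be sketched in a few lines of Python. This is only an illustration of the paragraph above, not part of ISO/IEC 27559 or of any tool: the names `Scope`, `ADDED_AT`, and `expected_activities` are hypothetical, and the two pipeline entries are an assumed paraphrase of "more complex data flows and the creation of algorithms".

```python
from enum import IntEnum
from typing import Dict, List


class Scope(IntEnum):
    """Scale of anonymization effort, ordered from smallest to largest."""
    DATASET = 1
    DATA_POOL = 2
    DATA_PIPELINE = 3


# Activities first introduced at each scope (hypothetical labels,
# paraphrased from the article; not an official checklist).
ADDED_AT: Dict[Scope, List[str]] = {
    Scope.DATASET: [
        "re-identification risk determination",
        "adversary (motivated intruder) testing",
    ],
    Scope.DATA_POOL: [
        "policy development",
        "formal governance practices",
    ],
    Scope.DATA_PIPELINE: [
        "review of complex data flows",   # assumed wording
        "review of algorithm creation",   # assumed wording
    ],
}


def expected_activities(scope: Scope) -> List[str]:
    """Return every activity expected at `scope`, including those
    inherited from all smaller scopes."""
    activities: List[str] = []
    for level in Scope:  # iterates in definition order: smallest first
        if level <= scope:
            activities.extend(ADDED_AT[level])
    return activities


if __name__ == "__main__":
    for scope in Scope:
        print(scope.name, expected_activities(scope))
```

Calling `expected_activities(Scope.DATA_POOL)` returns the two dataset-level activities followed by the two pool-level ones, which mirrors how each larger scope inherits everything expected of the smaller ones.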

Preparing for a shifting regulatory landscape

With the advent of stringent enforcement mechanisms in new legislation, such as US state laws and the European Health Data Space (EHDS), the urgency to adhere to these practices is amplified. Alignment with global best practices in data reuse and sharing, outlined by the framework in ISO/IEC 27559 with implementation options described in regional guidance, is both a regulatory requirement and a distinct advantage. The standard and guidance serve as a yardstick against which the effectiveness of anonymization and de-identification processes is evaluated.

Think of international standard ISO/IEC 27559 as the scaffolding around which anonymization solutions can be built. However, other standards and guidance may also come into play depending on the scale of operations. For example, there may be a need to consider adherence to data protection and privacy legislation for platform or pipeline design for the lifecycle of data. Depending on how the resulting data is then used, this may also include emerging expectations for AI enablement, including risk and impact assessments, governance, and risk management in general.

Bridging the gap to best practices

By mapping your existing approach to ISO standards and guidance, we offer a clear pathway to best practices, ensuring that your organization is prepared for today’s challenges and is future-proofed against an evolving regulatory landscape. Through a detailed gap analysis, options summary, and the development of an implementation roadmap, our technical experts identify discrepancies and provide actionable insights to achieve alignment with international best practices and guidance.

Navigating the complex terrain of anonymized or de-identified data requires an intentional approach, one that balances the dual goals of leveraging data for organizational excellence and adhering to stringent privacy standards. Our design and engineering services, grounded in the ISO/IEC 27559 framework, as well as national and international guidance, offer a comprehensive solution that empowers your organization to achieve this balance, fostering responsible data reuse and sharing in the global arena.
