Learn the Basics of Data Tokenization and Linkage

An article by Brian Rasquinha, Associate Director, Solution Architecture, Privacy Analytics

Increasingly, organizations are looking to link different data assets at the level of specific people to drive benefits for both business and society. To do this safely, they must also ensure they protect the privacy of the people represented in the data. Linking different sensitive personal data assets requires matching people across the datasets involved.

Tokenization is one approach that can enable the safe linkage of different datasets while hiding the identifying information used for matching in order to protect individual privacy. Organizations typically apply tokenization to reliable identifiers such as names, national IDs, or detailed demographics that exist in both datasets to safely link those datasets together.

Tokenization builds scrambled pieces of text, called tokens, by processing input text. Typically, identifiers (like a name, date of birth, or postcode) or portions of identifiers are combined and cryptographically processed to produce a token. For example, a token could be generated by combining first name, last name, and date of birth. Note that small changes to the identifiers cause drastic changes to the token.

First Name   Last Name   DOB            Token1 Input     Token1
Michael      Bluth       Dec 12, 1967   MICBLU19671214   29e2c5917ac1b7faa61f41fd1b9510262098e66f48411106186ef446358ebf2c
Michael      Bluth       Dec 11, 1967   MICBLU19671114   b796c8dcb32fd1177183a0d8be5cdf2e297c8f09cb77675f8a033ea95bf74edf
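
To make the mechanics concrete, here is a minimal sketch in Python of how identifiers could be combined and hashed into a token. The normalization (first three letters of each name plus the date of birth) and the choice of SHA-256 are illustrative assumptions rather than a description of any specific product, so the digests it produces will not necessarily match the table above.

import hashlib
from datetime import date

def token_input(first_name: str, last_name: str, dob: date) -> str:
    # Illustrative normalization: first three letters of each name,
    # upper-cased, followed by the date of birth as YYYYMMDD.
    return f"{first_name[:3].upper()}{last_name[:3].upper()}{dob:%Y%m%d}"

def make_token(input_text: str) -> str:
    # Cryptographic hash of the normalized input; a tiny change to the
    # input produces a completely different digest (the avalanche effect).
    return hashlib.sha256(input_text.encode("utf-8")).hexdigest()

print(make_token(token_input("Michael", "Bluth", date(1967, 12, 12))))
print(make_token(token_input("Michael", "Bluth", date(1967, 12, 11))))  # drastically different token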

Linkage, when applied after tokenization, is the process of matching records across datasets using their tokens. The most straightforward approach is to link records that match exactly on all tokens.

DATASET 1

Token1 Token2 Token3
87f82 97908 733e2
8dba7 958f7 5d2e0
163a8 c0eae 27992

DATASET 2

Token1 Token2 Token3
ea2ed c9bed 2fa84
87f82 97908 733e2
98f7b 68295 8079b
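
In the tables above, the first record of DATASET 1 and the second record of DATASET 2 carry the same three token values (87f82, 97908, 733e2), so those two records would be linked. A minimal sketch of exact-match linkage, assuming each dataset is a list of records with the same token columns (illustrative Python, not a specific product's API):

from typing import Dict, List, Tuple

# Token values truncated for readability, as in the tables above.
dataset1: List[Dict[str, str]] = [
    {"Token1": "87f82", "Token2": "97908", "Token3": "733e2"},
    {"Token1": "8dba7", "Token2": "958f7", "Token3": "5d2e0"},
    {"Token1": "163a8", "Token2": "c0eae", "Token3": "27992"},
]
dataset2: List[Dict[str, str]] = [
    {"Token1": "ea2ed", "Token2": "c9bed", "Token3": "2fa84"},
    {"Token1": "87f82", "Token2": "97908", "Token3": "733e2"},
    {"Token1": "98f7b", "Token2": "68295", "Token3": "8079b"},
]

def exact_links(ds1, ds2, keys=("Token1", "Token2", "Token3")) -> List[Tuple[int, int]]:
    # Index dataset 2 by its full tuple of tokens, then look up each
    # dataset 1 record; only records agreeing on every token are linked.
    index = {tuple(rec[k] for k in keys): j for j, rec in enumerate(ds2)}
    return [(i, index[tuple(rec[k] for k in keys)])
            for i, rec in enumerate(ds1)
            if tuple(rec[k] for k in keys) in index]

print(exact_links(dataset1, dataset2))  # [(0, 1)]: the pair sharing all three tokens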

Probabilistic linkage allows for inexact matching, tolerating minor discrepancies in the source data. It can link datasets where the source data contains typos, alternate spellings, or identifiers that change over time (e.g., a new home address, phone number, or email). While this approach adds flexibility in matching, it also carries the risk of making incorrect pairings.
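
Real probabilistic linkage scores agreement and disagreement field by field (Fellegi-Sunter-style weighting is a common formulation) and accepts pairs above a score threshold. As a much-simplified sketch of tolerating partial agreement, the rule below links a pair when at least two of three token columns match; the threshold and column names are illustrative assumptions.

def partial_links(ds1, ds2, keys=("Token1", "Token2", "Token3"), min_agree=2):
    # Link a pair of records when at least `min_agree` token columns agree.
    # This stands in for probabilistic scoring, which would weight each
    # field by how discriminating agreement on that field actually is.
    links = []
    for i, r1 in enumerate(ds1):
        for j, r2 in enumerate(ds2):
            agree = sum(r1[k] == r2[k] for k in keys)
            if agree >= min_agree:
                links.append((i, j, agree))
    return links

# With dataset1 and dataset2 from the previous sketch, partial_links(dataset1, dataset2)
# returns [(0, 1, 3)]. Lowering min_agree adds flexibility but also increases the
# chance of incorrect pairings.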

Extending to de-identification and anonymization

Tokenization and linkage are often part of an initiative to produce de-identified or anonymized datasets. While secure tokenization and linkage can effectively disguise the identifiers used to generate tokens, this isn’t enough to claim that people in a dataset are unlikely to be identified.

It’s still possible that other information in the dataset could be used to identify people. So, to determine whether a dataset is de-identified or anonymized, a separate analysis is performed on the resulting linked dataset to ensure the chance of identification is sufficiently remote or to determine what changes to the data or workflow are necessary.

Protecting the security of tokenization

Securely generating tokens requires a good cryptographic function and good cryptographic key management. Standards organizations like ISO and NIST provide guidance on appropriately secure cryptographic algorithms that would be extremely unlikely to be reversible by even a capable adversary.

A tokenization algorithm will usually allow you to set secret keys (called a “salt” in some applications), which tie the tokens to that key: the same input produces different tokens under different keys, and an adversary who does not know the key cannot reproduce or verify tokens.
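
As a sketch of keyed token generation, HMAC-SHA-256 ties each token to a secret key as well as to the input; the key values and input string below are purely illustrative.

import hashlib
import hmac

def keyed_token(input_text: str, secret_key: bytes) -> str:
    # The token depends on both the input and the secret key: without the
    # key, an adversary cannot recompute or verify tokens by brute-forcing
    # candidate names and dates of birth.
    return hmac.new(secret_key, input_text.encode("utf-8"), hashlib.sha256).hexdigest()

t1 = keyed_token("MICBLU19671214", b"example-key-for-project-A")
t2 = keyed_token("MICBLU19671214", b"example-key-for-project-B")
assert t1 != t2  # identical input, different keys -> different tokens

Within a single linkage project, the same key must be applied to both datasets, otherwise the tokens for the same person will not match.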

Both the secret keys and the tokenized data must be managed carefully. If an adversary gains access to the keys, reversing the tokens becomes much easier, and in some cases trivial. Even when keys are kept very secure, using private tokens to link de-identified data can increase the risk of identifying people in the dataset, as described above.

Also, if tokens or IDs are associated with some identified people, these could serve as a gateway for adversaries to exploit vulnerabilities in the tokenization system (for example, they may be able to reverse-engineer elements of the process).

Summary of concepts/definitions:

  • Tokenization is the process of replacing identifiers with tokens that maintain the distinctiveness of the original identifiers while hiding their actual values.
  • Linkage is the process of matching individuals using tokens, which can be done exactly or using probabilistic methods that allow for some discrepancies.
  • A dataset built from tokenized and linked data isn’t necessarily de-identified or anonymized, even if the source data is. A separate analysis is required to make this determination about the linked data.
  • Secret keys, used within the tokenization algorithm, tie the generated tokens to the key as well as the input: identical inputs produce different tokens under different keys, and tokens cannot be reproduced without the key.

Have questions? We have answers. Contact the experts at Privacy Analytics to learn more about data tokenization and linkage, including architecture, approaches, and tools for combining data assets.
