Increasingly, organizations are looking to link different data assets at the level of specific people to drive benefits for both business and society. To do this safely, they must also ensure they protect the privacy of the people represented in the data. Linking different sensitive personal data assets requires matching people across the datasets involved.
Tokenization is one approach that can enable the safe linkage of different datasets while hiding the identifying information used for matching in order to protect individual privacy. Organizations typically apply tokenization to reliable identifiers such as names, national IDs, or detailed demographics that exist in both datasets to safely link those datasets together.
Tokenization builds scrambled strings of text, called tokens, by cryptographically processing input text. Typically, identifiers (such as a name, date of birth, or postcode), or portions of identifiers, are combined and cryptographically processed to produce a token. For example, a token could be generated by combining first name, last name, and date of birth. Note that even small changes to the identifiers cause drastic changes to the token.
| First Name | Last Name | DOB | Token1 Input | Token1 |
|---|---|---|---|---|
| Michael | Bluth | Dec 14, 1967 | MICBLU19671214 | 29e2c5917ac1b7faa61f41fd1b9510262098e66f48411106186ef446358ebf2c |
| Michael | Bluth | Nov 14, 1967 | MICBLU19671114 | b796c8dcb32fd1177183a0d8be5cdf2e297c8f09cb77675f8a033ea95bf74edf |
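To make the mechanics concrete, here is a minimal sketch of token generation in Python. It assumes a keyed hash (HMAC-SHA256), a hypothetical secret key, and the same normalization as the table above (first three letters of each name plus the date of birth as YYYYMMDD). These choices are illustrative rather than any particular product's implementation, so the output will not reproduce the token values shown in the table.

```python
import hmac
import hashlib

# Hypothetical secret key; in practice this would come from a secure key store.
SECRET_KEY = b"replace-with-a-strong-randomly-generated-key"

def make_token(first_name: str, last_name: str, dob_yyyymmdd: str) -> str:
    """Build a token from a normalized combination of identifiers.

    The normalization (uppercasing, truncating names to three letters) mirrors
    the MICBLU19671214-style inputs in the table above; real systems define
    their own rules.
    """
    token_input = f"{first_name[:3]}{last_name[:3]}{dob_yyyymmdd}".upper()
    # Keyed hash (HMAC-SHA256): the same input and key always yield the same
    # token, but the token cannot be recreated without the key.
    return hmac.new(SECRET_KEY, token_input.encode("utf-8"), hashlib.sha256).hexdigest()

print(make_token("Michael", "Bluth", "19671214"))
print(make_token("Michael", "Bluth", "19671114"))  # one field changed -> very different token
```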
Linkage, when applied after tokenization, is the process of assigning matches between the tokens. The most straightforward approach is to link records that match exactly on all tokens, as in the two datasets below.
DATASET 1

| Token1 | Token2 | Token3 |
|---|---|---|
| 87f82 | 97908 | 733e2 |
| 8dba7 | 958f7 | 5d2e0 |
| 163a8 | c0eae | 27992 |

DATASET 2

| Token1 | Token2 | Token3 |
|---|---|---|
| ea2ed | c9bed | 2fa84 |
| 87f82 | 97908 | 733e2 |
| 98f7b | 68295 | 8079b |
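As a simple illustration of exact linkage, the sketch below uses the token values from the tables above (with hypothetical record IDs added) and links two records only when every token matches.

```python
# A minimal sketch of exact (deterministic) linkage: two records are treated as
# the same person only if all of their tokens match. IDs are illustrative.
dataset1 = [
    {"id": "d1-001", "Token1": "87f82", "Token2": "97908", "Token3": "733e2"},
    {"id": "d1-002", "Token1": "8dba7", "Token2": "958f7", "Token3": "5d2e0"},
    {"id": "d1-003", "Token1": "163a8", "Token2": "c0eae", "Token3": "27992"},
]
dataset2 = [
    {"id": "d2-001", "Token1": "ea2ed", "Token2": "c9bed", "Token3": "2fa84"},
    {"id": "d2-002", "Token1": "87f82", "Token2": "97908", "Token3": "733e2"},
    {"id": "d2-003", "Token1": "98f7b", "Token2": "68295", "Token3": "8079b"},
]

TOKEN_COLS = ("Token1", "Token2", "Token3")

def token_key(record):
    """Return the full tuple of tokens for a record."""
    return tuple(record[c] for c in TOKEN_COLS)

# Index dataset 1 by its token tuple, then look up each record in dataset 2.
index = {token_key(r): r["id"] for r in dataset1}
matches = [(index[token_key(r)], r["id"]) for r in dataset2 if token_key(r) in index]
print(matches)  # [('d1-001', 'd2-002')] -- the shared row links the two datasets
```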
Probabilistic linkage allows for inexact matching, tolerating minor discrepancies in the source data. It can link datasets even when the source data contain typos, alternate spellings, or identifiers that change over time (e.g., home address, phone number, or email). While this approach adds flexibility in matching, it also carries the risk of making incorrect pairings.
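One common simplification of probabilistic linkage over tokenized data is to generate several tokens per person from different identifier combinations and accept a pair when enough of them agree. The sketch below uses made-up weights and a made-up threshold; real probabilistic linkage estimates these from the data (for example, with a Fellegi-Sunter style model) rather than hard-coding them.

```python
# Illustrative weights and threshold -- not derived from any real dataset.
TOKEN_WEIGHTS = {"Token1": 0.5, "Token2": 0.3, "Token3": 0.2}
MATCH_THRESHOLD = 0.7

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum the weights of the token fields on which two records agree."""
    return sum(w for col, w in TOKEN_WEIGHTS.items() if rec_a.get(col) == rec_b.get(col))

def link(dataset_a, dataset_b):
    """Return candidate pairs whose agreement score clears the threshold."""
    return [
        (a["id"], b["id"], score)
        for a in dataset_a
        for b in dataset_b
        if (score := match_score(a, b)) >= MATCH_THRESHOLD
    ]

# Two records that agree on two of three tokens still link under this scheme.
a = {"id": "d1-001", "Token1": "87f82", "Token2": "97908", "Token3": "733e2"}
b = {"id": "d2-004", "Token1": "87f82", "Token2": "97908", "Token3": "11111"}
print(match_score(a, b))  # 0.8 -> accepted despite the discrepancy
```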
Extending to de-identification and anonymization
Tokenization and linkage are often part of an initiative to produce de-identified or anonymized datasets. While secure tokenization and linkage can effectively disguise the identifiers used to generate tokens, this isn’t enough to claim that people in a dataset are unlikely to be identified.
It’s still possible that other information in the dataset could be used to identify people. So, to determine whether a dataset is de-identified or anonymized, a separate analysis is performed on the resulting linked dataset to ensure the chance of identification is sufficiently remote or to determine what changes to the data or workflow are necessary.
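One small, illustrative piece of such an analysis is checking how many records share each combination of quasi-identifiers: a group of size one means a record is unique on those attributes and therefore easier to single out. The fields below are assumptions chosen for illustration, and a real assessment considers far more than this single measure.

```python
from collections import Counter

# Hypothetical linked records with a few quasi-identifiers retained.
records = [
    {"yob": 1967, "postcode_prefix": "K1A", "sex": "M"},
    {"yob": 1967, "postcode_prefix": "K1A", "sex": "M"},
    {"yob": 1982, "postcode_prefix": "M5V", "sex": "F"},
]

QUASI_IDENTIFIERS = ("yob", "postcode_prefix", "sex")

# Count how many records fall into each quasi-identifier combination.
group_sizes = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
smallest_group = min(group_sizes.values())
print(f"Smallest group size (k): {smallest_group}")  # k = 1 here -> a unique, higher-risk record
```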
Protecting the security of tokenization
Securely generating tokens requires a good cryptographic function and good cryptographic key management. Standards organizations like ISO and NIST provide guidance on appropriately secure cryptographic algorithms that would be extremely unlikely to be reversible by even a capable adversary.
A tokenization algorithm will usually allow you to set a secret key (called a “salt” in some applications). With the key held constant, the same input always produces the same token, which is what makes linkage possible; without the key, an adversary cannot recreate the tokens, and tokens generated under different keys cannot be matched against one another.
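The sketch below illustrates why this matters for linkage, using HMAC-SHA256 as one example of a keyed construction; the key names and values are placeholders.

```python
import hmac
import hashlib

def keyed_token(secret_key: bytes, token_input: str) -> str:
    # HMAC-SHA256: deterministic for a given key, unpredictable without it.
    return hmac.new(secret_key, token_input.encode("utf-8"), hashlib.sha256).hexdigest()

key_a = b"project-a-secret-key"
key_b = b"project-b-secret-key"

# Same input and same key -> identical tokens, so records can still be linked.
print(keyed_token(key_a, "MICBLU19671214") == keyed_token(key_a, "MICBLU19671214"))  # True

# Same input but a different key -> a completely different token, so tokens
# from one project cannot be matched against another without the key.
print(keyed_token(key_a, "MICBLU19671214") == keyed_token(key_b, "MICBLU19671214"))  # False
```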
Both the secret keys and tokenized data must be managed carefully. If an adversary accesses the keys, reversing the tokens becomes much easier—or sometimes trivial! Even when keys are kept very secure, using private tokens to link de-identified data can increase the risk of identifying people in the dataset, as described above.
Also, if tokens or IDs are associated with some identified people, these could serve as a gateway for adversaries to exploit vulnerabilities in the tokenization system (for example, they may be able to reverse-engineer elements of the process).
Summary of concepts/definitions:
- Tokenization is the process of replacing identifiers with tokens that maintain the distinctiveness of the original identifiers while hiding their actual values.
- Linkage is the process of matching individuals using tokens, which can be done exactly or using probabilistic methods that allow for some discrepancies.
- A dataset built from tokenized and linked data isn’t necessarily de-identified or anonymized, even if the source data is. A separate analysis is required to make this determination about the linked data.
- Secret keys, used with the tokenization algorithm, ensure that tokens cannot be recreated or matched without the key, even when the underlying inputs are identical.
Have questions? We have answers. Contact the experts at Privacy Analytics to learn more about data tokenization and linkage, including architecture, approaches, and tools for combining data assets.