Increasingly, organizations are looking to link different data assets at the level of specific people to drive benefits for both business and society. To do this safely, they must also ensure they protect the privacy of the people represented in the data. Linking different sensitive personal data assets requires matching people across the datasets involved.
Tokenization is one approach that can enable the safe linkage of different datasets while hiding the identifying information used for matching in order to protect individual privacy. Organizations typically apply tokenization to reliable identifiers such as names, national IDs, or detailed demographics that exist in both datasets to safely link those datasets together.
Tokenization builds scrambled strings of text, called tokens, by cryptographically processing input text. Typically, identifiers (such as a name, date of birth, or postcode), or portions of identifiers, are combined and cryptographically processed to produce a token. For example, a token could be generated by combining first name, last name, and date of birth. Note that even small changes to the identifiers cause drastic changes to the token.
| First Name | Last Name | DOB | Token1 Input | Token1 |
|---|---|---|---|---|
| Michael | Bluth | Dec 14, 1967 | MICBLU19671214 | 29e2c5917ac1b7faa61f41fd1b9510262098e66f48411106186ef446358ebf2c |
| Michael | Bluth | Nov 14, 1967 | MICBLU19671114 | b796c8dcb32fd1177183a0d8be5cdf2e297c8f09cb77675f8a033ea95bf74edf |
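To make the mechanics concrete, here is a minimal sketch of token generation in Python. It assumes a keyed hash (HMAC-SHA256), a hypothetical secret key, and the same normalization as the table above (first three letters of each name plus the date of birth as YYYYMMDD). These choices are illustrative rather than any particular product's implementation, so the output will not reproduce the token values shown in the table.

```python
import hmac
import hashlib

# Hypothetical secret key; in practice this would come from a secure key store.
SECRET_KEY = b"replace-with-a-strong-randomly-generated-key"

def make_token(first_name: str, last_name: str, dob_yyyymmdd: str) -> str:
    """Build a token from a normalized combination of identifiers.

    The normalization (uppercasing, truncating names to three letters) mirrors
    the MICBLU19671214-style inputs in the table above; real systems define
    their own rules.
    """
    token_input = f"{first_name[:3]}{last_name[:3]}{dob_yyyymmdd}".upper()
    # Keyed hash (HMAC-SHA256): the same input and key always yield the same
    # token, but the token cannot be recreated without the key.
    return hmac.new(SECRET_KEY, token_input.encode("utf-8"), hashlib.sha256).hexdigest()

print(make_token("Michael", "Bluth", "19671214"))
print(make_token("Michael", "Bluth", "19671114"))  # one field changed -> very different token
```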
Linkage, when applied after tokenization, is the process of assigning matches between the tokens. The most straightforward approach is to link records that match exactly on all tokens, as in the two datasets below.
DATASET 1

| Token1 | Token2 | Token3 |
|---|---|---|
| 87f82 | 97908 | 733e2 |
| 8dba7 | 958f7 | 5d2e0 |
| 163a8 | c0eae | 27992 |

DATASET 2

| Token1 | Token2 | Token3 |
|---|---|---|
| ea2ed | c9bed | 2fa84 |
| 87f82 | 97908 | 733e2 |
| 98f7b | 68295 | 8079b |
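As a simple illustration of exact linkage, the sketch below uses the token values from the tables above (with hypothetical record IDs added) and links two records only when every token matches.

```python
# A minimal sketch of exact (deterministic) linkage: two records are treated as
# the same person only if all of their tokens match. IDs are illustrative.
dataset1 = [
    {"id": "d1-001", "Token1": "87f82", "Token2": "97908", "Token3": "733e2"},
    {"id": "d1-002", "Token1": "8dba7", "Token2": "958f7", "Token3": "5d2e0"},
    {"id": "d1-003", "Token1": "163a8", "Token2": "c0eae", "Token3": "27992"},
]
dataset2 = [
    {"id": "d2-001", "Token1": "ea2ed", "Token2": "c9bed", "Token3": "2fa84"},
    {"id": "d2-002", "Token1": "87f82", "Token2": "97908", "Token3": "733e2"},
    {"id": "d2-003", "Token1": "98f7b", "Token2": "68295", "Token3": "8079b"},
]

TOKEN_COLS = ("Token1", "Token2", "Token3")

def token_key(record):
    """Return the full tuple of tokens for a record."""
    return tuple(record[c] for c in TOKEN_COLS)

# Index dataset 1 by its token tuple, then look up each record in dataset 2.
index = {token_key(r): r["id"] for r in dataset1}
matches = [(index[token_key(r)], r["id"]) for r in dataset2 if token_key(r) in index]
print(matches)  # [('d1-001', 'd2-002')] -- the shared row links the two datasets
```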
Probabilistic linkage allows for inexact matching, tolerating minor discrepancies in the source data. It can link datasets even when the source data contain typos, alternate spellings, or identifiers that change over time (e.g., home address, phone number, or email). While this approach adds flexibility in matching, it also carries the risk of making incorrect pairings.
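One common simplification of probabilistic linkage over tokenized data is to generate several tokens per person from different identifier combinations and accept a pair when enough of them agree. The sketch below uses made-up weights and a made-up threshold; real probabilistic linkage estimates these from the data (for example, with a Fellegi-Sunter style model) rather than hard-coding them.

```python
# Illustrative weights and threshold -- not derived from any real dataset.
TOKEN_WEIGHTS = {"Token1": 0.5, "Token2": 0.3, "Token3": 0.2}
MATCH_THRESHOLD = 0.7

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum the weights of the token fields on which two records agree."""
    return sum(w for col, w in TOKEN_WEIGHTS.items() if rec_a.get(col) == rec_b.get(col))

def link(dataset_a, dataset_b):
    """Return candidate pairs whose agreement score clears the threshold."""
    return [
        (a["id"], b["id"], score)
        for a in dataset_a
        for b in dataset_b
        if (score := match_score(a, b)) >= MATCH_THRESHOLD
    ]

# Two records that agree on two of three tokens still link under this scheme.
a = {"id": "d1-001", "Token1": "87f82", "Token2": "97908", "Token3": "733e2"}
b = {"id": "d2-004", "Token1": "87f82", "Token2": "97908", "Token3": "11111"}
print(match_score(a, b))  # 0.8 -> accepted despite the discrepancy
```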
Extending to de-identification and anonymization
Tokenization and linkage are often part of an initiative to produce de-identified or anonymized datasets. While secure tokenization and linkage can effectively disguise the identifiers used to generate tokens, this isn’t enough to claim that people in a dataset are unlikely to be identified.
It’s still possible that other information in the dataset could be used to identify people. So, to determine whether a dataset is de-identified or anonymized, a separate analysis is performed on the resulting linked dataset to ensure the chance of identification is sufficiently remote or to determine what changes to the data or workflow are necessary.
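One small, illustrative piece of such an analysis is checking how many records share each combination of quasi-identifiers: a group of size one means a record is unique on those attributes and therefore easier to single out. The fields below are assumptions chosen for illustration, and a real assessment considers far more than this single measure.

```python
from collections import Counter

# Hypothetical linked records with a few quasi-identifiers retained.
records = [
    {"yob": 1967, "postcode_prefix": "K1A", "sex": "M"},
    {"yob": 1967, "postcode_prefix": "K1A", "sex": "M"},
    {"yob": 1982, "postcode_prefix": "M5V", "sex": "F"},
]

QUASI_IDENTIFIERS = ("yob", "postcode_prefix", "sex")

# Count how many records fall into each quasi-identifier combination.
group_sizes = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
smallest_group = min(group_sizes.values())
print(f"Smallest group size (k): {smallest_group}")  # k = 1 here -> a unique, higher-risk record
```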
Protecting the security of tokenization
Securely generating tokens requires a good cryptographic function and good cryptographic key management. Standards organizations like ISO and NIST provide guidance on appropriately secure cryptographic algorithms that would be extremely unlikely to be reversible by even a capable adversary.
A tokenization algorithm will usually allow you to set a secret key (called a “salt” in some applications). With the key held constant, the same input always produces the same token, which is what makes linkage possible; without the key, an adversary cannot recreate the tokens, and tokens generated under different keys cannot be matched against one another.
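The sketch below illustrates why this matters for linkage, using HMAC-SHA256 as one example of a keyed construction; the key names and values are placeholders.

```python
import hmac
import hashlib

def keyed_token(secret_key: bytes, token_input: str) -> str:
    # HMAC-SHA256: deterministic for a given key, unpredictable without it.
    return hmac.new(secret_key, token_input.encode("utf-8"), hashlib.sha256).hexdigest()

key_a = b"project-a-secret-key"
key_b = b"project-b-secret-key"

# Same input and same key -> identical tokens, so records can still be linked.
print(keyed_token(key_a, "MICBLU19671214") == keyed_token(key_a, "MICBLU19671214"))  # True

# Same input but a different key -> a completely different token, so tokens
# from one project cannot be matched against another without the key.
print(keyed_token(key_a, "MICBLU19671214") == keyed_token(key_b, "MICBLU19671214"))  # False
```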
Both the secret keys and tokenized data must be managed carefully. If an adversary accesses the keys, reversing the tokens becomes much easier—or sometimes trivial! Even when keys are kept very secure, using private tokens to link de-identified data can increase the risk of identifying people in the dataset, as described above.
Also, if tokens or IDs are associated with some identified people, these could serve as a gateway for adversaries to exploit vulnerabilities in the tokenization system (for example, they may be able to reverse-engineer elements of the process).
Summary of concepts/definitions:
- Tokenization is the process of replacing identifiers with tokens that maintain the distinctiveness of the original identifiers while hiding their actual values.
- Linkage is the process of matching individuals using tokens, which can be done exactly or using probabilistic methods that allow for some discrepancies.
- A dataset built from tokenized and linked data isn’t necessarily de-identified or anonymized, even if the source data is. A separate analysis is required to make this determination about the linked data.
- Secret keys, used with the tokenization algorithm, ensure that tokens cannot be recreated or matched without the key, even when the underlying inputs are identical.
Have questions? We have answers. Contact the experts at Privacy Analytics to learn more about data tokenization and linkage, including architecture, approaches, and tools for combining data assets.