Will ChatGPT Put Data Sharing at Risk?

An article by Devyani Biswal, Methodology Architect, Privacy Analytics, and Luk Arbuckle, Chief Methodologist, Privacy Analytics

Large language models (LLMs) have emerged as transformative tools, reshaping the landscape of natural language processing and understanding. Models of this type (such as BERT, LLaMA, and GPT-4) have seen unprecedented and rapid growth, with varying impacts across industries. LLMs offer numerous benefits, including improved text generation and comprehension that yields more precise and contextually relevant responses, advancing research across many domains.

These advances in artificial intelligence, especially their broad availability and capacity to hoover up internet sources, have raised concerns about anonymization approaches and the risk of re-identification: namely, linking disparate pieces of information whose relationship to one another was otherwise considered unknowable in the vast ocean of the internet. With such a powerful tool now available to the public, there is the potential for nefarious actors to use it in ways that undermine current practices for protecting identities.

Let the good times roll

Even before tools like ChatGPT emerged, LLMs were used extensively in various applications, including the anonymization of unstructured information found in documents. By leveraging the language processing capabilities of these models, identifiable information can be replaced or obfuscated while maintaining the overall structure and coherence of the original data. This is something we do for clients, for example, as part of clinical trial transparency, so that trial sponsors can provide the insightful data needed for regulatory submissions and to advance patient health.
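The replace-and-obfuscate step described above can be sketched as a minimal rule-based pass. This is an illustrative toy, not a production pipeline: the patterns and placeholder tags here are assumptions, and real anonymization of unstructured text covers many more entity types and relies on linguistic context (which is where LLMs help), not regular expressions alone.

```python
import re

# Hypothetical patterns for a minimal redaction pass; real pipelines
# handle names, addresses, institutions, and free-text clues as well.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with placeholder tags, keeping the
    surrounding sentence structure and coherence intact."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Patient seen on 2021-03-14; contact jane.doe@example.com or 555-123-4567."
print(redact(note))
# Patient seen on [DATE]; contact [EMAIL] or [PHONE].
```

The key property, preserved even in this toy version, is that the document remains readable after redaction, which is what makes the output useful for downstream analysis.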

Tools like ChatGPT can serve as junior copywriters and editors within organizations to quickly draft and review content. These models capitalize on the successes of well-known language processing tools like Grammarly and DeepL Write, employing sophisticated language generation capabilities to facilitate the production of content. While they provide valuable assistance, they are not infallible, and there is an art to skilfully prompting them for the best results. By framing prompts with clear instructions and context, organizations can improve the accuracy and relevance of the content generated, underscoring the importance of human guidance and expertise in using these tools to their full potential.

While previous models were primarily limited to researchers and developers, ChatGPT is publicly available with an intuitive interface that allows a wider range of people to take advantage of its language processing capabilities. This newfound accessibility empowers users to streamline tasks such as writing, content creation, and information retrieval. By making such a powerful language model available to a broader audience, ChatGPT fosters innovation, democratizes access to advanced language processing tools, and cultivates a more inclusive and collaborative environment for language-based endeavors.

Bad moon rising

The use of ChatGPT has sparked significant attention and raised concerns among data protection authorities, primarily due to the potential risks related to personal data usage and collection. As ChatGPT is a publicly accessible tool, there is the potential for misuse or unintended consequences stemming from the content it generates. To address these concerns, efforts are underway to develop models like privateGPT, allowing LLMs to be trained on new data without transmitting details back to the central server. This approach reduces the exposure of personal or confidential information.

The ability of publicly available LLMs, trained on vast amounts of publicly available data, to infer relationships and draw insights from gaps in information raises an as-yet untested hypothesis regarding re-identification risk. The hypothesis is twofold: these models may infer identities from otherwise disconnected public information, or they may be used to fill gaps in knowledge needed to infer identities from otherwise de-identified information. Either way, this could raise the bar on what is considered reasonable in terms of data sharing and assessments of identifiability by increasing the residual risks of identification.
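The classic linkage concern behind this hypothesis can be illustrated with a toy example: joining a de-identified table to public records on shared quasi-identifiers. Both tables and all field values below are fabricated for illustration; the worry raised above is that LLMs could carry out this kind of matching at scale, across far messier and less obviously connected sources.

```python
# Toy linkage attack: a unique match on quasi-identifiers re-identifies
# a de-identified record. All data here is fabricated.
deidentified = [
    {"zip": "90210", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "10001", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
]
public = [
    {"name": "Alice Smith", "zip": "90210", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Jones", "zip": "10001", "birth_year": 1975, "sex": "M"},
]

QUASI = ("zip", "birth_year", "sex")

def link(deid, pub, keys=QUASI):
    """Join records whose quasi-identifiers match exactly."""
    index = {tuple(r[k] for k in keys): r for r in pub}
    return [
        {**record, "name": index[tuple(record[k] for k in keys)]["name"]}
        for record in deid
        if tuple(record[k] for k in keys) in index
    ]

for match in link(deidentified, public):
    print(match["name"], "->", match["diagnosis"])
# Alice Smith -> asthma
# Bob Jones -> diabetes
```

Defenses against this attack, such as generalizing or suppressing quasi-identifiers, are exactly what anonymization standards formalize; the open question is how much an LLM's background knowledge changes the residual risk calculation.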

The potential concerns surrounding LLMs and the increased risk of re-identification raise questions about the implications for data sharing. While there is a risk that these concerns may inhibit data sharing, we are constantly monitoring the threat landscape and adjusting our practices accordingly. Organizations that have immediate concerns with their approach to anonymization or de-identification, due to the possible threats introduced by LLMs, can have their practices evaluated or conduct motivated intruder testing.

Beat it on down the line

Our participation in developing an international standard on anonymization, ISO/IEC 27559, was intended exactly to provide this type of assurance and guidance. As a framework for anonymization and de-identification, the standard incorporates best practices to monitor and evaluate threats, vulnerabilities, and attacks. For example, implementing contractual restrictions and mitigating controls can help limit access to, and use of, external data and tools that could otherwise compromise anonymity.

Incorporating LLMs such as ChatGPT into our motivated intruder tests could allow us to assess their effectiveness in making inferences and linking information. However, it is important to acknowledge that LLMs tend to generate false details and inferences when trained with limited information, leading to an increased number of false flags. Surprisingly, this property may even contribute to privacy preservation efforts, although further research is needed. As we gain more experience in using LLMs to support motivated intruder testing, we will have more insights to share regarding their usefulness in this context.

The interest in Privacy Enhancing Technologies (PETs), driven in part by government initiatives, provides an opportunity to explore pragmatic solutions that support the development and deployment of LLMs in practice. Here, the tendency of LLMs to generate false information raises questions about their reliability for certain tasks, and their creativity in others. New and evolving methods may emerge that indicate the confidence of their responses, with data protection and privacy enhancements certainly following close on the heels of their deployment.

Satisfaction

As the landscape of LLMs continues to evolve, ongoing research and engagement with data protection and privacy experts, public bodies, and regulators will help gain consensus on appropriate measures for responsible use. By maintaining a neutral perspective and considering the potential risks and opportunities, organizations can navigate the uncertainties surrounding LLMs and make informed decisions regarding privacy, data sharing, and technological advances.

Tools like ChatGPT pose certain risks when it comes to data sharing, due to many of the challenges mentioned in this article. These risks can create concerns for organizations that want to enable the use of these tools without compromising the privacy of individuals. Contact us to learn more and see how our advisory or consulting services can help you enable the safe and responsible use of protected data.
