Large language models (LLMs) have emerged as transformative tools, reshaping the landscape of natural language processing and understanding. Models such as BERT, LLaMA, and GPT-4 have seen unprecedented and rapid growth, with varying impacts across industries. LLMs offer numerous benefits, including improved text generation and comprehension, more precise and contextually relevant responses, and accelerated research across various domains.
These advances in artificial intelligence, especially the broad availability of models trained on vast swaths of internet sources, have raised concerns about anonymization approaches and the risk of re-identification: linking disparate pieces of information whose relationship to one another was otherwise considered unknowable in the vast ocean of the internet. With such a powerful tool now available to the public, there is the potential for nefarious actors to use it in ways that undermine current practices for protecting identities.
Let the good times roll
Well before the emergence of tools like ChatGPT, language models were used extensively in various applications, including the anonymization of unstructured information found in documents. By leveraging the language processing capabilities of these models, identifiable information can be replaced or obfuscated while maintaining the overall structure and coherence of the original data. This is something we do for clients, for example, as part of clinical trial transparency, so that trial sponsors can provide the insightful data needed for regulatory submissions and to advance patient health.
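As a simplified illustration of the replace-and-obfuscate idea (not our production pipeline, which relies on far broader coverage and trained entity recognition), a pattern-based pass might swap a few direct identifiers for labeled placeholders while leaving the surrounding text intact:

```python
import re

# Illustrative patterns for a few direct identifiers; a real de-identification
# pipeline would use trained NER models and much broader pattern coverage.
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "PHONE": r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",
    "DATE":  r"\b\d{1,2}/\d{1,2}/\d{4}\b",
}

def redact(text: str) -> str:
    """Replace matched identifiers with labeled placeholders,
    preserving the overall structure of the document."""
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

note = "Patient seen on 04/12/2021; contact jane.doe@example.com or 555-123-4567."
print(redact(note))
# → Patient seen on [DATE]; contact [EMAIL] or [PHONE].
```

The placeholders keep the sentence readable and grammatical, which is part of why this approach preserves the coherence of the original data.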
Tools like ChatGPT can serve as junior copywriters and editors within organizations to quickly draft and review content. These models capitalize on the successes of well-known language processing tools like Grammarly and DeepL Write, employing sophisticated language generation capabilities to facilitate the production of content. While they provide valuable assistance, they are not infallible, and there is an art to skilfully prompting them for the best results. By framing prompts with clear instructions and context, organizations can improve the accuracy and relevance of the content generated, underscoring the importance of human guidance and expertise in using these tools to their full potential.
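To illustrate what framing a prompt with clear instructions and context might look like (the template and names here are hypothetical, not a prescribed format), a simple helper can assemble a role, task, context, and constraints into a single structured prompt:

```python
# A hypothetical prompt template showing how clear instructions and
# context can frame a request to a model like ChatGPT.
def build_prompt(role: str, task: str, context: str, constraints: list[str]) -> str:
    lines = [
        f"You are {role}.",
        f"Task: {task}",
        f"Context: {context}",
        "Constraints:",
    ]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)

prompt = build_prompt(
    role="a careful copy editor",
    task="Tighten the paragraph below without changing its meaning.",
    context="The paragraph is part of a clinical trial summary for patients.",
    constraints=["Use plain language", "Keep all numbers unchanged"],
)
print(prompt)
```

Separating the role, task, context, and constraints makes it easier for a reviewer to see what the model was asked to do, which supports the human oversight described above.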
While previous models were primarily limited to researchers and developers, ChatGPT is publicly available with an intuitive interface that allows a wider range of people to take advantage of its language processing capabilities. This newfound accessibility empowers users to streamline tasks such as writing, content creation, and information retrieval. By making such a powerful language model available to a broader audience, ChatGPT fosters innovation, democratizes access to advanced language processing tools, and cultivates a more inclusive and collaborative environment for language-based endeavors.
Bad moon rising
The use of ChatGPT has sparked significant attention and raised concerns among data protection authorities, primarily due to the potential risks related to personal data usage and collection. As ChatGPT is a publicly accessible tool, there is the potential for misuse or unintended consequences stemming from the content it generates. To address these concerns, efforts are underway to develop tools like privateGPT, which allow LLMs to be used with local data without transmitting details to an external server. This approach reduces the exposure of personal or confidential information.
The ability of publicly available LLMs, trained on vast amounts of public data, to infer relationships and draw insights from gaps in information raises an as-yet untested hypothesis about re-identification risk: these models may infer identities from otherwise disconnected public information, or they may be used to fill in the gaps in knowledge needed to infer identities from otherwise de-identified information. This could raise the bar on what is considered reasonable in terms of data sharing and assessments of identifiability by increasing the residual risk of identification.
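The classic mechanism behind this hypothesis is a linkage attack: matching a "de-identified" record to public information on shared quasi-identifiers. A toy sketch with entirely invented data shows how a single overlap on zip code, birth year, and sex can be enough to re-identify a record:

```python
# A toy illustration of a linkage attack: joining a "de-identified" record
# with public information on shared quasi-identifiers. All data is invented.
deidentified = [
    {"zip": "02138", "birth_year": 1945, "sex": "F", "diagnosis": "hypertension"},
    {"zip": "90210", "birth_year": 1980, "sex": "M", "diagnosis": "asthma"},
]
public = [
    {"name": "A. Resident", "zip": "02138", "birth_year": 1945, "sex": "F"},
]

quasi = ("zip", "birth_year", "sex")
links = [
    (p["name"], d["diagnosis"])
    for d in deidentified
    for p in public
    if all(d[k] == p[k] for k in quasi)
]
print(links)  # a single match re-identifies the first record
```

The concern raised above is that an LLM trained on broad public data could play the role of the `public` table here, supplying the linking information without an attacker needing to assemble it by hand.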
The potential concerns surrounding LLMs and the increased risk of re-identification raise questions about the implications for data sharing. While there is a risk that these concerns may inhibit data sharing, we constantly monitor the threat landscape and adjust our practices accordingly. Organizations with immediate concerns about their approach to anonymization or de-identification, given the possible threats introduced by LLMs, can have their practices evaluated or conduct motivated intruder testing.
Beat it on down the line
Our participation in developing an international standard on anonymization, ISO/IEC 27559, was aimed at providing exactly this type of assurance and guidance. As a framework for anonymization and de-identification, the standard incorporates best practices for monitoring and evaluating threats, vulnerabilities, and attacks. For example, implementing contractual restrictions and mitigating controls can help limit access to, and use of, external data and tools that could otherwise compromise anonymity.
Enable the reuse and sharing of protected data
Incorporating LLMs such as ChatGPT into our motivated intruder tests could allow us to assess their effectiveness in making inferences and linking information. However, it is important to acknowledge that LLMs have a tendency to generate false details and inferences when working from limited information, leading to an increased number of false flags. Surprisingly, this property may even contribute to privacy preservation efforts, although further research is needed. As we gain more experience using LLMs to support motivated intruder testing, we will have more insights to share about their effectiveness as a tool in this context.
The interest in Privacy Enhancing Technologies (PETs), driven in part by government initiatives, provides an opportunity to explore pragmatic solutions to support the development and deployment of LLMs in practice. The tendency of LLMs to generate false information raises questions about their reliability for certain tasks, and their creativity in others. New and evolving methods may be introduced that convey the confidence of model responses, with data protection and privacy enhancements certain to follow closely on the heels of their deployment.
As the landscape of LLMs continues to evolve, ongoing research and engagement with data protection and privacy experts, public bodies, and regulators will help gain consensus on appropriate measures for responsible use. By maintaining a neutral perspective and considering the potential risks and opportunities, organizations can navigate the uncertainties surrounding LLMs and make informed decisions regarding privacy, data sharing, and technological advances.
Tools like ChatGPT pose certain risks when it comes to data sharing due to many of the challenges mentioned in this article. These risks can create concerns for organizations that want to enable the use of these tools without compromising the privacy of individuals. Contact us to learn more and see how our advisory or consulting services can help you enable the safe and responsible use of protected data.