Demand for medical images is rising across industries. One driver is the growing number of use cases, such as AI applications that support diagnostics, track disease progression, and help plan or gauge the effectiveness of interventions. Another is that organizations have matured in their reuse of structured data and are now looking to new data sources.
When it comes to de-identifying or anonymizing images—particularly DICOM images—it can be unclear what options are available.
In this article, we will focus on the DICOM data format, exploring the challenges inherent in de-identifying DICOM data and discussing the presently available solutions.
What is DICOM?
The Digital Imaging and Communications in Medicine (DICOM) data format is widely used in the healthcare industry to store and share medical images, such as X-rays, MRIs, and CT scans.
The DICOM standard guides DICOM data formatting and comprises two main elements:
- Header data, which is semi-structured data containing metadata about the image, as well as patient contact information, treatment details, and medical history, and
- The image data itself (e.g., an X-ray image), which is also known as “pixel data.”
Header data can contain different types of identifiable information—such as a patient’s name, date of birth, and demographics—as well as organization-specific ID numbers.
Left in place, identifying header data can make it easy for unauthorized individuals to single out patients, which can lead to a breach of patient privacy, identity theft, or other undesirable outcomes. This risk can be remedied with standard de-identification techniques.
The conventional part of de-identifying DICOM data: Headers
In practice, de-identifying DICOM images means removing identifiable information from the header data, image data, and file and folder names, or rendering that information non-identifiable.
As a starting point, it is necessary to transform direct identifiers like patient names, addresses, and ID numbers in the header data. If referential integrity is needed across DICOM images in a dataset because of multiple patient visits, or if linkage to other data modalities (e.g., structured data or clinical notes) is desired, replacement with synthetic values or encryption (for ID numbers) can be a viable option.
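As a rough illustration of preserving referential integrity, the sketch below replaces a patient ID with a stable pseudonym and blanks the other direct identifiers. It makes several assumptions: the header is represented as a plain dictionary rather than a parsed DICOM file, the `DIRECT_IDENTIFIERS` set and the function names are hypothetical, and a keyed hash (HMAC-SHA256) stands in for the encryption mentioned above. Unlike encryption, a keyed hash cannot be reversed, which may or may not suit a given project.

```python
import hashlib
import hmac

# Hypothetical, deliberately short list; a real profile covers many more attributes.
DIRECT_IDENTIFIERS = ("PatientName", "PatientAddress", "PatientID")


def pseudonym(value: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable pseudonym: the same value and key always give the same output."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def deidentify_header(header: dict, secret_key: bytes) -> dict:
    """Return a copy of the header with direct identifiers transformed."""
    cleaned = dict(header)
    for name in DIRECT_IDENTIFIERS:
        if name not in cleaned:
            continue
        if name == "PatientID":
            # A keyed pseudonym keeps images from the same patient linkable.
            cleaned[name] = pseudonym(str(cleaned[name]), secret_key)
        else:
            # No linkage is needed on these fields, so blank them.
            cleaned[name] = ""
    return cleaned
```

Because the pseudonym depends only on the original ID and the secret key, two images from different visits by the same patient receive the same replacement ID, provided the same key is used for the whole dataset. The key itself then needs to be protected like any other re-identification secret.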
Likewise, indirect identifiers such as patient DOB, age, and other demographics may need to be transformed via redaction, generalization, or replacement with synthetic values as appropriate to the laws or regulations governing the sharing and use of the data.
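Generalization of indirect identifiers might look something like the following sketch. It assumes the DICOM date format (`YYYYMMDD`) for birth dates and the DICOM age-string format (three digits plus a unit letter, e.g. `047Y`). The function names, the five-year band width, and the cap of 90 are illustrative assumptions only; the appropriate granularity depends on the laws or regulations that apply.

```python
def generalize_birth_date(da_value: str) -> str:
    """Keep only the birth year from a DICOM date value (YYYYMMDD)."""
    if len(da_value) != 8 or not da_value.isdecimal():
        return ""  # redact anything that is not a well-formed date
    return da_value[:4] + "0101"


def generalize_age(as_value: str, band: int = 5, cap: int = 90) -> str:
    """Map a DICOM age string in years (e.g. "047Y") to the floor of its age band."""
    if len(as_value) != 4 or not as_value[:3].isdecimal() or as_value[3] != "Y":
        return ""  # redact malformed values and non-year units (days, weeks, months)
    years = min(int(as_value[:3]), cap)  # fold very high ages into the top band
    banded = (years // band) * band
    return f"{banded:03d}Y"
```

In this sketch, values that cannot be generalized safely are redacted rather than passed through, which trades some utility for a more conservative default.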
The file and folder names in the source data may also have patient identifiers; typical practice is to replace these names with newly generated names that are not based on any identifiers.
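One way to generate such names is sketched below, assuming random UUIDs are acceptable as file names and that the original folder hierarchy does not need to be kept. The function name `plan_renames` is hypothetical.

```python
import uuid
from pathlib import PurePosixPath


def plan_renames(source_paths: list[str]) -> dict[str, str]:
    """Map each source path to a random new name, keeping only the file suffix."""
    plan = {}
    for path in source_paths:
        suffix = PurePosixPath(path).suffix  # e.g. ".dcm"
        # uuid4 is random, so the new name carries nothing from the old path.
        plan[path] = f"{uuid.uuid4().hex}{suffix}"
    return plan
```

Note that the returned mapping links old names to new ones, so it would itself allow re-identification and should be either discarded or stored with the same care as the source data.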
The complicated part of de-identifying DICOM data: Images
The pixel data of an image also needs to be considered in the de-identification.
For example, an image may include a clear view of the patient’s face or body or longitudinal scans of a patient’s head, allowing identification via facial recognition software. The imaging device may also print identifying text such as patient names, ID numbers, and care provider information onto the image (this is part of what is known as burnt-in text).
Redacting or otherwise obscuring identifying burnt-in text is a good general practice when de-identifying image data. However, caution is advised because some non-identifying burnt-in text, like measurements and technical settings, may add value to the data if it is practical to retain.
For small datasets, this can be achieved easily enough through manual effort. At scale, however, machine learning-driven text detection algorithms are often necessary to flag and remove burnt-in text in an automated fashion.
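The redaction step that follows detection can be sketched as below. This assumes a separate text detector (manual review or a machine learning model, not shown here) has already produced bounding boxes and labeled each one as identifying or not. The `TextBox` class and `redact_burnt_in_text` function are hypothetical, and the image is modeled as a plain 2D list of pixel values for simplicity.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    top: int
    left: int
    bottom: int  # exclusive
    right: int   # exclusive
    identifying: bool  # label supplied by the (assumed) upstream detector


def redact_burnt_in_text(pixels, boxes, fill=0):
    """Return a copy of a 2D pixel grid with identifying boxes filled in."""
    redacted = [row[:] for row in pixels]  # leave the input untouched
    height = len(redacted)
    width = len(redacted[0]) if height else 0
    for box in boxes:
        if not box.identifying:
            continue  # keep measurements, technical settings, and similar text
        # Clamp the box to the image so out-of-range coordinates are harmless.
        for y in range(max(box.top, 0), min(box.bottom, height)):
            for x in range(max(box.left, 0), min(box.right, width)):
                redacted[y][x] = fill
    return redacted
```

Keeping the identifying flag on each box is what allows non-identifying burnt-in text to survive; a pipeline that cannot make that distinction reliably would likely need to redact every detected box instead.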
Similarly, in some cases, it may be necessary to employ image-defacing technologies to blur, redact, or otherwise delete data that reveals the details of a patient’s face or skull. As with burnt-in text removal, such technologies may be applied manually or automatically, depending on the desired scale. It’s worth noting, too, that in the case of, for example, brain MRIs, it may not be possible to transform the image data without severely compromising its utility.
Need help de-identifying DICOM data?
The safe sharing and reuse of DICOM data depends critically on scale and on understanding the entire journey of both the image and the header data, including how they are transformed and where and how they will ultimately be used.
To learn more about how your organization can safely and efficiently increase the utility of DICOM data in your care, download our DICOM Anonymization overview here.