Building a De-identification Pipeline to Support RWE

To use patient-level data for real-world evidence (RWE) initiatives, biopharmaceutical companies must either obtain patients’ consent to share their data for secondary purposes or de-identify the data. While most patients are willing to share their data for research, they also expect their privacy to be maintained. As a result, some form of de-identification is recommended even when consent is obtained.

Furthermore, because biopharmaceutical companies operate in a global marketplace, it is prudent for them to follow standards and guidelines on the use and sharing of healthcare data. A number of respected, internationally recognized groups have published such guidance in recent years. Industry associations like the Health Information Trust Alliance (HITRUST) and advisory bodies like the Institute of Medicine (IOM) in the U.S. and the Council of Canadian Academies have all endorsed the use of a risk-based methodology to de-identify healthcare data.

Creating the Pipeline

Establishing a data de-identification pipeline helps apply risk-based de-identification automatically and consistently to data sets that are continuously updated. Companies that have established a data warehouse for RWE purposes need to refresh its data regularly so that the information remains current. A pipeline gives analysts and researchers timely access to the most recent data in de-identified form, something that would be nearly impossible with manual processes. The pipeline pulls in data from the source, an EMR database for example, on a regular basis (e.g., monthly or quarterly). The automated de-identification engine then performs a series of steps to manipulate the database variables, reducing the risk of re-identification and protecting patients’ privacy.
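
To make the idea of manipulating database variables concrete, here is a minimal Python sketch of what one refresh might do to each record. The field names (name, phone, birth_date, postal_code), the choice of transformations, and the functions deidentify_record and refresh are all hypothetical and chosen for illustration; a real engine would select transformations based on a measured level of risk.

```python
from datetime import date

# Hypothetical field names; a real EMR extract will differ.
DIRECT_IDENTIFIERS = {"name", "phone"}


def deidentify_record(record: dict) -> dict:
    """Return a copy with direct identifiers dropped and
    quasi-identifiers generalized. The input is not modified."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Generalize the full date of birth to the year only.
    if "birth_date" in out:
        out["birth_year"] = out.pop("birth_date").year
    # Keep only the first three characters of the postal code.
    if "postal_code" in out:
        out["postal_code"] = out["postal_code"][:3]
    return out


def refresh(source_rows):
    """One scheduled pull (e.g., monthly): transform every source row."""
    return [deidentify_record(row) for row in source_rows]


if __name__ == "__main__":
    batch = [{
        "name": "Jane Doe",
        "phone": "555-0100",
        "birth_date": date(1980, 5, 17),
        "postal_code": "K1A0B1",
        "diagnosis": "E11",
    }]
    print(refresh(batch))
```

Dropping and generalizing are only two of the possible transformations; the point of the sketch is that they run the same way on every refresh, without manual intervention.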


As with any risk-based de-identification approach, the first step is to assess the risk to privacy by looking at who will have access to the data and what security and privacy controls are in place to protect it from unauthorized access. Next, we need to classify the variables in the data that contain keys to an individual’s identity. While a data warehouse may consist of hundreds of data tables with thousands of variables, only some of these are relevant from a privacy perspective. The final step is to map the data, which ensures that the de-identified data maintains the integrity of the original database. Once the de-identification engine has finished its work, the de-identified real-world data (RWD) can be exported to the data warehouse, where analyses can be run. Using the pipeline limits the risk of a successful re-identification attack on the data warehouse, since the warehouse only ever accepts data that has been de-identified.
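
The steps above can be sketched in Python under some loud assumptions. Risk is measured here as one divided by the size of the smallest group of records that share the same quasi-identifier values, which is one common way to quantify re-identification risk and not necessarily the metric any particular engine uses. "Mapping" is read as replacing patient IDs with consistent keyed pseudonyms so that records still link across tables. The function names, field names, and threshold value are hypothetical.

```python
import hashlib
import hmac
from collections import Counter


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Map a patient ID to a stable pseudonym. The same ID always yields
    the same pseudonym, so joins between tables keep working."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def max_reidentification_risk(rows, quasi_identifiers) -> float:
    """Return 1 / (size of the smallest group of rows sharing the same
    quasi-identifier values). Returns 0.0 for an empty data set."""
    if not rows:
        return 0.0
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return 1 / min(classes.values())


def ready_for_export(rows, quasi_identifiers, threshold: float) -> bool:
    """Gate in front of the warehouse: accept a batch only if its measured
    risk is at or below the threshold. The threshold would come from the
    first step (who has access, which controls are in place)."""
    return max_reidentification_risk(rows, quasi_identifiers) <= threshold


if __name__ == "__main__":
    qi = ["birth_year", "postal_code"]  # variables classified as quasi-identifiers
    rows = [
        {"birth_year": 1980, "postal_code": "K1A"},
        {"birth_year": 1980, "postal_code": "K1A"},
        {"birth_year": 1975, "postal_code": "M5V"},
    ]
    # The third record is unique on its quasi-identifiers, so the batch fails.
    print(ready_for_export(rows, qi, threshold=0.5))
```

Placing the check in front of the export step mirrors the point made above: a batch that fails the check never reaches the warehouse.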

Automation, Yes – But Also Compliance

Establishing a de-identification pipeline not only lets biopharmaceutical companies automate de-identification when refreshing the content of their data warehouse, it also helps them operate in compliance with privacy legislation. By engaging experts in the field of data de-identification, companies can implement a pipeline that follows legislation such as the HIPAA Privacy Rule. In the event of a data breach, being able to demonstrate practices that comply with the legislation gives an organization a defensible position.

Almost there – our last piece in our series on RWE: Final Thoughts on RWE.
