
Anonymization for non-structured content in EHR using Artificial Intelligence

Protecting the privacy of health data has been a central goal of healthcare information systems since the earliest stages of the digitalization of services. We should all be aware of how important it is to keep our data private. Of course, we have to share this information with our healthcare provider, but what happens when our data is shared with third parties? This is where the Health Insurance Portability and Accountability Act (HIPAA) comes into play.

Healthcare organizations must comply with HIPAA, but what does it protect? According to the U.S. Department of Health and Human Services (HHS), “The HIPAA Privacy Rule protects most ‘individually identifiable health information’ held or transmitted by a covered entity or its business associate, in any form or medium, whether electronic, on paper, or oral” [1]. This ‘individually identifiable health information’ is known as Protected Health Information, or PHI. It is extremely useful for improving healthcare outcomes, running clinical trials or discovering new drugs, among many other purposes. This means that the way this data is shared must be clearly defined so that individuals’ privacy is assured.

The process of preparing health information to be shared with third parties without revealing PHI is called ‘de-identification’ or ‘anonymization’. The HHS Office for Civil Rights (OCR) identifies two methods for de-identifying PHI handled by healthcare providers and business associates:

  • Expert determination
  • Safe Harbor

In this post we focus on the Safe Harbor method, because it lends itself to automation and therefore reduces the human intervention that anonymization requires. Under Safe Harbor, the following information cannot be present in a dataset that is going to be shared with third parties [1]:

  1. Names
  2. Geographic data
  3. Dates directly related to an individual (such as birth dates, discharge dates, admission dates, …)
  4. Telephone numbers
  5. Vehicle identifiers and serial numbers, including license plate numbers
  6. Device identifiers and serial numbers
  7. Fax numbers
  8. Email addresses
  9. Web Universal Resource Locators (URLs)
  10. Social security numbers
  11. Internet Protocol (IP) addresses
  12. Medical record numbers
  13. Biometric identifiers, including finger and voice prints
  14. Health plan beneficiary numbers
  15. Full-face photographs and any comparable images
  16. Account numbers
  17. Certificate/license numbers
  18. Any other unique identifying number, characteristic, or code, except as permitted for re-identification purposes

Electronic Health Records (EHRs) include both structured and unstructured data. Unstructured data is also known as “free text” and, under HIPAA, it cannot contain PHI either. In structured data it is usually easy to tell which fields contain PHI, but what happens when PHI appears in free text? Natural Language Processing (NLP) is practically a must here, especially if you do not want to burden a team of human experts with the work. Even if you are willing to devote human resources to the task, manual review is not viable when there are thousands of records.
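Some Safe Harbor identifiers have a fairly regular surface form (email addresses, URLs, IP addresses, U.S.-style telephone and social security numbers), so a first pass over free text can be sketched with regular expressions. The patterns, the `PATTERNS` table and the `redact` helper below are illustrative assumptions, not a complete or validated de-identifier. Names, locations and free-form dates have no such regular shape, which is where the NLP techniques discussed in the rest of this post come in.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive Safe Harbor check).
# Order matters: URLs are replaced first so later patterns never see them.
PATTERNS = {
    "URL": re.compile(r"\bhttps?://\S+"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\(?\b\d{3}\)?[ .-]\d{3}[ .-]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed category tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Pt called from (555) 123-4567; email jane.doe@example.org, SSN 123-45-6789."
print(redact(note))
# Pt called from [PHONE]; email [EMAIL], SSN [SSN].
```

A pass like this says nothing about the patient's name in the same note, and it will miss any number written in an unexpected format.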

Natural Language Processing applied to anonymization tasks

There has been a lot of research on applying NLP to de-identification, not only in the health domain but also in many others (e.g. legal, finance).

Most of the well-known NLP conferences and evaluation campaigns, such as TAC, TREC, CLEF or i2b2, have explored this application. Unfortunately, the algorithms that have produced the best results are supervised; that is, they require a correctly labeled training dataset, and the labeling must be done by humans. These venues do, however, provide such datasets, so participants can use them to train their systems.
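To make the supervised setting concrete, the sketch below turns character-offset annotations, of the kind a human annotator might produce, into per-token BIO labels, a common input format for sequence taggers. The whitespace tokenizer, the label names (`NAME`, `DATE`) and the sample sentence are assumptions made for illustration; they do not reproduce the format of any particular shared-task dataset.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label), end exclusive


def bio_labels(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Assign a BIO tag to each whitespace-delimited token."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"  # default: token is outside any PHI mention
        for start, end, label in spans:
            # A token belongs to a span if the two character ranges overlap.
            if match.start() < end and match.end() > start:
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged


# Hypothetical annotated sentence: "John Smith" is a NAME, the date is a DATE.
text = "Mr. John Smith was admitted on 03/12/2019 with chest pain."
spans = [(4, 14, "NAME"), (31, 41, "DATE")]

for token, tag in bio_labels(text, spans):
    print(f"{token}\t{tag}")
```

Every token in every training document needs a label like this, which is why the human annotation effort is the main cost of supervised approaches.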

Published research highlights three main approaches to the anonymization task:

Deep learning/machine learning (DL/ML) based systems

Deep learning has been applied to many fields, and anonymization is no exception. Combining different types of neural networks (RNN, bidirectional LSTM, CNN) with word embeddings (Word2Vec, GloVe) that capture word context has produced very good results [2]. Other machine learning methods, such as Conditional Random Fields (CRF) and Support Vector Machines (SVM), have also been tested. When applying these methods, care must be taken to avoid overfitting, that is, fitting the model too closely to the particularities of the collection used for training. In addition, generating word embeddings requires a large amount of input text and, according to [3], no publicly available word embeddings exist for the clinical domain, unlike for generic English. Finally, when the input data changes, particularly when new ways of referring to PHI appear, the model must be retrained with examples of those new cases, and even then there is no guarantee that the new model will ‘learn’ them.

Knowledge/rule-based systems

These approaches rely on lexical resources containing words that are likely to be part of a PHI mention. In addition, grammatical rules must be developed to capture the context of the PHI mention (these rules play a role similar to that of word embeddings). Of course, building these resources (the “knowledge” of the system) requires a fair amount of effort, and if several languages are involved, specific resources must be available for each one. In contrast to deep learning/machine learning approaches, when new ways of referring to PHI emerge, adapting the rules is easier than retraining a model. Furthermore, once a rule covers the new PHI mention, its correct identification can be assured.
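A minimal sketch of the knowledge-based idea, assuming a toy lexicon of first names and a single context rule triggered by honorifics. The lexicon entries, the trigger words and the `find_names` function are hypothetical and far smaller than real resources; this is not the implementation of any particular product.

```python
import re
from typing import List, Tuple

# Toy lexical resource (assumption): first names likely to appear in PHI.
FIRST_NAMES = {"john", "maria", "ahmed"}
# Toy context rule (assumption): a capitalized word after an honorific is a name.
HONORIFIC_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+([A-Z][a-z]+)")
WORD = re.compile(r"[A-Za-z]+")


def find_names(text: str) -> List[Tuple[int, int, str]]:
    """Return sorted (start, end, surface form) for each suspected name."""
    found = set()
    # Lexicon lookup: any word listed in the name dictionary.
    for m in WORD.finditer(text):
        if m.group().lower() in FIRST_NAMES:
            found.add((m.start(), m.end(), m.group()))
    # Context rule: catches names that are missing from the lexicon.
    for m in HONORIFIC_RULE.finditer(text):
        found.add((m.start(1), m.end(1), m.group(1)))
    return sorted(found)


text = "Dr. Okafor reviewed the chart; John agreed to the plan."
print(find_names(text))
# [(4, 10, 'Okafor'), (31, 35, 'John')]
```

In this sketch, handling a new way of referring to a person means adding one more trigger word or lexicon entry, and the effect of that change can be checked directly on the affected sentences.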

Hybrid systems

Hybrid methods try to combine the best of both worlds. Usually, the initial version of the model is built using deep learning or machine learning algorithms, and it is then fine-tuned by applying lexical or grammatical knowledge.

At Konplik Health, we have developed a knowledge-based system, thanks to the high-quality, broad-coverage lexical and grammatical resources provided by our sister company, MeaningCloud, which has been working in the field of NLP for decades. We are currently working on leveraging deep learning approaches to build a hybrid system.
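One simple way to picture a hybrid pipeline is as a post-processing step in which spans proposed by a statistical model are corrected with lexical knowledge. In the sketch below the model is a naive stand-in, and `ALLOW_LIST`, `MRN_RULE`, `model_spans` and `hybrid_spans` are hypothetical names; it illustrates the general idea only and is not a description of Konplik Health's system.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive

# Lexical knowledge (assumption): clinical eponyms that are not patient names.
ALLOW_LIST = {"parkinson", "alzheimer"}
# Rule (assumption): a 6-10 digit number after "MRN" is a record identifier.
MRN_RULE = re.compile(r"\bMRN[:\s]*(\d{6,10})\b")


def model_spans(text: str) -> List[Span]:
    """Stand-in for a trained tagger: naively tags capitalized words as names."""
    return [(m.start(), m.end(), "NAME")
            for m in re.finditer(r"\b[A-Z][a-z]+\b", text)
            if m.start() > 0]  # skip the sentence-initial word


def hybrid_spans(text: str) -> List[Span]:
    # 1. Start from the model output, dropping spans vetoed by the lexicon.
    spans = [s for s in model_spans(text)
             if text[s[0]:s[1]].lower() not in ALLOW_LIST]
    # 2. Add spans that only the rules can find.
    spans += [(m.start(1), m.end(1), "ID") for m in MRN_RULE.finditer(text)]
    return sorted(spans)


text = "Patient Maria Lopez, MRN 00482913, has Parkinson disease."
print([(text[a:b], label) for a, b, label in hybrid_spans(text)])
# [('Maria', 'NAME'), ('Lopez', 'NAME'), ('00482913', 'ID')]
```

Here the rules both remove a false positive ("Parkinson") and recover an identifier the stand-in model never proposes.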

How we build that hybrid system will be the subject of another post. And if you cannot wait, feel free to ask for a demo!


  1. Guidance on De-identification of Protected Health Information, U.S. Department of Health and Human Services, November 26, 2012
  2. A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Med Inform Decis Mak 19, 232 (2019).
  3. Ahmed, T., Aziz, M.M.A. & Mohammed, N. De-identification of electronic health record using neural network. Sci Rep 10, 18600 (2020).