
Rare diseases are those affecting no more than 1 in every 2,000 people or, according to the FDA, diseases or conditions that affect fewer than 200,000 people in the United States. Around 7,000 rare diseases have been identified. With such low prevalence spread across so many conditions, investment in research on new treatments and drugs is low. What is more, primary care physicians rarely see patients with these diseases, which makes them even more difficult to diagnose.

In this context, tools that facilitate access to information about orphan diseases are crucial, and Natural Language Processing (NLP) can help solve this problem.


Terminology resources for rare diseases

There are some resources that can be useful to tune and configure NLP algorithms:

National Organization for Rare Disorders (NORD)

NORD is an independent organization dedicated to the identification, treatment and cure of rare disorders. It provides patient services and educational programs, and it supports research. It includes more than 350 patient organizations. NORD shares information about rare diseases for patients, clinicians, researchers and patient organizations. On their website, you can access a comprehensive reference of known rare diseases, including descriptions, causes, affected populations, related disorders, diagnosis and therapies. The website also includes information about clinical trials, as well as resources for patients, caregivers and physicians.


Orphanet

Orphanet is a resource intended to improve knowledge about rare diseases. For this purpose, Orphanet defines a coding system for rare disease nomenclature, called ORPHAcode. It also provides a web portal dedicated to rare diseases and orphan drugs, funded by the European Commission. The nomenclature provided by Orphanet can be downloaded from their website.

Disease Ontology

The Disease Ontology Project is an initiative that aims to develop an ontology gathering concepts and terminology for biomedical data related to human diseases. The project is led by the Institute for Genome Sciences at the University of Maryland School of Medicine. The Disease Ontology is integrated with other well-known ontologies and terminology resources such as ICD-9, ICD-10, MeSH, SNOMED and UMLS. It is an open-source resource that can be downloaded from their website. This ontology goes beyond rare diseases, so it can also help monitor other diseases that are related to orphan diseases.
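
As a rough illustration of how a terminology like the ones above can configure an NLP component, the following Python sketch turns a small dictionary of disease names and synonyms into a case-insensitive matcher. The concept identifiers (DEMO:001 and so on), the synonym lists and the function names are hypothetical placeholders, not entries or APIs taken from NORD, Orphanet or the Disease Ontology; a real system would load names and codes from a downloaded nomenclature.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical mini-terminology: placeholder IDs mapped to names and synonyms.
# These IDs are NOT real ORPHAcodes or ontology identifiers.
TERMINOLOGY: Dict[str, List[str]] = {
    "DEMO:001": ["cystic fibrosis", "mucoviscidosis"],
    "DEMO:002": ["Huntington disease", "Huntington's disease", "Huntington chorea"],
    "DEMO:003": ["Marfan syndrome"],
}


def build_matcher(terminology: Dict[str, List[str]]) -> Tuple[Pattern[str], Dict[str, str]]:
    """Compile one regex covering every name, plus a lookup from name to concept ID."""
    lookup: Dict[str, str] = {}
    for concept_id, names in terminology.items():
        for name in names:
            lookup[name.lower()] = concept_id
    # Try longer names first so a long synonym wins over a shorter overlapping one.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def find_mentions(text: str, pattern: Pattern[str], lookup: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Return (concept ID, matched text, character offset) for each mention in text."""
    return [
        (lookup[m.group(1).lower()], m.group(1), m.start())
        for m in pattern.finditer(text)
    ]


pattern, lookup = build_matcher(TERMINOLOGY)
mentions = find_mentions(
    "Patients with Mucoviscidosis and Huntington's disease were enrolled.",
    pattern,
    lookup,
)
```

Plain dictionary matching of this kind misses spelling variants, abbreviations and negated mentions, so it is best seen as a starting point that the terminology resources make possible rather than a complete solution.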


Data sources for rare diseases

With all these resources it is possible to build an NLP system that automates the search for rare diseases and related terms across different kinds of content, depending on the purpose:

PubMed, ClinicalTrials and medical and/or clinical information about rare diseases

The number of scientific publications released each day keeps growing. This makes it increasingly difficult to stay aware of new research results regarding a particular disease, symptom or drug. With terminological resources about rare diseases, NLP algorithms can be configured to monitor new publications on PubMed or newly published clinical trials, generating alerts when new information about a given rare disease becomes available. These algorithms can also be configured to identify any treatments mentioned for a rare disease, or even the prescribed drugs that are reported.
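
One possible way to set up this kind of monitoring is sketched below in Python. It only builds a search URL for the NCBI E-utilities ESearch endpoint, restricted to records added in the last few days; it does not send the request. The function name, the default values and the choice to search titles and abstracts are assumptions made for illustration.

```python
from typing import Iterable
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_alert_query(
    disease_names: Iterable[str],
    days_back: int = 7,
    max_results: int = 20,
) -> str:
    """Build an ESearch URL for recent PubMed records mentioning any of the names."""
    # Quote each name and limit the match to the title and abstract fields.
    term = " OR ".join(f'"{name}"[Title/Abstract]' for name in disease_names)
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": days_back,   # only records from the last `days_back` days
        "datetype": "edat",     # filter on the date the record entered PubMed
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_alert_query(["cystic fibrosis", "mucoviscidosis"], days_back=7)
```

An alerting job could fetch this URL on a schedule and compare the returned record identifiers with those already seen. The response format and the usage limits should be checked against the NCBI documentation before relying on it.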

EHRs from hospitals and healthcare organizations

A similar approach could be applied to the EHRs held by hospitals or healthcare organizations. It may be easy to locate patients with a diagnosed rare disease, but locating patients with compatible symptoms for clinical studies is generally more difficult. An automatic process that identifies compatible symptoms, or other conditions relevant to finding patients who may suffer from a given rare disease, could be incredibly useful. Moreover, it is possible to monitor the treatments that have been prescribed in such cases.
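
The idea of finding patients with compatible symptoms can be illustrated with a minimal Python sketch that ranks patients by how much of a disease's symptom profile appears in their records. The symptom profile, the patient records, the 0.5 threshold and the function name are all made up for illustration; in practice the symptoms would first have to be extracted from EHR text and normalized, which this sketch assumes has already happened.

```python
from typing import Dict, List, Set, Tuple


def rank_candidates(
    profile: Set[str],
    patients: Dict[str, Set[str]],
    min_coverage: float = 0.5,
) -> List[Tuple[str, float, List[str]]]:
    """Rank patients by the fraction of the symptom profile found in their record."""
    ranked = []
    for patient_id, symptoms in patients.items():
        matched = profile & symptoms
        coverage = len(matched) / len(profile) if profile else 0.0
        if coverage >= min_coverage:
            ranked.append((patient_id, coverage, sorted(matched)))
    # Highest coverage first; ties are broken by patient ID for stable output.
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked


# Toy symptom profile and records, invented for this example only.
profile = {
    "chronic cough",
    "recurrent lung infections",
    "poor weight gain",
    "salty-tasting skin",
}
patients = {
    "p1": {"chronic cough", "poor weight gain", "fever"},
    "p2": {"headache"},
    "p3": {"chronic cough", "recurrent lung infections", "salty-tasting skin"},
}
candidates = rank_candidates(profile, patients)
```

Here the coverage score is a simple set overlap. Weighting symptoms by how specific they are to the disease would be a natural refinement, since common symptoms such as cough say little on their own.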

Specialized blogs and patient association websites

Undoubtedly, another relevant source of information is what patients say about their experiences and feelings while living with a rare disease. Which symptoms do they find the most unpleasant? Which limitations do they report? What experiences have they had with available drugs? Specialized blogs, patient association websites, social media and any other channels patients use to share their experiences are an extremely useful source of insight that can help improve treatments and, hence, quality of life for patients.

There is an interesting initiative around data analytics applied to rare diseases. The Rare Disease Cures Accelerator-Data and Analytics Platform is an FDA-funded project that provides an infrastructure to support the characterization of rare diseases, with the goal of accelerating therapy development. This platform is intended to collate information on rare diseases from clinical trials, registries and EHRs, among other data sources. The data is curated and normalized, providing an actionable source of data for drug development on rare diseases.


In summary, with all this information and the insights provided by NLP and AI algorithms, it is possible to build applications such as:

  • Decision support systems for physicians, helping in the diagnosis process for rare diseases
  • Patient search for clinical trials that target rare diseases
  • Information systems for patients, reporting on new treatments or findings related to a given rare disease
  • Market analysis of available treatments and patient experience for pharmaceutical companies, gathering information that is relevant for launching new products or treatments for a rare disease

Do you know about any other interesting data sources related to rare diseases? We would be happy to update this list of resources with your contributions!


Barriers to Rare Disease Diagnosis, Care and Treatment in the US: A 30-Year Comparative Analysis. NORD report, November 19, 2020.