The summer holidays are almost forgotten by now, but besides enjoying the beach, the mountains, our families and our friends, we also spent this summer experimenting with text analytics in biomedical applications.

You have probably heard about BioCreative (Critical Assessment of Information Extraction in Biology), a research conference on information extraction and text mining applied to the biomedical domain. The initiative has been running since 2004. As in previous editions, several ‘tasks’ are proposed for researchers and practitioners to explore. These tasks are related to real-world use cases, ranging from the identification of drug-protein interactions in biomedical literature, to the topic classification of scientific publications on COVID-19, to the identification of medication names in tweets, which is the one we chose to tackle.

The goal of the task is to recognize mentions of drugs appearing in tweets. To some it may seem an easy task, but in cases like the following:

it may often be unclear that ‘hydros’ means ‘hydrocodone’ or, as it appears in the training collection, that ‘night quil’ refers to ‘Nyquil’, a drug used to treat cold and flu symptoms.

Furthermore, the training collection provided is heavily unbalanced: only around 218 of the 89,000 tweets contain a drug mention.
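
To make the imbalance concrete, here is a small back-of-the-envelope sketch in Python. It uses the approximate figures quoted above (218 and 89,000) and compares a trivial system that never predicts a drug mention under two measures. The tweet-level F1 computed here is only an illustration we chose for this sketch, not necessarily the metric used by the task organizers.

```python
# Approximate figures from the training collection described above.
TOTAL_TWEETS = 89_000
POSITIVE_TWEETS = 218  # tweets containing at least one drug mention

positive_rate = POSITIVE_TWEETS / TOTAL_TWEETS

# A trivial system that labels every tweet as "no drug mention"
# is right on every negative tweet and wrong on every positive one.
baseline_accuracy = 1 - positive_rate


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Tweet-level F1; returns 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# The same trivial system finds no positives at all.
baseline_f1 = f1_score(tp=0, fp=0, fn=POSITIVE_TWEETS)

print(f"positive rate:     {positive_rate:.2%}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"baseline F1:       {baseline_f1:.2f}")
```

The do-nothing system scores close to 100% accuracy while finding no drug at all, which is why plain accuracy says very little on a collection like this one.
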

To take part in this initiative, Konplik Health has partnered with Hulat, the Human Language and Accessibility Technologies research group at Universidad Carlos III de Madrid. Together we have fine-tuned a range of models, building on several of the large pretrained models currently available.

Because it is very hard to obtain good labeling accuracy with such an unbalanced training collection, we increased the size of the collection and, in particular, the number of tweets that talk about drugs. Using information available to us from UMLS, we created a linguistic resource containing drug names, synonyms and active ingredients, as well as any information available in data sources like the CADEC dataset. CADEC contains social media posts about adverse drug events, which have been labeled manually by experts to highlight adverse events, drugs, symptoms and other relevant textual elements. The labeling is based on well-known standards like SNOMED and MedDRA. We used MeaningCloud’s Topic Extraction API to apply this new linguistic resource to tweets, so that we could filter out those which did not contain drug mentions. With this, we were ultimately able to supplement the training set with more tweets containing drug mentions.
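
The filtering idea can be sketched in a few lines of Python. Note that this is not MeaningCloud’s Topic Extraction API: it is a plain regular-expression stand-in, and the tiny `LEXICON`, the helper names and the sample tweets are all made up for illustration. A real resource built from UMLS and CADEC would be far larger.

```python
import re

# Hypothetical mini-lexicon: surface form (lowercase) -> canonical drug name.
LEXICON = {
    "hydrocodone": "hydrocodone",
    "hydros": "hydrocodone",
    "nyquil": "Nyquil",
    "night quil": "Nyquil",
}


def build_matcher(lexicon):
    """Compile one case-insensitive pattern covering every surface form."""
    # Longest forms first, so multi-word variants win over shorter ones.
    forms = sorted(lexicon, key=len, reverse=True)
    alternatives = "|".join(re.escape(form) for form in forms)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


MATCHER = build_matcher(LEXICON)


def find_drug_mentions(text):
    """Return (start, end, canonical_name) for each lexicon hit in text."""
    return [
        (m.start(), m.end(), LEXICON[m.group(0).lower()])
        for m in MATCHER.finditer(text)
    ]


# Made-up tweets, only to show the filtering step.
tweets = [
    "took two hydros and slept all day",
    "Night Quil knocked me out",
    "great weather at the beach today",
]
with_drugs = [t for t in tweets if find_drug_mentions(t)]
print(with_drugs)
```

Only the tweets with at least one lexicon hit survive, and those are the candidates for extending the training set. A simple matcher like this will also pick up false positives (words that look like drug names but are not), so some manual or model-based review is still needed.
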
With regard to algorithms, we have tested BiLSTM plus Conditional Random Fields models and BERT-based models, including BioBERT and BERTweet.
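
Sequence-labeling models of this kind are usually trained on one label per token. As an illustration, the sketch below turns character-offset annotations into BIO tags (`B-DRUG`, `I-DRUG`, `O`). The BIO scheme, the whitespace tokenization and the `bio_tags` helper are assumptions made for this example; they are not a description of our actual preprocessing, and BERT-style models would additionally split tokens into subwords.

```python
import re


def bio_tags(text, spans):
    """Tag whitespace tokens with BIO labels.

    spans: list of (start, end) character offsets of drug mentions,
    with end exclusive.
    """
    tokens, tags = [], []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end in spans:
            # The token overlaps the annotated span.
            if m.start() < end and m.end() > start:
                # The first token touching the span opens it; later ones continue it.
                tag = "B-DRUG" if m.start() <= start else "I-DRUG"
                break
        tokens.append(m.group(0))
        tags.append(tag)
    return tokens, tags


tokens, tags = bio_tags("Night Quil knocked me out", [(0, 10)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Here ‘Night’ opens the mention, ‘Quil’ continues it, and every other token is tagged `O`.
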

At this point, you may be asking yourself, “and how accurate were the results?” Well, while we have run our own evaluations, we will, of course, have to wait for the official one. Last Saturday, 18th September, we submitted the evaluation data labeled by our system. The organizers of this BioCreative task will let us know how accurate our output was. So, we will see! Fingers crossed.

We will keep you updated in future posts.

If you want to know more about Konplik’s experience extracting healthcare insights from Social Media, please click the button below.