BioGPT: The Innovative Generative Pre-trained Transformer (GPT) for Biomedical Text Understanding and Generation
A clear explanation of BioGPT and a practical use case of text generation in biomedical tasks
Introduction
The world of Large Language Models (LLMs) has expanded rapidly in recent years, from the Google AI team’s BERT to more complex models like OpenAI’s GPT-4. LLMs have proven effective at a wide variety of NLP tasks, including text summarization, question answering, text generation, document classification, translation, and many more.
However, this success has mostly been in the general domain. For example, the OpenAI team’s ChatGPT can generate poems, code, and scripts, answer queries, power chatbots, and even translate languages. The catch is that GPT-3 and GPT-4 are trained with a broad scope: their knowledge is gleaned from a wide variety of books, articles, websites, and code repositories on the internet. The drawback is that they can lack precision on specialized, domain-specific tasks.
Let’s get more practical!
Below is a practical comparison between the GPT-2 model hosted and prototyped on the Hugging Face platform and a BioGPT test run on my Google Colab.

It is evident that GPT-2 misinterpreted the context, not because it is a weak model but because it is not domain-specific. The BioGPT prediction for CP-673451 in the Google Colab notebook was highly accurate, while the GPT-2 model on Hugging Face missed the context and treated the term as a number.
BioGPT can be used for a variety of domain-specific tasks, such as drug-target relation extraction, medical document classification, summarization of medical texts and scientific studies, and question answering on biomedical literature.
What is BioGPT?
BioGPT is an acronym that stands for Biomedical Generative Pre-trained Transformer. BioGPT is a Microsoft Research-backed GPT model that has been pre-trained on a large corpus of biomedical text, specifically 15 million PubMed abstracts.
The aim is to perform six biomedical NLP tasks on benchmark datasets in the biomedical space, including question answering on PubMedQA, end-to-end relation extraction, document classification, and text generation. These tasks fall into two broad groups: text generation and text mining, which can also be called text understanding.
The datasets used for BioGPT were obtained from PubMed and cleaned to keep only papers with both abstracts and titles. BioGPT is built on the GPT-2 architecture, which serves as its backbone. The team did not use GPT-3 because of its enormous parameter count, up to 175 billion.
It achieved 78.2% accuracy on PubMedQA, which is a great feat in the biomedical research space.
Pretraining models
Disclaimer: this is not a digression from the topic, but rather an important concept in LLMs.
Pre-training, in simple terms, means training a model on a wide range of data that may or may not resemble the data for the specific task you eventually want the model to perform. What we call “fine-tuning” comes after pre-training.
As an analogy: a college student may take some optional courses purely for general knowledge, without ever being quizzed on them; that is pre-training. Fine-tuning comes afterwards, once you have that broad knowledge (and perhaps better study habits) and attend a class tailored to your course of study.
In LLMs, it is best to pre-train a model before applying it to downstream tasks. This is essential so that the model acquires broad knowledge first and can then answer questions or execute jobs that are highly niche-specific.
BioGPT, for example, was pre-trained on 15 million PubMed abstracts before being fine-tuned, with the labels reformulated into target sequences to improve output. A nice example is question answering, where the target sequence takes the form “the answer to the question is …”. BioGPT was also fine-tuned for document classification, end-to-end relation extraction, and text generation. The model was pre-trained on eight NVIDIA V100 GPUs with 64 gradient accumulation steps before being fine-tuned on a single NVIDIA V100 GPU with 32 accumulation steps.
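As a rough illustration of that target-sequence idea, here is a sketch of the kind of source/target pair a PubMedQA-style fine-tuning setup might use; the question, context, and exact wording below are made up for illustration, and the precise format in the BioGPT paper may differ.

```python
# Illustrative only: a PubMedQA-style source/target pair for fine-tuning,
# where the label ("yes"/"no"/"maybe") is wrapped in a natural-language
# target sequence instead of being used as a bare class label.
source = (
    "question: Does regular exercise reduce blood pressure? "
    "context: <PubMed abstract text here>"
)
target = "the answer to the question given the context is yes."
print(source, "->", target)
```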
Before BioGPT? The technology and some underlying concepts
You may have wondered while reading this post: what came before BioGPT? Is it the first of its kind? It most certainly is not.
Let’s dive in quickly!
Microsoft Research’s BioGPT paper set out to address the text generation limitations in biomedical NLP. Let’s take a quick look at some historical context. Prior to BioGPT, there were already many models trained on benchmark datasets for biomedical NLP problems; BioBERT, PubMedBERT, and ELECTRAMed are examples of these models.
BioBERT and PubMedBERT
BERT is an acronym for Bidirectional Encoder Representations from Transformers, developed by Google researchers in 2018. It understands a missing word in a phrase by looking at the words both before and after it. It builds on the Transformer architecture, which works with sequential data, and reads context in both directions, like a person pacing back and forth.
BioBERT and PubMedBERT use the BERT model to pre-train on large collections of biomedical papers and journals; PubMedBERT, however, is pre-trained on PubMed abstracts alone, with the goal of understanding abstracts or summaries of biomedical work. Both performed admirably in sentiment analysis, text summarization, and a variety of other tasks.
There was a significant constraint, however: text generation. This is where BioGPT enters the picture. It can complete the aforementioned tasks, outperform BioBERT and PubMedBERT, and also generate text. You’re probably wondering why.
BERT works on an underlying principle known as Masked Language Modeling (MLM): the model performs a “fill in the gap” task within a sentence. For example, given “The cat ____ in the bag”, the model predicts “is”. Because it always relies on context from both sides of the gap, it is not well suited to generating new text from left to right.
BioGPT, on the other hand, works as a Causal Language Model (CLM), a name borrowed from the idea of cause and effect. The model predicts the next word based only on the previous words, without looking at future words, and in doing so learns the meaning and flow of words in a sentence. It reads in one direction only, as opposed to BERT, which looks in both directions. For example, given “Malaria is _____”, the model predicts “a mosquito-borne disease.”
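To make the contrast concrete, here is a small sketch using two publicly available Hugging Face checkpoints, bert-base-uncased and gpt2 (chosen here purely for illustration): the masked model fills a blank in place, while the causal model continues the text.

```python
from transformers import pipeline, set_seed

set_seed(42)  # keep the sampled GPT-2 output reproducible

# Masked language modeling (BERT-style): predict the hidden word in place
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The cat [MASK] in the bag.")[0]["token_str"])  # e.g. "is"

# Causal language modeling (GPT-style): predict the next words left to right
generate = pipeline("text-generation", model="gpt2")
print(generate("Malaria is", max_length=20)[0]["generated_text"])
```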
Who wants to code? Let’s get to it!
Implementation of BioGPT with Hugging Face
The BioGPT model is hosted on Hugging Face and configured for easy use by anyone. I’ll walk you through the code.
Code Walkthrough
- Import the necessary libraries — the `transformers` package is first installed with `!pip`, then the `pipeline` function for text generation and `set_seed` (to keep randomness in check) are imported.
- Load the BioGPT model — the BioGPT tokenizer and language model are loaded from `transformers` by initializing the `BioGptTokenizer` and `BioGptForCausalLM` classes from the `microsoft/biogpt` checkpoint.
- Create the text generation pipeline — this is created with the `pipeline` function, and `set_seed(42)` is used to maintain reproducibility.
- Text-based input — the `generator` is called on the input text to generate output. The parameters include `input_text`, a variable you can edit, and `max_length`, which is set to 200 tokens.
In summary, the code builds a text-generation pipeline that can produce biomedical text from any given input prompt.
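Putting the steps above together, here is a minimal sketch of the script, using the public `microsoft/biogpt` checkpoint on Hugging Face; the prompt is just an editable example.

```python
# Install first if needed: !pip install transformers sacremoses
from transformers import BioGptForCausalLM, BioGptTokenizer, pipeline, set_seed

# Load the BioGPT tokenizer and causal language model from the Hugging Face Hub
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

# Create the text-generation pipeline and fix the seed for reproducibility
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
set_seed(42)

# Edit input_text to try your own biomedical prompt
input_text = "CP-673451 is"
outputs = generator(input_text, max_length=200, num_return_sequences=1, do_sample=True)
print(outputs[0]["generated_text"])
```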
Here’s my Google Colab Notebook for the demo.
Check out the BioGPT explanation and use case on HuggingFace
Demo

Conclusion
In conclusion, BioGPT is a cutting-edge achievement in NLP for biomedical tasks, with a larger variant built on the GPT-2 XL architecture (1.5B parameters) that demonstrates further improved accuracy.
BioGPT achieved F1 scores of 44.98%, 38.42%, and 40.76% on the BC5CDR, KD-DTI, and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA. Compared to GPT-2 Medium, BioBERT, PubMedBERT, and BioLinkBERT (base), it proved to be the better model.
Kudos to the Microsoft Research Team!