Visual NLP: Bridging the Gap Between Text and Images

Yatin Vij
Artificial Intelligence in Plain English
4 min read · Oct 2, 2023


This article explains the key concepts of Visual NLP in simple terms: what Visual NLP means, what its use cases are, how it can be used, and why it is the future of automated extraction pipelines.

Prerequisites:

  • Basics of NLP
  • Optical Character Recognition

Let’s start with the basic question: what is Visual NLP?

Visual NLP is a branch of NLP that combines the visual (spatial and layout) features of documents with their textual content. Most classical NLP problems deal with text data, which carries a lot of information but still lacks the visual cues that help us differentiate between what the text says and what it means.

Given that we are in the era of LLMs like ChatGPT, Bard, and Claude, which are multi-modal in nature (i.e., they accept both images and text as input), we can already see the potential of such systems.

One of the main reasons for moving towards Visual NLP is the need for Information Extraction (IE) on scanned documents. Today, IE is typically performed by converting the scanned documents into text with OCR and running NLP on top of that text.
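To make this concrete, here is a minimal sketch of that classical pipeline, assuming Tesseract and the small English spaCy model are installed; the file name scanned_doc.png is just a placeholder:

import pytesseract
import spacy
from PIL import Image

# OCR flattens the page into plain text, discarding all layout information
text = pytesseract.image_to_string(Image.open("scanned_doc.png"))

# Run a standard text-only NLP step (named entity recognition) on the result
nlp = spacy.load("en_core_web_sm")
for ent in nlp(text).ents:
    print(ent.text, ent.label_)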

Now, let’s look at the limitations of this approach:

  • OCR text recognition can fail due to ambiguous representations of the text (poor scan clarity, unusual fonts, etc.).
  • Images that might add value to the text are not used.
  • Tabular structure gets mangled when converted to plain text by OCR.

Adding visual data helps overcome these challenges and gives the model enriched input, letting it perform better on its tasks.

Some use cases of Visual NLP are:

  • Visual Document Classification (using text + spatial features + image)
  • Visual Question Answering
  • Layout Analysis: the process of analyzing the spatial arrangement of content in a document to understand its structure and meaning. This includes identifying the location of text, images, tables, and other elements, as well as the overall document structure, such as headings and subheadings.
  • Key Information Extraction: the process of extracting key information from documents and other visual content. This can include information such as names, dates, locations, and amounts.
  • Image Captioning: the task of generating a textual description of an image.
  • Table Detection: the task of identifying and locating tables in images and documents (a short sketch follows this list).
  • Table Structure Recognition: the task of identifying the logical and physical structure of a table. The logical structure of a table refers to the relationships between the different cells in the table, such as which cells are part of the same header row or column. The physical structure of a table refers to the layout of the table, such as the location of the borders and the spacing between the cells.
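As a quick illustration of the table detection use case, here is a hedged sketch using Microsoft’s Table Transformer checkpoint from Hugging Face; the image path and the 0.7 confidence threshold are illustrative choices, not values from our experiments:

import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

image = Image.open("page.png").convert("RGB")
processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw model outputs into scored bounding boxes for each detected table
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())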

Some examples of how to harness the power of Visual NLP

  1. Key Information Extraction from Scanned Receipts

The aim of this task is to extract the text of a number of key fields from a given receipt and save the extracted values for each receipt image in a JSON file. We fine-tuned the Donut model to extract entities such as company, address, date, and total from scanned receipts.

1.1. Sample Ground Truth

{
  "company": "BOOK TA .K (TAMAN DAYA) SDN BHD",
  "date": "25/12/2018",
  "address": "NO.53, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.",
  "total": "9.00"
}

The model was able to learn to extract these entities directly from images. We achieved roughly 60% accuracy under an exact-match criterion, counting a prediction as correct only when it matched the ground truth exactly.
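For reference, here is a minimal inference sketch following the Hugging Face Donut recipe. The public receipt-parsing checkpoint naver-clova-ix/donut-base-finetuned-cord-v2 stands in for our fine-tuned model, and the image path is a placeholder:

import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is prompted with a task token; the decoder then generates the parse
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # strip the task prompt token
print(processor.token2json(sequence))  # e.g. {"company": ..., "date": ..., "total": ...}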

  2. Visual Question Answering

The aim of this task is to generate answers to questions posed about an image. We fine-tuned the Donut model on this task as well.

Here is an example of what Visual QA does:

2.1. Sample Ground Truth

{
  "gt_parses": [
    { "question": "what is AGE?", "answer": "30" },
    { "question": "what is GENDER?", "answer": "Female" },
    { "question": "what is DATE?", "answer": "2023-01-07" }
  ]
}

The model is able to learn to generate answers directly from images.
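As a sketch, the same capability can be exercised through the Hugging Face document-question-answering pipeline, with the public DocVQA Donut checkpoint standing in for our fine-tuned model; the image path is a placeholder:

from transformers import pipeline

# Donut is OCR-free: the pipeline feeds the image straight to the model
vqa = pipeline("document-question-answering",
               model="naver-clova-ix/donut-base-finetuned-docvqa")
print(vqa(image="form.png", question="what is AGE?"))
# Expected output shape: [{"answer": "30"}]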

Some Visual NLP Models that can be used through HuggingFace

  • Donut
  • Pix2Struct
  • LayoutLM Models
  • DiT

In the above examples, we used Donut as a starting point to demonstrate the capability of Visual NLP systems, but you can use any of the models above.
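For instance, swapping in a LayoutLM-family model is a one-line change with the same pipeline. Note the practical difference: LayoutLM models consume OCR output (the pipeline runs Tesseract for you, so pytesseract must be installed), whereas Donut is OCR-free. The checkpoint below is a community document-QA model, and the question is illustrative:

from transformers import pipeline

# LayoutLM-based document QA; requires pytesseract for the built-in OCR step
qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
print(qa(image="invoice.png", question="What is the total amount?"))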

Future automated extraction pipelines with Visual NLP

The above examples demonstrate the clear potential of current Visual NLP systems and why this area of research will define the future of automated extraction pipelines.

Visual NLP is a rapidly evolving field with the potential to revolutionize the way we process and understand information. By combining visual and textual features, Visual NLP models can overcome the limitations of traditional NLP models and extract more accurate and comprehensive information from a wider range of sources, including scanned documents.

As the field of Visual NLP continues to mature, we can expect to see even more innovative and groundbreaking applications emerge. For example, Visual NLP could be used to develop new search engines that can understand and index both text and images, or to create new types of educational tools that help students learn more effectively by combining visual and textual information.
