N-grams and their use in Text Generation

Namrata Kapoor
Artificial Intelligence in Plain English
5 min read · Jan 10, 2021



In Natural Language Processing (NLP), it is important to make sense not only of words but of their context too.

N-grams are one way to capture that context and better understand the meaning of words, written or spoken.

For example, compare “I need to book a ticket to Australia.” with “I want to read a book of Shakespeare.”

Here the word “book” has different meanings altogether.

In the first sentence, it is used as a verb, an action, while in the second, it is used as a noun, an object.

We are able to understand this because, from childhood, we have learned to read a word in the context of the sentence it is used in, that is, from the words that come before or after it.

Now the question arises: how does a machine understand the context of a word in NLP?

The answer is N-grams: a machine learns context by looking at the words that appear before and after a given word.

Bigrams split a sentence into pairs of consecutive words to capture context.

If an article appears before “book”, a machine can infer that it is a noun; likewise, if “read” appears in the sentence, “book” clearly refers to something to read.

Trigrams split a sentence into sets of three consecutive words to capture context. The bigger the window, the harder it becomes to observe enough occurrences of each word sequence to learn its context, because the data gets sparser.


N-grams define the number of words one needs to look at together to determine context.

An example: “The bull charged at the crowd.” versus “The market is in a bull run.”

“Bull” in the first sentence is an animal, while in the second it refers to the share market. One can also use this to detect negative context around words, for example:

“The movie was not nice, awful really.”

The words around “nice” (“not nice”, “nice, awful”) cancel the meaning of the positive word “nice”.
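
A minimal sketch of this idea using NLTK’s bigrams (assuming nltk is installed and the punkt tokenizer data has been downloaded):

from nltk import word_tokenize
from nltk.util import bigrams

tokens = word_tokenize("The movie was not nice, awful really.")
print(list(bigrams(tokens)))
# [('The', 'movie'), ('movie', 'was'), ('was', 'not'), ('not', 'nice'), ...]

The ('not', 'nice') pair captures the negation that a single-word view would miss.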

We can also capture sarcasm this way. Sarcasm is an ironic or satirical remark tempered by humor.

An example of such a sentence:

“You are intelligent…not.” It actually means you are not intelligent.

A lot of other types of sarcasm, such as those conveyed through tone, offhand remarks, or rhetorical questions, are still being explored as we speak.

We use N-grams, groups of consecutive words, to capture the broader context of a text, which is then fed to machine learning models to infer its real meaning.

N-grams are a simple yet effective approach in Natural Language Processing for understanding the context of words.

Now let’s see how to implement it practically in Python.

Begin by installing the following packages:

!pip install -U pip
!pip install -U dill
!pip install -U nltk==3.4

Import the following libraries:

from nltk.util import pad_sequence
from nltk.util import bigrams
from nltk.util import ngrams
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import flatten

Example sentences to use:

text = [['I', 'need', 'to', 'book', 'a', 'ticket', 'to', 'Australia'], ['I', 'want', 'to', 'read', 'a', 'book', 'of', 'Shakespeare']]

Bigrams of the first sentence can be generated as follows:

list(bigrams(text[0]))

The output is:

[('I', 'need'),
('need', 'to'),
('to', 'book'),
('book', 'a'),
('a', 'ticket'),
('ticket', 'to'),
('to', 'Australia')]

Trigrams (N-grams with n=3) of the second sentence can be generated with:

list(ngrams(text[1], n=3))

The output is:

[('I', 'want', 'to'),
('want', 'to', 'read'),
('to', 'read', 'a'),
('read', 'a', 'book'),
('a', 'book', 'of'),
('book', 'of', 'Shakespeare')]
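
The other imports work in the same spirit. For instance, here is a quick sketch of pad_both_ends, which adds sentence-boundary markers, and everygrams, which yields all n-grams up to a maximum length:

list(pad_both_ends(text[1], n=2))
# ['<s>', 'I', 'want', 'to', 'read', 'a', 'book', 'of', 'Shakespeare', '</s>']

list(everygrams(text[0], max_len=2))
# all unigrams and bigrams of the first sentence (ordering varies by NLTK version)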

Now let’s try using N-grams for text generation. For this, import the IMDB movie review dataset from Kaggle and put it in a data frame:

import pandas as pd
df = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
df.head()

Import the tokenize library:

from nltk import word_tokenize, sent_tokenize

To know more about tokenization, see my blog. Now tokenize the review column:

corpus = list(df['review'].apply(word_tokenize))
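
A quick sanity check on the tokenized corpus (the exact tokens depend on the dataset, so this is just illustrative):

print(len(corpus))     # number of reviews; 50,000 for this dataset
print(corpus[0][:8])   # first few tokens of the first review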

and import the everygram pipeline helper:

from nltk.lm.preprocessing import padded_everygram_pipeline

Apply the N-gram pipeline to the corpus:

# Preprocess the tokenized text for 3-grams language modelling
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, corpus)
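
padded_everygram_pipeline returns two lazy iterators: per-sentence everygrams for training, and a flat stream of padded tokens for building the vocabulary. A minimal sketch on a toy corpus shows the shape of its output (don’t consume the real train_data like this, or nothing is left for model.fit):

toy_train, toy_vocab = padded_everygram_pipeline(2, [['a', 'b']])
for sent in toy_train:
    print(list(sent))   # all unigrams and bigrams of ['<s>', 'a', 'b', '</s>']
print(list(toy_vocab))  # ['<s>', 'a', 'b', '</s>']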

Now define a maximum likelihood model:

from nltk.lm import MLE
model = MLE(n) # Lets train a 3-grams model, previously we set n=3
model.fit(train_data, padded_sents)
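
Once fitted, the model can be inspected. The exact numbers depend on the corpus, so these calls are just illustrative:

print(len(model.vocab))               # vocabulary size
print(model.counts['movie'])          # how often 'movie' occurs in the corpus
print(model.score('movie', ['the']))  # estimated P('movie' | 'the')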

Generate sentences from the model using a helper that detokenizes the generated tokens:

from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.model`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

and then generate a sentence based on the text in the data frame:

generate_sent(model, num_words=20, random_seed=51)

This yields:

"Ocean's 12 . For example,as calling her a copy anywhere,and burning alive in a dilapidated"

Generate another sentence based on the text in the data frame:

generate_sent(model, num_words=10, random_seed=2)

which gives:

'waste your time with two wise scenes and was very'

Conclusion

Text generation through N-grams is a basic method; RNNs and LSTMs are used for more refined generation. The context captured by N-grams can carry a lot of noise, so stop words can be removed to make the text cleaner.

While N-grams have a lot of advantages, there are also some disadvantages. Let’s have a look at them.

Advantages of N-grams

1. They give insight at different levels (bigram, trigram, N-gram).
2. They are simple and conceptually easy to understand.

Disadvantages of N-grams

1. We may need to remove stop words to avoid noise in the results.
2. A raw count does not necessarily indicate a word’s importance to the text or entity.

I hope that after reading this blog, the concepts of N-grams and context in NLP are clearer. There are many other methods for deriving context, though this is the most basic one. We will talk more about them; stay tuned.

Thanks for reading!

Originally published at https://www.numpyninja.com on January 10, 2021.

