FFNN-LM: Feed Forward Neural Network Language Model

Fun, Fast, Nifty, NLP — Language Model!

MiKueen
Artificial Intelligence in Plain English



Welcome back, language model enthusiasts! In our last post, we covered the basics of what language models are and why they’re so useful. Today, we’re going to dive deeper and take a closer look at one specific type of language model: the Feedforward Neural Network Language Model (FFNN-LM).

What’s in a name?

First things first, let’s break down the name. “Feed Forward” refers to the way information flows through the model. Data enters the model at one end, is processed by multiple layers, and then exits at the other end with a prediction.

“Neural Network” refers to the architecture of the model, which is inspired by the structure of the human brain. It’s made up of interconnected “neurons” that work together to make predictions.

“Language Model” is self-explanatory — this model is designed to work with language data. For more information, refer to my last blog post on language models.

A feast for the eyes (and mind)

Now that you know what FFNN-LM stands for, let’s take a closer look at how it works.

At a high level, the model takes in a sequence of words (such as a sentence or a paragraph) and uses that to predict the next word in the sequence. The model is trained on a large dataset of text, such as books or articles, so that it can learn the patterns and relationships between words.

[Figure: FFNN-LM architecture. Input documents pass through an embedding layer, then three hidden layers, to produce the output. Source: Aylien]
  • Input layer: This layer takes the raw text and converts each word into a numerical form the network can process, typically an index into the vocabulary (or a one-hot vector).
  • Embedding layer: This layer is responsible for converting the input words into a numerical representation called an embedding. The embeddings are learned during training and are used to represent the meaning of words in a continuous vector space.
  • Hidden layer(s): This is where the network transforms the input into increasingly abstract features and, ultimately, a prediction. The number of hidden layers and the number of nodes per layer can vary; more layers and nodes give the model more representational capacity, though at the cost of more computation and a greater risk of overfitting.
  • Output layer: This layer generates the final prediction of the model: a probability distribution over the possible next words. The number of units in the output layer therefore equals the vocabulary size, and the activation function applied to it is typically softmax, which maps each unit's output to a probability between 0 and 1 (with all probabilities summing to 1).
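To show how these layers fit together, here is a minimal forward pass in plain NumPy. The toy vocabulary, layer sizes, and random weights are all made up for illustration; this is a sketch, not a trained model:

```python
import numpy as np

# Hypothetical toy vocabulary and sizes -- illustrative values only.
vocab = ["<pad>", "the", "cat", "sat", "on", "mat"]
V, d_emb, d_hid, n_context = len(vocab), 8, 16, 2

rng = np.random.default_rng(0)

# Parameters: embedding table, hidden layer, output layer.
E  = rng.normal(0, 0.1, (V, d_emb))                  # embedding layer
W1 = rng.normal(0, 0.1, (d_hid, n_context * d_emb))  # hidden-layer weights
b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.1, (V, d_hid))                  # output-layer weights
b2 = np.zeros(V)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_next(context_ids):
    """Forward pass: context word ids -> probabilities over the next word."""
    x = E[context_ids].reshape(-1)   # input + embedding layers (concatenated)
    h = sigmoid(W1 @ x + b1)         # hidden layer
    return softmax(W2 @ h + b2)      # output layer

probs = predict_next([vocab.index("the"), vocab.index("cat")])
```

With random weights the output is near-uniform; training would adjust E, W1, b1, W2, and b2 so that likely next words get higher probability.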

Let’s take a look at the architecture in more detail using the equations below.

Eq. 1

hₜ = σ(W₁xₜ + b₁)

Source: Author

In the above equation, hₜ is the hidden state of the neural network at time step t. It is computed by applying the sigmoid activation function σ(·) to a linear transformation of the input.

xₜ is the embedding of the input word at time step t.

W₁ and b₁ are the weight matrix and bias vector for the hidden layer, respectively. They are learned during the training process and are used to calculate the activation of the hidden units.

Eq. 2

ŷₜ = softmax(W₂hₜ + b₂)

Source: Author

ŷₜ is the predicted probability distribution over all the possible output words at time step t. It represents the model’s prediction of what the next word should be, given the current input and hidden state.

softmax(·) is the softmax activation function. It is applied to W₂hₜ + b₂ (the hidden state multiplied by the output weight matrix W₂, plus the bias vector b₂) and normalizes the result into a probability distribution over all the possible output words.

In essence, these two equations describe how the neural network computes the hidden state and the output given the input and the learned parameters. The hidden state captures the internal representation of the input, while the output is a prediction of the next word in the sequence.
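To make the two equations concrete, here is a tiny worked example in NumPy. All numbers are made up (a 2-dimensional embedding, 3 hidden units, and a 4-word vocabulary); the point is only to trace Eq. 1 and Eq. 2 step by step:

```python
import numpy as np

# Made-up embedding of the input word at time step t.
x_t = np.array([0.5, -1.0])

# Hidden-layer parameters (3 hidden units, 2-dim input).
W1 = np.array([[ 0.2, -0.1],
               [ 0.0,  0.3],
               [-0.4,  0.1]])
b1 = np.array([0.1, 0.0, -0.1])

# Eq. 1: hidden state = sigmoid of the linear transformation.
h_t = 1.0 / (1.0 + np.exp(-(W1 @ x_t + b1)))

# Output-layer parameters (4-word vocabulary, 3 hidden units).
W2 = np.array([[ 0.3,  0.1, -0.2],
               [-0.1,  0.4,  0.0],
               [ 0.2, -0.3,  0.1],
               [ 0.0,  0.2,  0.2]])
b2 = np.zeros(4)

# Eq. 2: softmax turns the scores into a probability distribution.
scores = W2 @ h_t + b2
y_hat = np.exp(scores) / np.exp(scores).sum()
```

Every entry of h_t lies between 0 and 1 (the range of the sigmoid), and the entries of ŷₜ sum to 1, as a probability distribution over the vocabulary must.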

Advantages of FFNN-LMs

FFNN-LMs are powerful models, and their architecture offers several advantages that make them ideal for NLP tasks. Some of the benefits include:

  1. Capacity to handle large amounts of data: FFNN-LMs can be easily trained on massive amounts of data, making them ideal for large NLP datasets.
  2. Good representation capacity: With multiple hidden layers, FFNN-LMs can learn complex representations of the input data, making them well-suited for tasks such as language generation and translation.
  3. End-to-end training: FFNN-LMs can be trained end-to-end, which means that they can be optimized for a specific task, such as sentiment analysis or language generation, without requiring any feature engineering.
  4. Parallelizability: The architecture of FFNN-LMs is highly parallelizable, which means that the computations can be easily divided across multiple GPUs or CPUs, allowing for faster training times.

Disadvantages of FFNN-LMs

Despite their many benefits, FFNN-LMs also have some limitations. Some of the drawbacks include:

  1. Computational Cost: Training FFNN-LMs can be computationally expensive, especially when the models are large and trained on large datasets.
  2. Overfitting: FFNN-LMs can be prone to overfitting, especially when trained on small datasets. Overfitting occurs when a model memorizes the training data instead of learning general patterns.
  3. Lack of interpretability: FFNN-LMs can be difficult to interpret, making it challenging to understand how the model is making its predictions.

Final Thoughts


FFNN-LMs are powerful models that have been widely used in NLP tasks, such as sentiment analysis, language translation, and text generation. They offer several advantages, including the ability to handle large amounts of data, good representation capacity, and end-to-end training. However, they also have some limitations, such as computational cost and overfitting, which can make them challenging to use in practice.

In conclusion, FFNN-LMs are an important tool in the NLP toolkit, and they will continue to play a critical role in the development of NLP applications. Whether you’re building a chatbot or transcribing speech, FFNN-LMs are a valuable resource to have in your arsenal.

In the next post, we’ll be discussing the implementation of FFNN-LM. And if you’re having fun reading these posts, be sure to hit that "like" button and follow for more linguistic adventures!
