ANN without an “E” — part I

Implementing a basic Artificial Neural Network using Logistic Regression Analysis

Mina Suntea
Artificial Intelligence in Plain English


Artificial Neural Networks (ANNs), simply known as Neural Networks (NNs), are a set of algorithms that mimic the processes of the human brain. At the most basic level a network consists of inputs, weights, a threshold and an output, and it learns from labelled examples, a form of supervised learning. This way the NN learns to improve its performance by itself.

The most popular and widely used learning algorithm for NNs is Logistic Regression, which solves classification problems (HUH? Regression for classification??? Yes, the naming of this algorithm is confusing, but the name was given for historical reasons). It is closely related to Linear Regression analysis, which estimates the coefficients of a linear equation to predict the value of a dependent variable from one or more independent variables.

In this article the construction of a basic Artificial Neural Network will be shown step by step. Let’s get started on the most basic level, the parameters of just one single node in the whole NN. Just like in the human brain, activation is needed to let the neurons make connections. To mimic this in the NN, an activation function is needed and the most common type is the Sigmoid activation function, which is defined as:
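In formula form, the Sigmoid function is:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
```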

This is implemented in Python in such a way that x can be either a single value or an array of values, which is easy to do with NumPy's built-in functions:
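A minimal sketch of such an implementation (the function name `sigmoid` is an assumption):

```python
import numpy as np

def sigmoid(x):
    # np.exp broadcasts, so x can be a single value or an
    # array of values; the result has the same shape as x.
    return 1.0 / (1.0 + np.exp(-x))
```

For example, `sigmoid(0)` returns 0.5, the midpoint of the Sigmoid curve.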

The model for Logistic Regression used here is the model described by Ethem Alpaydin in his book Introduction to Machine Learning third edition:

This model computes every dependent variable given a matrix X, but this equation can be simplified by expanding the matrix X with a column filled with ones, representing the bias inputs:

Now the model for a single perceptron with 2 input variables becomes:
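A reconstruction of that simplified model, with the bias absorbed as the constant input x₀ = 1:

```latex
y = \sigma(\theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2), \qquad x_0 = 1
```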

With this new simplified model, Y can simply be computed with a matrix multiplication in which all the single node outputs are stored.

To expand the X matrix by adding bias inputs, the function add_bias is written:
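A possible implementation of `add_bias`, assuming X holds one training sample per row:

```python
import numpy as np

def add_bias(X):
    # Prepend a column of ones to X; each one acts as the
    # constant bias input x0 = 1 for its sample.
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])
```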

Note that until now only single node outputs have been considered, but NNs are built from layers of these nodes. Each layer consists of one or more Sigmoid nodes. By using multiple nodes in the output layer it becomes possible to classify multiple classes, instead of only the binary classification a single node can do.

Each Sigmoid node needs its own set of weights, one per input. With one layer, the weights form a matrix of shape (n, m+1), where n is the number of nodes and m the number of inputs; the extra column holds the weight for the bias node. This matrix is named Θ, and each row contains the weights of one Sigmoid node.

A single layer network, defining Θ⁰, looks like this:

3 input nodes plus 1 bias node, and 3 output nodes, so it can distinguish 3 classes.

With the example given above, the matrix Θ⁰ will be of size 3×4. In a more complex network the Θ matrix is indexed like this:

This denotes the weight from node i in layer j to node k in layer j+1.

With the one_layer_init function the matrix Θ⁰ is created given the input and output size. It is important to randomly initialize the weights because partial derivatives will be used to find the optimal values and therefore every node should have a different derivative:
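A sketch of such an initialization (the small scale factor of 0.01 is an assumption):

```python
import numpy as np

def one_layer_init(input_size, output_size):
    # Shape (n, m+1): one row of weights per output node, one
    # column per input plus one for the bias. Random values give
    # every node a different partial derivative, breaking symmetry.
    return np.random.randn(output_size, input_size + 1) * 0.01
```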

Now it is time to implement the functions that compute the activations for a single layer: compute_layer and one_layer_output. The compute_layer function takes the matrices Aʲ and Θʲ and returns the activation matrix Aʲ⁺¹ of the next layer:

The one_layer_output function takes a matrix X of training samples and Θ⁰, and returns a matrix of outputs:
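These two functions might look like the following sketch (the `sigmoid` and `add_bias` helpers from earlier are repeated so the snippet stands alone):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def add_bias(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])

def compute_layer(A, Theta):
    # A: (M, m) activations, Theta: (n, m+1) weights.
    # Returns the (M, n) activation matrix of the next layer.
    return sigmoid(add_bias(A) @ Theta.T)

def one_layer_output(X, Theta0):
    # In a single-layer network the output is just one
    # application of compute_layer to the training samples.
    return compute_layer(X, Theta0)
```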

Gradient Descent

To obtain the best fit possible, the cost function must be minimized. With this more complex model, simply computing the partial derivatives and solving for zero won't be sufficient. Just as in Linear Regression, Gradient Descent is used to incrementally move all parameters against the gradient to minimize the cost. The cost function is defined as:

Where M is the number of training samples, K is the number of different outputs, and each current output is compared to its target output.

Implementing this in Python looks like this:
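One common choice of cost for Sigmoid outputs, assumed here, is the cross-entropy over all M samples and K outputs:

```python
import numpy as np

def cost(Y, T):
    # Y: (M, K) network outputs, T: (M, K) target outputs.
    # Cross-entropy compares each output to its target and
    # averages over the M training samples.
    M = Y.shape[0]
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y)) / M
```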

The Delta Rule

To minimize the cost function, the partial derivative with respect to each parameter must be taken. First δ is defined, which contains the partial derivative of the cost function with respect to a node's input. This is defined for each node i in each layer j:

For the output layer (where j is the last layer) δ is defined as:

To compute this, the function output_delta is written:

After computing the δ terms, the derivatives can be computed. The weight update is defined by the following equation:

When all the resulting values are summed over all M training samples, this gives the overall gradient of the cost function for all parameters in one layer; using it in every step is known as Batch Gradient Descent. The actual weight update is then just the standard Gradient Descent step:

Implementing this for the weight update of one layer gives:
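One way to sketch this update, assuming the δ matrix has shape (M, n) and a fixed learning rate α (`add_bias` is repeated so the snippet stands alone):

```python
import numpy as np

def add_bias(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])

def update_weights(Theta, delta, A, alpha):
    # A single matrix product sums delta_i * a_j over all M
    # samples at once: the batch gradient, shape (n, m+1).
    gradient = delta.T @ add_bias(A)
    # Standard Gradient Descent step against the gradient.
    return Theta - alpha * gradient
```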

With all of these additional functions combined, the single layer network can be trained. Given a training data set, the outputs are repeatedly computed for each training input; then the corresponding δ's are computed based on the training outputs, after which, lastly, the weights are updated. The more iterations, the better the outcome and thus the lower the error:
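Putting the pieces together, a self-contained training loop for the single-layer network could look like the sketch below; the function name `train`, the (output − target) delta, the learning rate and the iteration count are all assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def add_bias(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])

def train(X, T, iterations=5000, alpha=0.1):
    # T holds one-hot target rows; one output node per class.
    Theta = np.random.randn(T.shape[1], X.shape[1] + 1) * 0.01
    for _ in range(iterations):
        Y = sigmoid(add_bias(X) @ Theta.T)        # forward pass
        delta = Y - T                             # output deltas
        Theta -= alpha * (delta.T @ add_bias(X))  # batch update
    return Theta

# Toy example: learn the logical OR of two binary inputs,
# encoded as two classes (OR is false / OR is true).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
Theta = train(X, T)
predictions = np.argmax(sigmoid(add_bias(X) @ Theta.T), axis=1)
```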

Now the single layer network is complete, but we can do better! In the next article a more complex model will be implemented and discussed, one that optimizes better than this basic single layer network.
