A ‘Back To Basics’ Artificial Intelligence Project

Aaron Menezes
Artificial Intelligence in Plain English
7 min read · Jan 20, 2021


A lot of people enter artificial intelligence and machine learning thinking they need a myriad of elaborate packages. Well, I am going to dispel this illusion today by taking you through a Python project of mine, in which I built a K-Neighbors A.I. algorithm (or KNN, for short) from scratch, leaning only on a few basic, widely used libraries (with a little surprise at the end).

Before I take you through my project, I feel I should at least explain what the K-Neighbors algorithm is. It is essentially a supervised classification algorithm. That might sound a little complicated, but don’t despair. Let me take it step-by-step.

Machine learning algorithms are of two types: supervised and unsupervised. Supervised algorithms train the machine on data for which the correct outputs are provided, whereas unsupervised algorithms train using just the input data. Furthermore, a machine learning algorithm broadly performs one of two types of task: classification and regression. Classification means assigning observations to categories based on similarities in their traits, and regression means predicting a numeric value from the inputs.

KNN, being a supervised classification algorithm, uses already-classified data to train the machine, after which it can classify any observation you enter. So, without any further ado, let’s get right into it!

How does KNN work?

The working of KNN often gets too technical to understand but I am going to try my best to simplify it. It is a fair assumption that similar observations will belong to the same class. But here, the problem of measuring similarity arises. This can be resolved quite easily. Imagine a scatter plot.

Now, points that are close to each other will have similar values and distant points will be dissimilar. So, all we have to do is find the points closest to the new observation and, based on their classes, classify it. KNN can work with several distance metrics, but the most popular one is Euclidean distance. For two observations of n dimensions, say:

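$$p = (p_1, p_2, \dots, p_n), \quad q = (q_1, q_2, \dots, q_n)$$

Two n-dimensional points

then the Euclidean distance between the two observations is given by:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

Euclidean Distance formula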
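As a quick worked example, take two 2-dimensional points, p = (1, 2) and q = (4, 6). The numbers and the labels p and q are made up purely for illustration:

```latex
\begin{aligned}
d(p, q) &= \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \\
        &= \sqrt{(1 - 4)^2 + (2 - 6)^2} \\
        &= \sqrt{9 + 16} \\
        &= 5
\end{aligned}
```

The same recipe works for the four measurements of an iris flower; there are simply four squared differences under the root instead of two.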

KNN takes the classes of the ‘K’ closest points to the new observation and assigns it to the class that occurs most often among those ‘K’ points. Here, the ‘K’ value should be neither too small nor too large: a very small ‘K’ makes the prediction sensitive to noisy points, while a very large ‘K’ lets distant, dissimilar points outvote the close ones. The determination of this value is largely experimental, so you will have to tinker around and find the value that gives the highest accuracy.
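To make the voting step concrete, here is a minimal sketch. The list of neighbour classes is hypothetical (assumed to be already sorted from nearest to farthest), and the helper name vote is my own. It shows how the winning class can flip as ‘K’ changes:

```python
from collections import Counter

# Hypothetical classes of a new point's neighbours, sorted nearest first.
neighbour_classes = [1, 2, 2, 1, 1, 2, 2]

def vote(classes, k):
    """Return the most frequent class among the k nearest neighbours."""
    return Counter(classes[:k]).most_common(1)[0][0]

# Try a few odd values of k (odd values avoid ties between two classes).
for k in (1, 3, 5, 7):
    print('k = {} -> class {}'.format(k, vote(neighbour_classes, k)))
```

With this made-up list the prediction alternates between the two classes as ‘K’ grows, which is exactly why the value has to be tuned rather than guessed.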

Coding Time!

Add the necessary libraries. Apart from Python’s built-in math and collections modules, I only relied on NumPy, pandas and a couple of helpers from scikit-learn:

import numpy as np
import pandas as pd
from math import sqrt
from collections import Counter
from sklearn.metrics import accuracy_score
#imports all the necessary libraries

Now, for the dataset I have made use of the popular IRIS dataset available with scikit-learn itself. All you need to do is load it and manipulate it like so:

from sklearn import datasets
iris = datasets.load_iris()
cols = ['SEPAL_LEN', 'SEPAL_WID', 'PETAL_LEN', 'PETAL_WID']
iris_df = pd.DataFrame(iris.data, columns = cols)
y = iris.target
iris_df['CLASS'] = y
#imports and loads the iris dataset

The data set looks like this:

First ten observations in the data.

Class takes three values here, namely: ‘0’ for Iris setosa, ‘1’ for Iris versicolor and ‘2’ for Iris virginica. Graphically, each feature is distributed in this way:

Distribution of each feature with respect to class

The next thing I did was getting rid of any missing values, with this one line of code:

iris_df.dropna(inplace = True) #gets rid of missing observations

Calculating the Euclidean distance inline for all of these points would be quite taxing and would unnecessarily clutter the processing part of the code. So, to avoid this, I created a function that returns the Euclidean distance between the two observations passed to it.

def EuclideanDist(newpoint, point):
    dist = 0.0
    #zip pairs up the coordinates, so datasets of any dimension work
    #(it also avoids integer positions on label-indexed pandas rows)
    for a, b in zip(newpoint, point) :
        dist += (a - b) ** 2
    dist = sqrt(dist)
    return dist
#returns distance between newpoint and point
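A quick way to gain confidence in a hand-written distance function is to compare it with Python’s built-in math.dist (available from Python 3.8 onwards). The sketch below re-implements the same idea under the hypothetical name euclidean_dist so that it runs on its own, and the two sample points are made-up measurements:

```python
from math import dist, isclose, sqrt

def euclidean_dist(a, b):
    # Same idea as EuclideanDist: sum the squared differences, then take the root.
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two made-up observations with four features each.
a = [5.1, 3.5, 1.4, 0.2]
b = [6.2, 2.9, 4.3, 1.3]

hand_rolled = euclidean_dist(a, b)
built_in = dist(a, b)  # reference value from the standard library
print(hand_rolled, built_in)
print('Match:', isclose(hand_rolled, built_in))
```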

In keeping with the ‘Back-To-Basics’ motif of this project, I ditched the widely used train_test_split function and instead opted to manually shuffle and split the data set. As an added bonus, I decided to take the fraction of the data used for the training set from the user. Here’s the code:

l = int(len(y))
train_frac = input('Enter fraction of training set : ')
train_frac = float(train_frac)
split_index = int(l * train_frac) #row at which the training set ends
iris_df = iris_df.sample(frac = 1) #this shuffles the data
X_train = iris_df[0 : split_index]
X_test = iris_df[split_index : l] #divides the dataset
y_train = X_train['CLASS']
y_test = X_test['CLASS']
del X_train['CLASS']
del X_test['CLASS'] #deletes unnecessary columns
#forms training and test sets for X and y
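The same shuffle-and-split idea can be sketched with nothing but the standard library. This is an illustrative variant, not the code used in the project: the function name shuffle_split and its seed parameter are my own assumptions, with the seed added so that a split can be reproduced between runs.

```python
import random

def shuffle_split(rows, train_frac, seed=None):
    """Shuffle a copy of rows and cut it into (train, test) at train_frac."""
    if not 0.0 < train_frac < 1.0:
        raise ValueError('train_frac must be strictly between 0 and 1')
    shuffled = list(rows)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed gives a repeatable shuffle
    cut = int(len(shuffled) * train_frac)  # index where the training set ends
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))  # stand-in for ten observations
train, test = shuffle_split(rows, 0.8, seed=42)
print(len(train), len(test))
```

Because the shuffle in the project code is not seeded, every run produces a different split, which is one reason the accuracy can vary from run to run.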

For the ‘K’ value, instead of defining it in the program itself, I decided to take it from the user so that the user can set ‘K’ as per their assessment:

k = input('Enter k value : ') 
k = int(k)
#accepts 'k' from user
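Since ‘K’ comes straight from the keyboard, it can be worth checking it before use. The helper below is a hypothetical addition (the name parse_k and its n_train parameter are assumptions, not part of the project). It rejects values that are not whole numbers or that exceed the number of training points:

```python
def parse_k(raw, n_train):
    """Convert user input to an integer k and check that it is usable."""
    k = int(raw)  # raises ValueError for input such as 'abc' or '2.5'
    if not 1 <= k <= n_train:
        raise ValueError('k must be between 1 and {}'.format(n_train))
    return k

# Example: 120 training points, as with a 0.8 split of the 150 iris rows.
print(parse_k('12', 120))
```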

All this brings us to the most crucial part of the program, where we predict classes for the test set and calculate the accuracy score of the algorithm. Let me give you a short summary of this part. For every test point, I calculated its distance from each training point, sorted the distances in ascending order, took the nearest ‘K’ points and, based on their classes, made a prediction. I saved the predictions for all the test points in a list, which I then compared with the actual outputs in order to calculate the accuracy score:

predictions = [] #stores the final prediction for every test point
for i in X_test.index :
    distances = [] #stores distances for point and clears post-use
    index = [] #does the same with the indices of the points
    pred = [] #does the same with top 'k' predictions
    for j in X_train.index :
        dist = EuclideanDist(X_test.loc[i], X_train.loc[j])
        distances.append(dist)
        index.append(j)
    dis = pd.DataFrame({'DISTANCES' : distances , 'INDEX' : index})
    dis = dis.sort_values('DISTANCES', ascending = True)
    #sorts distances in ascending order, along with their index
    ind = dis['INDEX'] #stores indices
    ind = ind[0:k]
    for z in ind :
        pred.append(y_train.loc[z]) #stores top 'k' predictions
    most_occ = Counter(pred)
    o = most_occ.most_common()[0][0]
    predictions.append(o) #saves final predictions for points
print('Accuracy Score is : {}'.format(accuracy_score(y_test, predictions))) #gives accuracy of the algorithm
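For comparison, the same procedure can be condensed into a pandas-free sketch that also computes the accuracy by hand. This is not the project’s code: the names knn_predict and accuracy are hypothetical, and the two small clusters are made-up toy data chosen so the result is easy to verify by eye.

```python
from collections import Counter
from math import sqrt

def knn_predict(train_X, train_y, query, k):
    """Predict the class of query from its k nearest training points."""
    dists = []
    for point, label in zip(train_X, train_y):
        d = sqrt(sum((a - b) ** 2 for a, b in zip(query, point)))
        dists.append((d, label))
    dists.sort(key=lambda pair: pair[0])         # nearest points first
    nearest = [label for _, label in dists[:k]]  # classes of the k nearest
    return Counter(nearest).most_common(1)[0][0]

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual classes."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

# Toy data: two well-separated clusters, labelled 0 and 1.
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [[1.1, 1.0], [5.1, 5.0]]
test_y = [0, 1]

preds = [knn_predict(train_X, train_y, q, k=3) for q in test_X]
print('Predictions :', preds)
print('Accuracy    :', accuracy(test_y, preds))
```

Each test point sits inside one of the clusters, so its three nearest neighbours all share a class and both predictions come out right.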

After running this program with a training fraction of 0.8 and a ‘K’ value of 12, I got accuracies ranging from 0.9333 to 1.0:

93.333% accuracy score achieved
93.33% accuracy.
100% accuracy achieved!

Well, you must be thinking that the project is done. That is hardly the case!

After getting here, I still felt like something was missing and after some deliberation, I added a sort of user-interface part. Here, if the user desires, the algorithm can categorize a new observation based on the entire data, into one of the three classes after taking the inputs. Here it is:

a = input('Do you want to predict class for a new flower (1 - YES / 0 - NO) : ')
a = int(a)
if a == 1 :
    iris_class = iris_df['CLASS'] #uses the entire dataset
    del iris_df['CLASS']
    new_values = input('Enter sepal length, width and petal length, width (separated by comma) :').split(',') #accepts values
    for i in range(0,len(new_values)) :
        new_values[i] = float(new_values[i])
    #converts the strings to floats and stores them
    print('New Observation Values are : {}'.format(new_values))
    distances = []
    index = []
    predictions = []

    for i in iris_df.index :
        dist = EuclideanDist(new_values, iris_df.loc[i])
        distances.append(dist)
        index.append(i)
    data = pd.DataFrame({'DISTANCES' : distances, 'INDEX' : index})
    data = data.sort_values('DISTANCES')
    pts = data['INDEX']
    pts = pts[0:k]
    for i in pts :
        predictions.append(iris_class[i])
    print('The top {} predictions are : {}'.format(k,predictions))
    high_occ = Counter(predictions)
    predict = high_occ.most_common()[0][0]
    print('The flower belongs to class {}'.format(predict))
    #applies the same logic to the new observation
else :
    print('Thank you !') #prints exit message
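The comma-separated input is the most fragile part of this block, since a missing value or a stray letter will crash the program midway. One possible way to guard it is sketched below. The function parse_measurements and its expected parameter are hypothetical additions for illustration, not part of the project:

```python
def parse_measurements(raw, expected=4):
    """Turn '5.1, 3.5, 1.4, 0.2' into a list of floats, checking the count."""
    # float() tolerates surrounding spaces and raises ValueError on non-numbers.
    values = [float(part) for part in raw.split(',')]
    if len(values) != expected:
        raise ValueError('expected {} values, got {}'.format(expected, len(values)))
    return values

print(parse_measurements('5.1, 3.5, 1.4, 0.2'))
```

Wrapping the call in a try/except ValueError would let the program ask again instead of stopping.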

This was the output :

User-interface input
Final output.

With this, the project is completed.

The key to this project was breaking down the working and the implementation of the algorithm into smaller blocks and taking everything one step at a time. My intention was to implement a machine learning algorithm using as few packages as possible and, at the same time, to make it as independent as possible, in the sense that with just a few changes to the data set and its columns the program should be able to run on different data sets. Furthermore, I wanted to give the user the freedom to change the ‘K’ value and to predict the class of a new observation, all without having to dig around in the code to make changes.

I hope I was able to inspire you all to build such projects as well and do not forget to follow this account for updates about my upcoming projects.
