KNN Algorithm Implementation using Python

We are going to implement one of the machine learning algorithms to classify test data. I have used a supervised algorithm: labeled training data is provided, and predictions are then made for the test data, all in Python.

What is KNN Algorithm?

K Nearest Neighbor (KNN) is a supervised algorithm used for predictive analysis. It is most widely used for classification problems, although it supports regression too. KNN works on numeric data: it uses a distance measure, here the Euclidean distance formula, to find the training data points nearest to a new data point.

Advantages of KNN

  • Easy to implement, including for non-linear data
  • Can give good accuracy when enough training data is available
  • Supports both classification and regression

Disadvantages of KNN

  • Only numeric data can be processed
  • More training data is required for better results
  • Performance suffers when the volume of data is high, because every prediction compares the new point with every training record

Example scenario:

We have historical data of physical shopping amount, online shopping amount, and a customer type label: Platinum, Gold, or Silver. Now we need to predict the customer type for a new customer with a physical shopping amount of Rs 5,300 and an online shopping amount of Rs 10,800.

Physical shopping = 5300

Online Shopping = 10800

We apply the Euclidean distance formula shown below. It calculates the distance between two data points, which here are pairs of shopping amounts. In the spreadsheet formula, D4 and E4 are the cells holding the physical and online shopping amounts of one historical record:

SQRT((5300-D4)^2+(10800-E4)^2)
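To make the formula concrete, here is a small sketch of the same calculation in Python. The function name `shopping_distance` and the historical record (5000, 10400) are assumptions invented for illustration; they are not taken from the data set.

```python
import math

def shopping_distance(query, row):
    """Euclidean distance between two (physical, online) amount pairs."""
    return math.sqrt((query[0] - row[0]) ** 2 + (query[1] - row[1]) ** 2)

query = (5300, 10800)             # the new customer from the scenario
hypothetical_row = (5000, 10400)  # made-up record, not from the real data

# Differences are 300 and 400, so the distance is sqrt(90000 + 160000).
print(shopping_distance(query, hypothetical_row))  # 500.0
```

The smaller the value, the closer the historical record is to the new customer.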

The Euclidean distance is calculated with this formula for every historical record; you can find the calculated values in the 'Euclidean' column.

Now K plays the major role: it is the number of nearest neighbors, that is, the closest data points, to consider. Take K as 3.

Refer to the 'Min Euclidean Distance' column, which ranks the three smallest distances in ascending order: 1 is the nearest neighbor, 2 is the next, and 3 is the last. All three belong to the 'Gold' customer type, so the predicted customer type for the new values is 'Gold'.

You can change the shopping values inside the formula and find the nearest neighbors again to identify the customer type.

[Image: spreadsheet of the historical data with the 'Euclidean' and 'Min Euclidean Distance' columns]
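The whole spreadsheet exercise (compute every distance, keep the K smallest, take the majority label) can be sketched in a few lines of Python. The six records below are invented for illustration and are not the article's data; the function name `predict` is also an assumption.

```python
import math
from collections import Counter

# Hypothetical records: (physical amount, online amount, customer type).
records = [
    (5200, 10500, "Gold"),
    (5600, 11000, "Gold"),
    (4900, 10200, "Gold"),
    (9000, 20000, "Platinum"),
    (1500, 2500, "Silver"),
    (2000, 3000, "Silver"),
]

def predict(query, records, k=3):
    """Return the majority label among the k records closest to query."""
    ranked = sorted(
        records,
        key=lambda r: math.hypot(query[0] - r[0], query[1] - r[1]),
    )
    votes = Counter(label for _, _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(predict((5300, 10800), records))  # Gold
```

Changing the query amounts, for example to (1800, 2800), moves the nearest neighbors to a different group of records and changes the predicted type accordingly.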

Python Integration:

We can now move to the Python implementation, in which the training data is read from a flat file and used to predict the customer type of the test data. Historical data of 40 records is used to predict the customer type label.

Sample historical data in the flat file:

[Image: sample historical data in the flat file]
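Judging by how the loader below reads the file, each line holds two numeric amounts followed by the customer type, separated by commas. The sketch below parses a small sample in that layout; the three rows are invented for illustration and are not copied from the real file.

```python
import csv

# Hypothetical sample in the assumed layout:
# physical amount, online amount, customer type.
sample = """5200,10500,Gold
9000,20000,Platinum
1500,2500,Silver
"""

rows = []
for physical, online, label in csv.reader(sample.splitlines()):
    # csv.reader yields strings, so the two amounts are converted to floats.
    rows.append([float(physical), float(online), label])

print(rows[0])  # [5200.0, 10500.0, 'Gold']
```

The conversion to float matters because the distance calculation cannot subtract strings.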

# k-NN implementation using Python 2.7

import csv
import random
import math
import operator

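# Load the historical data from flat file

# trainingSet and testSet are lists supplied by the caller and filled in place
# (the mutable default arguments were removed to avoid shared state between calls).
def loadDataset(filename, split, trainingSet, testSet):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        # Note: the last row of the file is skipped.
        for x in range(len(dataset) - 1):
            # Convert the two shopping amounts from strings to floats.
            for y in range(2):
                dataset[x][y] = float(dataset[x][y])
            # Roughly `split` of the rows go to training, the rest to test.
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])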

# Find the Euclidean distance

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

# Find the K nearest points

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    # The last element of each row is the label, so it is excluded.
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # Sort by distance, smallest first, and keep the first k rows.
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

# Prediction Analysis based on the calculated Neighbours

def getResponse(neighbors):
    # Count one vote per neighbor for its label (the last element of the row).
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    # items() works on Python 2.7 and on Python 3, unlike iteritems().
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

# Calculate the Accuracy

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

# Calling function for data analysis to give prediction

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset(r'C:\Users\gopikannan\Desktop\KNN Algorithm\findata.txt', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()
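Because loadDataset decides each row's fate with random.random() < split, the train and test sets differ on every run. If you want a repeatable split while experimenting, one option is to seed the generator. The sketch below is an assumption of mine, not part of the original code: the function name `split_rows`, the seed value 42, and the stand-in rows are all hypothetical.

```python
import random

def split_rows(rows, split=0.67, seed=None):
    """Randomly assign each row to training or test, like loadDataset."""
    rng = random.Random(seed)  # private generator; seed=None leaves it unseeded
    training, test = [], []
    for row in rows:
        if rng.random() < split:
            training.append(row)
        else:
            test.append(row)
    return training, test

rows = list(range(39))  # stand-in for 39 parsed records
train_a, test_a = split_rows(rows, seed=42)
train_b, test_b = split_rows(rows, seed=42)

print(train_a == train_b)           # True: same seed, same split
print(len(train_a) + len(test_a))   # 39: every row lands in exactly one set
```

With a fixed seed the accuracy figure becomes reproducible, which makes it easier to compare different values of k.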

Console output from one run (the split is random, so the counts and the accuracy vary between runs):

Train set: 25
Test set: 14
> predicted='Silver', actual='Silver'
> predicted='Silver', actual='Silver'
> predicted='Gold', actual='Gold'
> predicted='Platinum', actual='Platinum'
> predicted='Silver', actual='Silver'
> predicted='Gold', actual='Platinum'
> predicted='Silver', actual='Silver'
> predicted='Silver', actual='Silver'
> predicted='Gold', actual='Gold'
> predicted='Gold', actual='Gold'
> predicted='Platinum', actual='Platinum'
> predicted='Gold', actual='Gold'
> predicted='Gold', actual='Gold'
> predicted='Gold', actual='Gold'
Accuracy: 92.85714285714286%

Process finished with exit code 0
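The accuracy figure can be checked by hand from the console output: 13 of the 14 test records were predicted correctly, and one Platinum customer was predicted as Gold. A short sketch that redoes the arithmetic, with the pairs copied from the output above:

```python
# (predicted, actual) pairs copied from the console output.
results = [
    ("Silver", "Silver"), ("Silver", "Silver"), ("Gold", "Gold"),
    ("Platinum", "Platinum"), ("Silver", "Silver"), ("Gold", "Platinum"),
    ("Silver", "Silver"), ("Silver", "Silver"), ("Gold", "Gold"),
    ("Gold", "Gold"), ("Platinum", "Platinum"), ("Gold", "Gold"),
    ("Gold", "Gold"), ("Gold", "Gold"),
]

correct = sum(1 for predicted, actual in results if predicted == actual)
accuracy = correct / float(len(results)) * 100.0

print("%d of %d correct, accuracy %.2f%%" % (correct, len(results), accuracy))
# 13 of 14 correct, accuracy 92.86%
```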

 

The above logic is available in my GitHub repo at the URL below:

https://github.com/gopekanna/machineLearning

Happy Learning!!!
