In this post we will implement one of the machine learning algorithms to classify test data. I have used a supervised algorithm: labelled training data is provided, and the model predicts the class of new test data. The walkthrough ends with a Python implementation.
What is the KNN Algorithm?
K Nearest Neighbor (KNN) is a supervised algorithm used for both classification and regression, though it is most widely used for classification problems. KNN works by computing the distance between data points, typically with the Euclidean distance formula, and finding the data points nearest to the one being classified.
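As a minimal sketch of that distance step, the snippet below computes the Euclidean distance between two 2-D points in Python; the coordinate values are arbitrary examples, not data from this article.

# Minimal sketch: Euclidean distance between two 2-D points.
# The coordinates below are arbitrary example values.
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((5300, 10800), (5000, 11000)))  # ~360.56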
Advantages of KNN
- Easy to implement, even for non-linear data
- Good accuracy on many problems with little tuning
- Supports both classification and regression
Disadvantages of KNN
- Features must be numeric before distances can be computed
- Needs a reasonably large amount of training data for good results
- Slow at prediction time when the data volume is high, since every query is compared against all training points
Example scenario:
We have historical data containing each customer's physical shopping amount, online shopping amount, and a customer type label: Platinum, Gold, or Silver. Now we need to predict the customer type for a new test record with a physical shopping amount of Rs 5,300 and an online shopping amount of Rs 10,800.
Physical shopping = 5300
Online Shopping = 10800
We apply the Euclidean distance formula stated below. Euclidean distance measures the straight-line distance between two data points, which here are pairs of shopping amounts. In the spreadsheet, D4 and E4 are the cells holding a historical record's physical and online shopping amounts:
SQRT((5300-D4)^2+(10800-E4)^2)
The Euclidean distance is calculated with the above formula for every historical record; you can see the calculated values in the "Euclidean" column.
Now K plays the major role of deciding how many nearest neighbors (data points) to consider; take the K value as 3.
Refer to the "Min Euclidean Distance" column, where the top three values are ranked in ascending order: 1 is the nearest neighbor, 2 is the next, and 3 is the last. All three belong to the "Gold" customer type, so the predicted customer type for the test record above is "Gold".
You can change the shopping amounts inside the formula and recompute the nearest neighbors to see how the predicted customer type changes.
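The same steps can be sketched in a few lines of Python. The historical records below are illustrative placeholders, not the article's actual data; only the procedure (compute distances, sort, take a majority vote of the top K) mirrors the spreadsheet walkthrough.

import math
from collections import Counter

# Hypothetical records: (physical amount, online amount, customer type).
# Placeholder values for illustration only.
history = [
    (5000, 11000, 'Gold'),
    (5500, 10500, 'Gold'),
    (5200, 10900, 'Gold'),
    (2000, 3000, 'Silver'),
    (9000, 20000, 'Platinum'),
]

test = (5300, 10800)  # physical = 5300, online = 10800
k = 3

# Euclidean distance from the test point to every historical record, sorted ascending
distances = sorted(
    (math.sqrt((test[0] - p) ** 2 + (test[1] - o) ** 2), label)
    for p, o, label in history
)

# Majority vote among the K nearest neighbors
votes = Counter(label for _, label in distances[:k])
print(votes.most_common(1)[0][0])  # -> 'Gold' for this placeholder data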
Python Integration:
Now we can move to the Python integration, in which the training data is read from a flat file and used to predict the class of the test data. I used 40 historical records to predict the last column, the customer type label.
Sample historical data in the flat file:
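The loading code below expects each line to contain the two numeric shopping amounts followed by the customer type label. As an illustration only (these are not the author's actual records), the rows look something like:

5300,10800,Gold
2100,3500,Silver
9200,21000,Platinum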
# k-NN implementation using Python 2.7
import csv
import random
import math
import operator

# Load the historical data from the flat file and split it into training and test sets
def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset) - 1):
            for y in range(2):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

# Find the Euclidean distance between two instances
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

# Find the K nearest data points
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

# Predict the class by a majority vote of the calculated neighbours
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

# Calculate the accuracy of the predictions
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

# Calling function for data analysis to give the prediction
def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset(r'C:\Users\gopikannan\Desktop\KNN Algorithm\findata.txt', split, trainingSet, testSet)
    print 'Train set: ' + repr(len(trainingSet))
    print 'Test set: ' + repr(len(testSet))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()
The console output is as follows:
Train set: 25
Test set: 14
> predicted=’Silver’, actual=’Silver’
> predicted=’Silver’, actual=’Silver’
> predicted=’Gold’, actual=’Gold’
> predicted=’Platinum’, actual=’Platinum’
> predicted=’Silver’, actual=’Silver’
> predicted=’Gold’, actual=’Platinum’
> predicted=’Silver’, actual=’Silver’
> predicted=’Silver’, actual=’Silver’
> predicted=’Gold’, actual=’Gold’
> predicted=’Gold’, actual=’Gold’
> predicted=’Platinum’, actual=’Platinum’
> predicted=’Gold’, actual=’Gold’
> predicted=’Gold’, actual=’Gold’
> predicted=’Gold’, actual=’Gold’
Accuracy: 92.85714285714286%
Process finished with exit code 0
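As an aside not covered in the original post, the same classification could also be done with scikit-learn's KNeighborsClassifier instead of the hand-written functions above. The sketch below assumes scikit-learn is installed and that findata.txt (two numeric columns plus a label, as described earlier) is in the working directory.

# Sketch only: KNN via scikit-learn (scikit-learn assumed installed).
# Assumes findata.txt rows look like: physical_amount,online_amount,label
import csv
import random
from sklearn.neighbors import KNeighborsClassifier

with open('findata.txt') as f:
    rows = [r for r in csv.reader(f) if r]

random.shuffle(rows)
split = int(len(rows) * 0.67)
train, test = rows[:split], rows[split:]

X_train = [[float(r[0]), float(r[1])] for r in train]
y_train = [r[-1] for r in train]
X_test = [[float(r[0]), float(r[1])] for r in test]
y_test = [r[-1] for r in test]

model = KNeighborsClassifier(n_neighbors=3)  # k = 3, as in the walkthrough
model.fit(X_train, y_train)
print('Accuracy: ' + str(model.score(X_test, y_test) * 100) + '%')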
The hand-written implementation above is available in my GitHub repo at the URL below:
https://github.com/gopekanna/machineLearning
Happy Learning!!!