We are going to implement one of the machine learning algorithms to predict the class of a test record. It is a supervised algorithm: labelled training data is provided, and the test data is processed for predictive analysis, with a Python implementation at the end.

**What is KNN Algorithm?**

K-Nearest Neighbors (KNN) is an algorithm used for classification and regression in predictive analysis. It is one of the most widely used supervised algorithms for classification problems, although it supports regression too. KNN works by computing the Euclidean distance between data points to find the nearest neighbors of a query point.

**Advantages of KNN**

- Easy to implement, even for non-linear data
- Good accuracy on many problems
- Supports both classification and regression

**Disadvantages of KNN**

- Only numeric features can be processed directly
- More training data is required for better results
- Performance degrades when the volume of data is high

**Example scenario:**

We have historical data with a physical shopping amount, an online shopping amount, and a customer type label such as Platinum, Gold, or Silver. Now we need to predict the customer type for a new test record with a physical shopping amount of Rs 5,300 and an online shopping amount of Rs 10,800.

Physical shopping = 5300

Online Shopping = 10800

We apply the Euclidean distance formula stated below; it measures the distance between two data points, which here are the shopping amounts (in the sheet, columns D and E hold each training row's physical and online amounts):

SQRT((5300-D4)^2+(10800-E4)^2)

The Euclidean value is now calculated for every training row using the above formula; you can refer to the calculated values in the 'Euclidean' column.
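The same distance calculation can be expressed in Python. The training amounts below (4,800 and 11,200) are made-up values standing in for one row of the sheet:

```python
import math

physical, online = 5300, 10800               # test data point
train_physical, train_online = 4800, 11200   # one hypothetical training row

# Euclidean distance between the test point and the training row
distance = math.sqrt((physical - train_physical) ** 2 + (online - train_online) ** 2)
print(round(distance, 2))  # -> 640.31
```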

Now K plays the major role of finding the nearest neighbors, i.e. the closest data points. Consider the K value as 3.

Refer to the 'Min Euclidean Distance' column, in which the top 3 values are ranked in ascending order: 1 is the nearest neighbor, 2 the next, and 3 the last. All three values belong to the 'Gold' customer type, so the predicted customer type for the above test record is 'Gold'.

You can change the shopping values inside the formula and find the nearest neighbors to identify the customer type.
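The ranking-and-voting step above can be sketched in Python; the distance values below are hypothetical, standing in for the sheet's 'Euclidean' column:

```python
from collections import Counter

# Hypothetical Euclidean distances already computed for each training row
distances = [(640.3, 'Gold'), (812.4, 'Gold'), (950.0, 'Gold'),
             (4100.7, 'Silver'), (9800.2, 'Platinum')]

k = 3
nearest = sorted(distances)[:k]  # the k smallest distances
vote = Counter(label for _, label in nearest).most_common(1)[0][0]
print(vote)  # -> Gold
```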

**Python Integration:**

Now we can move to the Python integration, in which the training data is loaded from a flat file and used for predictive analysis on the test data. I used historical data of 40 records, predicting the customer type label in the last column.

Sample historical data in the flat file:
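The file is expected to hold comma-separated rows of physical amount, online amount, and customer type. An illustrative fragment (made-up values, not the actual 40 records) might look like:

```
5200,10500,Gold
900,1200,Silver
15000,30000,Platinum
```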

```python
# k-NN implementation using Python 2.7
import csv
import random
import math
import operator

# Load the historical data from the flat file and split it
# randomly into a training set and a test set
def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset) - 1):
            for y in range(2):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

# Find the Euclidean distance between two instances
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

# Find the k nearest training points for a test instance
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

# Predict by majority vote over the neighbors' labels
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

# Calculate the accuracy on the test set
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

# Driver: load the data, generate predictions, report accuracy
def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset(
        r'C:\Users\gopikannan\Desktop\KNN Algorithm\findata.txt', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) +
              ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()
```

**Console output is as follows:**

```
Train set: 25
Test set: 14
> predicted='Silver', actual='Silver'
> predicted='Silver', actual='Silver'
> predicted='Gold', actual='Gold'
> predicted='Platinum', actual='Platinum'
> predicted='Silver', actual='Silver'
> predicted='Gold', actual='Platinum'
> predicted='Silver', actual='Silver'
> predicted='Silver', actual='Silver'
> predicted='Gold', actual='Gold'
> predicted='Gold', actual='Gold'
> predicted='Platinum', actual='Platinum'
> predicted='Gold', actual='Gold'
> predicted='Gold', actual='Gold'
> predicted='Gold', actual='Gold'
Accuracy: 92.85714285714286%

Process finished with exit code 0
```
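Since Python 2 has reached end of life, the same logic can also be sketched in Python 3 using only the standard library. The training rows below are made-up values, not the article's actual dataset:

```python
# Python 3 sketch of the same k-NN logic (requires Python 3.8+ for math.dist)
import math
from collections import Counter

def predict(training, test_point, k=3):
    # Sort training rows by Euclidean distance to the test point,
    # then take a majority vote over the k nearest labels
    nearest = sorted(training, key=lambda row: math.dist(row[:2], test_point))[:k]
    votes = Counter(row[-1] for row in nearest)
    return votes.most_common(1)[0][0]

# Illustrative training rows: (physical amount, online amount, customer type)
training = [
    (5200, 10500, 'Gold'), (5400, 11000, 'Gold'), (5100, 10900, 'Gold'),
    (900, 1200, 'Silver'), (1100, 800, 'Silver'),
    (15000, 30000, 'Platinum'), (14000, 28000, 'Platinum'),
]

print(predict(training, (5300, 10800)))  # -> Gold
```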

The above logic is available in my GitHub repo at the URL below:

https://github.com/gopekanna/machineLearning

Happy Learning!!!
