Luís Miguel Alves
Hello everyone, welcome to another one of my articles! Today I would like to talk a bit about KNN, or K-Nearest Neighbors. We will see how it works and how to use it. Shall we?
Every project used here is available on GitHub, and you can access it by clicking here. There you will find all the code this article is based on.
What will you see in this article?
- What KNN is and how it works;
- What KNeighborsClassifier is and how it works;
- Why and when to use KNN and KNeighborsClassifier;
- A practical KNN example with KNeighborsClassifier.
KNN, or K-Nearest Neighbors, is a Machine Learning algorithm that uses the similarity between data points to perform classification (supervised machine learning) or clustering (unsupervised machine learning).
With KNN we can take a set of data and draw patterns from it that classify or group our data.
- But how exactly does it work?
Let's think about its name first: K-Nearest Neighbors.
The concept of neighborhood relies on the idea that those close to us tend to be more like us.
From this notion, what KNN (very generically) does is create neighborhoods in our dataset; as we pass new samples to the model, it tells us which neighborhood each sample fits best.
Shall we look at an example?
Note that in this example we have 3 different groups (or clusters): blue, red and orange. Each of them represents a "neighborhood" whose "border" is delimited by the gray circle in the background.
This is the basis of KNN: grouping data into clusters. From there, other algorithms do the job of classifying or grouping.
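To make the idea of "closeness" concrete, here is a minimal sketch (the points are made up, purely for illustration) that computes the Euclidean distance from a new sample to every point of a tiny dataset and picks the nearest ones:
import numpy as np
# made-up 2D points and a new sample:
points = np.array([[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [8, 8]])
sample = np.array([1.5, 1.2])
# Euclidean distance from the sample to every point:
distances = np.linalg.norm(points - sample, axis=1)
# indices of the 3 closest points, i.e. the sample's "neighborhood":
nearest = distances.argsort()[:3]
print(nearest, distances[nearest])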
In today's article I want to focus on KNeighborsClassifier.
KNeighborsClassifier is a supervised learning algorithm that makes classifications based on a data point's neighbors. How? Let's look at one more example:
Suppose we have a sample X (in this case, the green dot). Placing it in the plane, we can choose a number N of neighbors we want to consider, and with that we can say whether our sample is classified as Blue or Red.
Note that if we choose a number N (also called "K") equal to 3, we are defining how many neighbors will be used to decide which class our sample most resembles.
In the example above, with N=3 we look for the 3 data points closest to our test sample (in green) and find 2 blue samples and only 1 red one, so we classify sample X as blue, since most of its neighbors belong to that class.
On the other hand, if we look for the 7 closest data points (N=7) we have a different situation: 4 red samples and 3 blue ones. The result of classifying sample X is now different from before: it is classified as red!
What I want to get across here is: changing the number K of neighbors can change the classification. In other words, it is a sensitive parameter and should be chosen with care!
Note: when I use the expression "closest data points", you can also read it as "closest neighbors".
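Here is a minimal sketch of that sensitivity, with made-up 2D points arranged so that the majority vote flips between K=3 and K=7 (coordinates and labels are purely illustrative):
from sklearn.neighbors import KNeighborsClassifier
# made-up points around the sample at (0, 0): 5 blue and 4 red
X = [[0.5, 0], [0, 0.5], [1.5, 0], [3, 3], [3, 4],   # blue
     [0.9, 0], [1.1, 0], [0, 1.2], [1.3, 0]]         # red
y = ['blue'] * 5 + ['red'] * 4
for k in (3, 7):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict([[0, 0]]))   # K=3 -> 'blue', K=7 -> 'red'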
When the data is not too scattered and has few outliers, KNeighborsClassifier shines.
KNN in general is a family of algorithms that works a bit differently from the rest. If we have numerical data and a small number of features (columns), KNeighborsClassifier tends to behave better.
When it comes to KNN itself, it is more often used for grouping tasks.
In the future I will bring an article about that and the use of KMeans, but today I will limit myself to showing the basis of KNN; this will serve us well later!
Remember again that all the code used in this article is available on GitHub; if you have any doubts, you can access it here and look at the code in more depth.
The dataset used is the ESRB game rating dataset, that is, the age rating of a game. You can access it here.
Let's start by importing the libraries we will use:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
Normally we would split our data into training and test sets, but in this case the dataset was already provided split into training and testing, so I will skip this step.
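Just for reference, a hedged sketch of what that loading step might look like, assuming hypothetical file names (train.csv and test.csv) and a label column called esrb_rating (check the actual CSVs in the repository):
# hypothetical file and column names -- adjust to the real ESRB CSVs:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
target = 'esrb_rating'   # assumed name of the label column
# if the files contain a non-numeric column (e.g. the game title), drop it from X as well
X_train, y_train = train.drop(columns=[target]), train[target]
X_test, y_test = test.drop(columns=[target]), test[target]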
Let’s create our model and train it with our data:
# instantiating the model:
model = KNeighborsClassifier(n_neighbors=3)

# training the model:
model.fit(X_train, y_train)
I started with a number N of neighbors equal to 3 just for testing (n_neighbors=3).
Note that there is not really a "training" phase here: fit simply stores the training data, and the actual work only happens when we make predictions and look at the results.
Now it's time to see how the model fared:
# making predictions with the created model:
pred = model.predict(X_test)

# measuring model accuracy:
accuracy = accuracy_score(y_test, pred)
print("Accuracy: {}".format(round(accuracy * 100, 4)))
print(classification_report(y_test, pred))
We can see that we got fairly good accuracy (80%), despite low scores in some classes. Let's try changing the number N to a value that can improve our performance; for that I'll use GridSearchCV to find the best value:
k_list = list(range(1, 61))
k_values = dict(n_neighbors=k_list)

grid = GridSearchCV(model, k_values, cv=6, scoring='accuracy')
grid.fit(pd.concat([X_train, X_test]), pd.concat([y_train, y_test]))
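A short sketch of how the result of the search can be read from the fitted grid and tested again (the exact code is in the notebook):
# best value of K found by the search and its cross-validated accuracy:
print(grid.best_params_, round(grid.best_score_ * 100, 4))
# re-train with the best K and evaluate on the test set:
best_model = KNeighborsClassifier(**grid.best_params_).fit(X_train, y_train)
print(classification_report(y_test, best_model.predict(X_test)))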
As a result, we got a value of N equal to 13; that is, with n_neighbors=13 we obtained the best accuracy, around 81.8%, slightly higher than before. When we test this we get the following:
In this new case we obtained even more balanced results for our model. Some treatment of the dataset could perhaps improve the model's overall accuracy, but that is not the purpose of today's article.
I’ve already talked about it in my article: “How to deal with Unbalanced Classes in Machine Learning (Precision, Recall, Oversampling and Undersampling)”
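As a side note, the imports at the top also include StratifiedKFold and cross_validate. Here is a hedged sketch (not part of the results above) of how they could be used to double-check the chosen K with stratified folds:
# optional sanity check: stratified 6-fold cross-validation with the chosen K
skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=42)
scores = cross_validate(KNeighborsClassifier(n_neighbors=13),
                        pd.concat([X_train, X_test]),
                        pd.concat([y_train, y_test]),
                        cv=skf, scoring='accuracy')
print(scores['test_score'].mean())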
In today's article we looked at a bit of KNN (the basis for many "neighborhood" algorithms) and took a deeper look at KNeighborsClassifier. We saw how it works from the inside, understood a little about the concept of "neighbors" and, above all, put what we learned into practice in a real classification project!
That's it for today. In the future I'll bring a kind of "part 2" of this article, discussing KMeans. Until next time!