Clustering with K-Means (Numerical and Textual Data)

Dec 15, 2020

Clustering is another machine learning approach to exploring data. K-Means is a basic unsupervised learning algorithm that can be used for simple clustering tasks. Here we will see how to build a basic clustering model with textual and numeric data.

OK, first we have to make sure Python is configured properly on your computer. I'm using Python 3.8.5 here. To check that Python is installed properly, run:

python --version
python3 --version

Depending on your setup, the first command may report a Python 2 installation while the second reports Python 3. This article uses Python 3. If you get a version number, you are good to go. If not, install Python and configure it properly before moving on. You can refer to the Python documentation for that.

Okay, now we will start by creating a project in your favourite IDE. I will be using PyCharm. You can use JupyterLab as well, but I find JupyterLab better suited to testing and experimenting. If you’re going into production, I’d suggest PyCharm.

Now install the following libraries in your project. In PyCharm you can go to Settings -> Project -> Python Interpreter and install the required libraries. If you’re using JupyterLab, you’ll have to pip install them. Note that the sklearn module is installed under the package name scikit-learn.

scikit-learn, pandas, matplotlib, scipy

We will be using the K-Means algorithm from the sklearn library. Now we will go ahead and create a Python file in the project and import the required parts of the libraries we installed.

from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

Now that we have imported the libraries, we will load the datasets we are going to use.

dataset1 = pd.read_csv('Dataset1.csv')
dataset2 = pd.read_csv('Dataset2.csv')

I’ve named my dataset files “Dataset1” and “Dataset2”. They are CSV files, a common file type for datasets. Dataset1 contains numeric data and Dataset2 contains textual data.

Between them, the datasets contain numeric (integer and float) and textual data. The numeric data can be passed directly to the algorithm, but the textual data cannot. We need to vectorize the textual data in order to use it with K-Means.
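
If you don't have similar files at hand, here is a minimal sketch that builds two small stand-in DataFrames you could experiment with. The text column names A, B and C match the ones used later in this article. The numeric column names x and y, and every value in both frames, are hypothetical and are not taken from my actual datasets.

```python
import pandas as pd

# Hypothetical stand-in for Dataset1.csv: numeric columns only.
dataset1 = pd.DataFrame({
    "x": [1, 2, 9, 10, 20, 21],
    "y": [1.5, 1.8, 8.7, 9.1, 19.5, 20.2],
})

# Hypothetical stand-in for Dataset2.csv: three text columns named A, B, C.
dataset2 = pd.DataFrame({
    "A": ["red apple", "green apple", "fast car",
          "slow car", "blue sky", "grey sky"],
    "B": ["sweet fruit", "sour fruit", "petrol engine",
          "diesel engine", "sunny weather", "rainy weather"],
    "C": ["food", "food", "vehicle", "vehicle", "nature", "nature"],
})

# Numeric columns show up as int/float dtypes, text columns do not.
print(dataset1.dtypes)
print(dataset2.dtypes)
```

Checking the dtypes like this is a quick way to tell which columns can go straight into K-Means and which need vectorizing first.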

Moving on, we will now look at how to vectorize the data. For this we will be using the “TfidfVectorizer”. We will create a vectorizer for English and vectorize the textual data.

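# stop_words='english' drops common English words such as "the" and "and"
vectorizer = TfidfVectorizer(stop_words='english')
# Each fit_transform call refits the vectorizer on that column's own vocabulary
X = vectorizer.fit_transform(dataset2['A'])
X2 = vectorizer.fit_transform(dataset2['B'])
X3 = vectorizer.fit_transform(dataset2['C'])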
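
To get a feel for what the vectorizer produces, here is a small sketch on three made-up sentences. The sample sentences are hypothetical and only there for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents standing in for one text column.
docs = ["the red apple", "the green apple", "a fast car"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each kept term to its column index in X.
print(sorted(vectorizer.vocabulary_))

# Shape is (number of documents, number of vocabulary terms).
print(X.shape)

# Dense view of the TF-IDF weights, fine for tiny examples like this one.
print(X.toarray().round(2))
```

Each row is one document and each column is one term, so words like "the" that are on the English stop word list never get a column at all.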

Now you have vectorized the data column by column. Next we have to bind the resulting matrices together so they can be fitted to the algorithm. Because each column produces its own set of features for the same rows, the matrices must be joined side by side with hstack.

d = hstack((X, X2, X3))
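
Here is a small sketch of what the hstack step does to the shapes. The two sample columns are hypothetical, and using a separate vectorizer per column is my choice for this sketch rather than what the code above does.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical text columns with the same number of rows.
col_a = ["red apple", "fast car", "blue sky"]
col_b = ["sweet fruit", "petrol engine", "sunny weather"]

# A separate vectorizer per column keeps each fitted vocabulary around.
X_a = TfidfVectorizer(stop_words="english").fit_transform(col_a)
X_b = TfidfVectorizer(stop_words="english").fit_transform(col_b)

# hstack joins side by side: row counts must match, column counts add up.
combined = hstack((X_a, X_b)).tocsr()

print(X_a.shape, X_b.shape, combined.shape)
```

The row count stays the same because every row still describes the same record. Only the number of feature columns grows.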

Since the data is ready, we can go ahead and create the model and start the training process.

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(d)

After execution, training is done. The data is grouped into 3 clusters, as we specified. If you want to see the clusters, you can simply print y_predicted, which holds one cluster label (0, 1 or 2) per row.
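
Printing the raw labels gets hard to read quickly, so one option is to attach them to the DataFrame. This is a minimal sketch on a hypothetical text column. The random_state and n_init arguments are my additions for repeatable runs and are not part of the code above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical text column; the values are made up for illustration.
df = pd.DataFrame({"A": ["red apple", "green apple", "fast car",
                         "slow car", "blue sky", "grey sky"]})

X = TfidfVectorizer(stop_words="english").fit_transform(df["A"])

# random_state fixes the initialisation so repeated runs give the same labels.
# n_init is set explicitly because its default differs between versions.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(X)

print(df)                             # each row next to its cluster label
print(df["cluster"].value_counts())   # how many rows landed in each cluster
```

Keep in mind that the label numbers themselves carry no meaning. Cluster 0 in one run can be cluster 2 in another unless the random state is fixed.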

We had to follow this vectorization process because we used textual data. If we use Dataset1, we can simply fit the data into the model directly, provided every column is numeric.

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(dataset1)
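
K-Means works on distances, so a numeric column with a much larger range can dominate the result. We imported MinMaxScaler earlier, and one possible way to use it is sketched below. The column names age and income and all of the values are hypothetical, and whether scaling helps depends on your own data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric data: "income" spans a far larger range than "age".
data = pd.DataFrame({
    "age": [22, 25, 47, 52, 46, 56],
    "income": [20000, 24000, 90000, 110000, 95000, 120000],
})

# Rescale every column to [0, 1] so no single column dominates the distances.
scaled = MinMaxScaler().fit_transform(data)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(scaled)

print(labels)               # one cluster label per row
print(km.cluster_centers_)  # cluster centres in the scaled space
```

Note that the cluster centres are reported in the scaled space, not in the original units.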

So, as you learnt here, with K-Means you can create clusters from numerical and textual data. The next thing to look into is how to use this model in production. That requires some other information, which you will be able to read in my following articles.

Kudos

Written by Jihan Jeeth