Clustering models with Python and R.

Data Science Clustering use case.


Python, together with its libraries, and R are powerful tools for solving cluster analysis tasks.

Cluster analysis, or simply clustering, is a branch of machine learning (ML) that deals mainly with unsupervised tasks and usually involves automatically discovering natural groupings in data.




Unlike supervised learning (such as predictive modeling), clustering algorithms only interpret the input data and find natural groups, or clusters, using the given features.

In other words, clustering techniques apply when there is no class to be predicted and the instances are instead to be divided into natural groups (clusters).

A cluster is often an area of density in the feature space where examples from the domain (observations or rows of data) are closer to their own cluster than to other clusters. The cluster may have a center (the centroid), which is a sample or a point in the feature space, and may have a boundary or extent.
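
The idea of a centroid and of points being "closer to their own cluster" can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy data, the starting centers and the helper name assign_to_nearest are hypothetical and do not come from any dataset used in this article.

```python
import numpy as np

# Hypothetical toy data: six observations with two features each.
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],   # points in the lower-left area
    [8.0, 8.0], [9.0, 8.5], [8.5, 9.0],   # points in the upper-right area
])

# Assumed starting centers, chosen by hand for this sketch.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])


def assign_to_nearest(points, centers):
    """Return, for each point, the index of the closest center (Euclidean)."""
    # distances[i, j] is the distance from point i to center j
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)


labels = assign_to_nearest(X, centroids)

# A centroid is the mean of the observations assigned to the cluster.
new_centroids = np.array(
    [X[labels == k].mean(axis=0) for k in range(len(centroids))]
)

print(labels)
print(new_centroids)
```

Repeating these two steps (assign each point to the nearest center, then move each center to the mean of its points) until nothing changes is the basic loop behind K-Means.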


K-Means Clustering model


K-Means Clustering in Python




# Importing the libraries
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Importing the dataset
dataset = pd.read_csv('my_dataset.csv')

# Selecting 2 features (the 3rd and 4th columns) for further visualisation
X = dataset.iloc[:, [2, 3]].values

# Elbow method to find the optimal number of clusters
# n_init is set explicitly because its default differs between scikit-learn versions
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', n_init = 10, random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Training the K-Means model on the dataset
kmeans = KMeans(n_clusters = 5, init = 'k-means++', n_init = 10, random_state = 42)
y_kmeans = kmeans.fit_predict(X)

# Visualising the clusters - 2 features
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for k, color in enumerate(colors):
    plt.scatter(X[y_kmeans == k, 0], X[y_kmeans == k, 1], s = 100, c = color, label = f'Cluster {k + 1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.legend()
plt.show()
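
The WCSS value collected for the elbow plot is the within-cluster sum of squares: the sum of squared distances from every observation to the center of its own cluster, which is what scikit-learn reports as inertia_. The sketch below computes it by hand to show why the curve drops as clusters are added. The toy data and the helper name wcss_of are assumptions made for illustration only.

```python
import numpy as np


def wcss_of(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    diffs = points - centers[labels]
    return float((diffs ** 2).sum())


# Hypothetical toy data with two obvious groups.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: every point belongs to one cluster centered at the overall mean.
labels_k1 = np.zeros(len(X), dtype=int)
centers_k1 = X.mean(axis=0, keepdims=True)

# k = 2: the left pair and the right pair, each centered at its own mean.
labels_k2 = np.array([0, 0, 1, 1])
centers_k2 = np.array([X[labels_k2 == k].mean(axis=0) for k in range(2)])

wcss_k1 = wcss_of(X, labels_k1, centers_k1)
wcss_k2 = wcss_of(X, labels_k2, centers_k2)
print(wcss_k1, wcss_k2)
```

Going from one cluster to two removes most of the WCSS here, while further splits could only remove the small remainder. That sharp bend is the "elbow" to look for in the plot.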


K-Means Clustering in R



# Importing the dataset
dataset = read.csv('my_dataset.csv')

# Selecting 2 features (the 3rd and 4th columns) for further visualisation
dataset = dataset[3:4]

# Elbow method to find the optimal number of clusters
# The seed is set because kmeans() picks its starting centers at random
set.seed(123)
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(dataset, i)$withinss)
plot(1:10,
     wcss,
     type = 'b',
     main = 'Elbow Method',
     xlab = 'Number of clusters',
     ylab = 'WCSS')

# Fitting K-Means to the dataset
# The result is stored as kmeans_model so that it does not mask the kmeans() function
set.seed(123)
kmeans_model = kmeans(x = dataset, centers = 5)
y_kmeans = kmeans_model$cluster

# Visualising the clusters
# install.packages('cluster')
library(cluster)
clusplot(dataset,
         y_kmeans,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = 'Clusters',
         xlab = 'Feature1',
         ylab = 'Feature2')

Hierarchical Clustering model


Hierarchical Clustering in Python



# Importing the libraries
import matplotlib.pyplot as plt
import pandas as pd
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

# Importing the dataset
dataset = pd.read_csv('my_dataset.csv')

# Selecting 2 features (the 3rd and 4th columns) for further visualisation
X = dataset.iloc[:, [2, 3]].values

# Dendrogram usage to find the optimal number of clusters
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Observation points')
plt.ylabel('Euclidean distances')
plt.show()

# Training the Hierarchical Clustering model on the dataset
# Ward linkage only works with Euclidean distances, which is the default,
# so the distance argument (renamed from affinity to metric in newer scikit-learn) is omitted
hc = AgglomerativeClustering(n_clusters = 5, linkage = 'ward')
y_hc = hc.fit_predict(X)

# Visualising the clusters - 2 features
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for k, color in enumerate(colors):
    plt.scatter(X[y_hc == k, 0], X[y_hc == k, 1], s = 100, c = color, label = f'Cluster {k + 1}')
plt.title('Clusters')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.legend()
plt.show()
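
To make the 'ward' setting less of a black box, here is a naive bottom-up sketch of what agglomerative clustering with Ward's criterion does: start with one cluster per observation and repeatedly merge the pair whose merge increases the within-cluster sum of squares the least. This is not how SciPy or scikit-learn implement it (they use far more efficient algorithms); the function name ward_agglomerative and the toy data are hypothetical.

```python
import numpy as np


def ward_agglomerative(points, n_clusters):
    """Naive bottom-up clustering that merges the pair with the smallest Ward cost."""
    # Start with every observation in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pa, pb = points[clusters[a]], points[clusters[b]]
                gap = pa.mean(axis=0) - pb.mean(axis=0)
                # Increase in within-cluster sum of squares if a and b are merged.
                cost = len(pa) * len(pb) / (len(pa) + len(pb)) * float(gap @ gap)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        # b is always greater than a, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(points), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


# Hypothetical toy data: two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = ward_agglomerative(X, n_clusters=2)
print(labels)
```

The sequence of merges and their costs is exactly what a dendrogram draws; stopping at a chosen number of clusters corresponds to cutting the tree at a given height.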

Hierarchical Clustering in R



# Importing the dataset
dataset = read.csv('my_dataset.csv')

# Selecting 2 features (the 3rd and 4th columns) for further visualisation
dataset = dataset[3:4]

# Dendrogram method to find the optimal number of clusters
# 'ward.D2' applies Ward's criterion to plain (unsquared) Euclidean distances;
# 'ward.D' would only do so if the distances were squared first
dendrogram = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D2')
plot(dendrogram,
     main = 'Dendrogram',
     xlab = 'Observation points',
     ylab = 'Euclidean distances')

# Fitting Hierarchical Clustering to the dataset
hc = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D2')
y_hc = cutree(hc, 5)

# Visualising the clusters
# install.packages('cluster')
library(cluster)
clusplot(dataset,
         y_hc,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = 'Clusters',
         xlab = 'Feature1',
         ylab = 'Feature2')

Clustering models tips and features.


Model: K-Means.
Pros: Simple to understand, easily adaptable, fast and efficient on both small and large datasets.
Cons: The number of clusters has to be chosen in advance.

Model: Hierarchical Clustering.
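Pros: The number of clusters does not have to be fixed before fitting and can be read off the dendrogram, which also gives a practical visualisation.
Cons: Not appropriate for large datasets, because the number of pairwise distances grows quadratically with the number of observations.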
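
A rough back-of-the-envelope calculation shows why hierarchical clustering struggles with large datasets. The sketch assumes that one distance is stored per unordered pair of observations as a 64-bit float (8 bytes); the function name condensed_distance_bytes is hypothetical, and real implementations may need more or less memory than this.

```python
def condensed_distance_bytes(n_observations, bytes_per_value=8):
    """Bytes needed to store one distance per unordered pair of observations."""
    n_pairs = n_observations * (n_observations - 1) // 2
    return n_pairs * bytes_per_value


for n in (1_000, 10_000, 100_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>7} observations -> {gib:.2f} GiB")
```

Under these assumptions, 1,000 observations need only about 4 MB, but 100,000 observations already need roughly 40 GB for the distances alone, whereas K-Means only has to keep the data and the cluster centers in memory.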



