Grouping the Unseen: Elevate Your Data Analysis with Python and R!

Data Science Clustering use case.


The Python programming language and its libraries, together with R, form a powerful toolset for solving cluster analysis tasks.

Cluster analysis, or simply clustering, is a branch of machine learning (ML) that deals with unsupervised tasks and usually involves automatically discovering natural groupings in data.

Clustering models in Python and R.

- Updated: 2024-07-26 by Andrey BRATUS, Senior Data Analyst.




    Unlike supervised learning (like predictive modeling), clustering algorithms only interpret the input data and find natural groups or clusters using given features.

    In other words, clustering techniques apply when there is no class to be predicted, but rather the instances are to be divided into natural groups, or clusters.

    A cluster is often an area of density in the feature space where examples from the domain (observations or rows of data) are closer to that cluster than to other clusters. A cluster may have a centre (the centroid), which is a sample or a point in feature space, and it may have a boundary or extent.
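
    As a small illustration (not part of the tutorial dataset), the sketch below generates synthetic data with scikit-learn's make_blobs so that such natural groups are visible in a two-dimensional feature space; the names and parameters (X_demo, three centres, 300 points) are arbitrary choices for this demo.

    
    #Illustration only: synthetic data with three natural groups in a 2D feature space
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs
    #300 observations drawn around 3 hypothetical centres
    X_demo, _ = make_blobs(n_samples = 300, centers = 3, cluster_std = 0.8, random_state = 0)
    plt.scatter(X_demo[:, 0], X_demo[:, 1], s = 20)
    plt.title('Three natural groups in a 2D feature space')
    plt.show()
    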


  1. K-Means Clustering model


  2. K-Means Clustering in Python



    
    #Importing the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    #Importing the dataset
    dataset = pd.read_csv('my_dataset.csv')
    #specifying 2 features for further visualisation
    X = dataset.iloc[:, [2, 3]].values
    #Elbow method to find the optimal number of clusters
    from sklearn.cluster import KMeans
    wcss = []
    for i in range(1, 11):
        kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 1)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)
    plt.plot(range(1, 11), wcss)
    plt.title('Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()
    #Training the K-Means model on the dataset
    kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
    y_kmeans = kmeans.fit_predict(X)
    #Visualising the clusters - 2 features
    plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
    plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
    plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
    plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
    plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
    plt.title('Clusters')
    plt.xlabel('Feature1')
    plt.ylabel('Feature2')
    plt.legend()
    plt.show()
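
    Because K-Means is a distance-based method, it can be worth standardising the selected features first when they are on very different scales. The following is a minimal sketch, assuming the same X array defined above; StandardScaler is one common choice and is not part of the original example.

    
    #Optional sketch: standardising the features before K-Means (assumes X from above)
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_scaled = sc.fit_transform(X)
    #same model settings as above, fitted on the scaled features
    kmeans_scaled = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
    y_kmeans_scaled = kmeans_scaled.fit_predict(X_scaled)
    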
    


    K-Means Clustering in R


    
    #Importing the dataset
    dataset = read.csv('my_dataset.csv')
    #specifying 2 features for further visualisation
    dataset = dataset[3:4]
    #Elbow method to find the optimal number of clusters
    set.seed(123)
    wcss = vector()
    for (i in 1:10) wcss[i] = sum(kmeans(dataset, i)$withinss)
    plot(1:10,
         wcss,
         type = 'b',
         main = paste('Elbow Method'),
         xlab = 'Clusters',
         ylab = 'WCSS')
    
    # Fitting K-Means to the dataset
    set.seed(123)
    km = kmeans(x = dataset, centers = 5)
    y_kmeans = km$cluster
    # Visualising the clusters
    # install.packages('cluster')
    library(cluster)
    clusplot(dataset,
             y_kmeans,
             lines = 0,
             shade = TRUE,
             color = TRUE,
             labels = 2,
             plotchar = FALSE,
             span = TRUE,
             main = paste('Clusters'),
             xlab = 'Feature1',
             ylab = 'Feature2')
    

  3. Hierarchical Clustering model


  4. Hierarchical Clustering in Python


    
    #Importing the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    #Importing the dataset
    dataset = pd.read_csv('my_dataset.csv')
    #specifying 2 features for further visualisation
    X = dataset.iloc[:, [2, 3]].values
    #Dendrogram usage to find the optimal number of clusters
    import scipy.cluster.hierarchy as sch
    dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
    plt.title('Dendrogram')
    plt.xlabel('Observation points')
    plt.ylabel('Euclidean distances')
    plt.show()
    #Training the Hierarchical Clustering model on the dataset
    from sklearn.cluster import AgglomerativeClustering
    #'affinity' is deprecated in newer scikit-learn; ward linkage uses the Euclidean metric by default
    hc = AgglomerativeClustering(n_clusters = 5, linkage = 'ward')
    y_hc = hc.fit_predict(X)
    #Visualising the clusters - 2 features
    plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
    plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
    plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
    plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
    plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
    plt.title('Clusters')
    plt.xlabel('Feature1')
    plt.ylabel('Feature2')
    plt.legend()
    plt.show()
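
    After inspecting the dendrogram, flat cluster labels can also be obtained directly from scipy's linkage matrix instead of scikit-learn. This is a minimal alternative sketch, assuming the same X array as above; note that fcluster returns labels 1 to 5 rather than 0 to 4.

    
    #Alternative sketch: cutting the dendrogram directly with scipy (assumes X from above)
    from scipy.cluster.hierarchy import linkage, fcluster
    Z = linkage(X, method = 'ward')
    #request at most 5 flat clusters from the hierarchy
    y_scipy = fcluster(Z, t = 5, criterion = 'maxclust')
    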
    

    Hierarchical Clustering in R


    
    #Importing the dataset
    dataset = read.csv('my_dataset.csv')
    #specifying 2 features for further visualisation
    dataset = dataset[3:4]
    #dendrogram method to find the optimal number of clusters
    dendrogram = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D')
    plot(dendrogram,
         main = paste('Dendrogram'),
         xlab = 'Observation points',
         ylab = 'Euclidean distances')
    
    # Fitting Hierarchical Clustering to the dataset
    hc = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D')
    y_hc = cutree(hc, 5)
    # Visualising the clusters
    # install.packages('cluster')
    library(cluster)
    clusplot(dataset,
             y_hc,
             lines = 0,
             shade = TRUE,
             color = TRUE,
             labels = 2,
             plotchar = FALSE,
             span = TRUE,
             main = paste('Clusters'),
             xlab = 'Feature1',
             ylab = 'Feature2')
    

  5. Clustering models tips and features.


    Model: K-Means.
    Pros: Simple to understand, easy to adapt, works well on both small and large datasets, and is fast and efficient.
    Cons: The number of clusters has to be chosen in advance (see the silhouette sketch after this list).

    Model: Hierarchical Clustering.
    Pros: The optimal number of clusters can be read from the model itself, with a practical visualisation via the dendrogram.
    Cons: Not appropriate for large datasets.
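
    One common way to soften the main drawback of K-Means is to compare candidate numbers of clusters with the silhouette score. The following is a minimal sketch in Python, assuming the same X array used in the K-Means example above; higher scores indicate better-separated clusters.

    
    #Sketch: comparing candidate k values with the silhouette score (assumes X from above)
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    for k in range(2, 11):
        labels = KMeans(n_clusters = k, init = 'k-means++', random_state = 42).fit_predict(X)
        print(k, silhouette_score(X, labels))
    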



