Predict with Precision: Master Classification Models with Python and R!

Data Science Classification use case.


Python and its libraries, together with the R language, are powerful tools for solving classification tasks.

Classification is one of the most fundamental concepts in data science.


Python Knowledge Base: Make coding great again.
- Updated: 2024-09-12 by Andrey BRATUS, Senior Data Analyst.



    Classification is a machine learning task in which an algorithm learns to assign a class label to examples from the problem domain. From a modeling perspective, it requires a training dataset with many examples of inputs and their known class labels. The model uses this dataset to learn how best to map input data to class labels. The training dataset must therefore be sufficiently representative of the problem and contain many examples of each class label.
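
As a minimal sketch of the "many examples of each class label" requirement, the snippet below counts how often each label occurs before any model is trained. The `class_distribution` helper and the label values are illustrative assumptions, not part of any library.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: fraction of examples} for a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels: 0 = negative class, 1 = positive class
y = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
distribution = class_distribution(y)
print(distribution)  # {0: 0.7, 1: 0.3}
```

A strongly skewed distribution is a hint that accuracy alone may be misleading and that more examples of the rare class may be needed.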



  1. Logistic Regression classification model


  2. Logistic Regression in Python


    
    #Importing the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    #Importing the dataset
    dataset = pd.read_csv('my_dataset.csv')
    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, -1].values
    #Splitting the dataset into the Training set and Test set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
    #Feature Scaling
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    #Training the Logistic Regression model on the Training set
    from sklearn.linear_model import LogisticRegression
    classifier = LogisticRegression(random_state = 0)
    classifier.fit(X_train, y_train)
    #Predicting a new result - 2 features (feature1_value and feature2_value are placeholders for your own numbers)
    print(classifier.predict(sc.transform([[feature1_value, feature2_value]])))
    #Predicting the Test set results
    y_pred = classifier.predict(X_test)
    print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
    #Displaying the Confusion Matrix
    from sklearn.metrics import confusion_matrix, accuracy_score
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    print(accuracy_score(y_test, y_pred))
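
The counts in a confusion matrix can be turned into other common metrics. The sketch below uses plain Python, assumes binary labels 0/1, and works on hypothetical `y_true` and `y_pred` lists. The helper names are made up for illustration. For binary labels, scikit-learn's `confusion_matrix` lays the counts out as `[[TN, FP], [FN, TP]]`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false negatives and positives for a binary problem."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tn, fp, fn, tp


def summarize(tn, fp, fn, tp):
    """Derive accuracy, precision and recall from the four counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical true labels and predictions
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = summarize(*counts)
print(counts)   # (3, 1, 1, 3)
print(metrics)  # {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```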
    


    Logistic Regression in R


    
    #Importing the dataset
    dataset = read.csv('my_dataset.csv')
    # Keeping columns 3 to 5 (two features followed by the Target column); adjust to your dataset
    dataset = dataset[3:5]
    # Splitting the dataset into the Training set and Test set in R
    # install.packages('caTools')
    library(caTools)
    set.seed(123)
    split = sample.split(dataset$Target, SplitRatio = 0.75)
    training_set = subset(dataset, split == TRUE)
    test_set = subset(dataset, split == FALSE)
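    # Feature Scaling (features only; the Target in column 3 must stay 0/1)
    scaled_train = scale(training_set[-3])
    training_set[-3] = scaled_train
    # Reusing the training mean and standard deviation for the test set
    test_set[-3] = scale(test_set[-3],
                         center = attr(scaled_train, 'scaled:center'),
                         scale = attr(scaled_train, 'scaled:scale'))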
    # Fitting Logistic Regression to the Training set in R
    classifier = glm(formula = Target ~ .,
                     family = binomial,
                     data = training_set)
    # Predicting the Test set results
    prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
    y_pred = ifelse(prob_pred > 0.5, 1, 0)
    # Making the Confusion Matrix in R
    cm = table(test_set[, 3], y_pred)
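
The R code above converts predicted probabilities into classes with a 0.5 cut-off. The following sketch shows that step in plain Python: a weighted sum of the features is passed through the sigmoid function and then thresholded. The weights, bias and feature values are hypothetical, not fitted to any dataset.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted class) for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability > threshold)


# Hypothetical coefficients for two scaled features
weights = [1.2, -0.7]
bias = 0.1

prob, label = predict_label([1.0, 0.5], weights, bias)
print(round(prob, 3), label)
```

Raising the threshold above 0.5 makes the model more conservative about predicting class 1, which trades recall for precision.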
    

  3. K-Nearest Neighbors (K-NN) classification model


  4. K-Nearest Neighbors (K-NN) in Python


    
    #The only difference in code from Logistic Regression (above) is Training the model step
    #Training the K-NN model on the Training set
    from sklearn.neighbors import KNeighborsClassifier
    classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    classifier.fit(X_train, y_train)
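
To make the parameters above concrete: with `metric = 'minkowski'` and `p = 2` the distance is the ordinary Euclidean distance, and `n_neighbors = 5` means the five closest training points vote on the class. The sketch below illustrates the idea in plain Python on made-up data. It is a brute-force illustration, not how scikit-learn implements the search.

```python
from collections import Counter


def minkowski_distance(a, b, p=2):
    """Minkowski distance; p=2 gives Euclidean, p=1 gives Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(X_train, y_train, query, k=5, p=2):
    """Predict the majority label among the k nearest training points."""
    distances = sorted(
        (minkowski_distance(row, query, p), label)
        for row, label in zip(X_train, y_train)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated groups
X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.1, 1.3],
           [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
y_train = [0, 0, 0, 0, 1, 1, 1]

print(knn_predict(X_train, y_train, [1.0, 1.2], k=5))  # 0
print(knn_predict(X_train, y_train, [3.0, 3.0], k=5))  # 1
```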
    

    K-Nearest Neighbors (K-NN) in R


    
    #The differences in code from Logistic Regression (above) are Training and predicting the model steps
    #Fitting K-NN to the Training set and Predicting the Test set results
    # install.packages('class')
    library(class)
    y_pred = knn(train = training_set[, -3],
         test = test_set[, -3],
         cl = training_set[, 3],
         k = 5,
         prob = TRUE)
    


  5. Support Vector Machine (SVM) classification model


  6. Support Vector Machine (SVM) in Python


    
    #The only difference in code from Logistic Regression (above) is Training the model step
    #Training the SVM model on the Training set
    from sklearn.svm import SVC
    classifier = SVC(kernel = 'linear', random_state = 0)
    classifier.fit(X_train, y_train)
    

    Support Vector Machine (SVM) in R


    
    #The only difference in code from Logistic Regression (above) is Training the model step
    #Fitting SVM to the Training set
    # install.packages('e1071')
    library(e1071)
    classifier = svm(formula = Target ~ .,
             data = training_set,
             type = 'C-classification',
             kernel = 'linear')
    

  7. Kernel SVM classification model


  8. Kernel SVM in Python


    
    #The only difference in code from Logistic Regression (above) is Training the model step
    #Training the Kernel SVM model on the Training set
    from sklearn.svm import SVC
    classifier = SVC(kernel = 'rbf', random_state = 0)
    classifier.fit(X_train, y_train)
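
The difference between `kernel = 'linear'` and `kernel = 'rbf'` lies in how similarity between two points is measured. As a rough sketch, the two kernel functions can be written in plain Python as below. The `gamma` value is an arbitrary assumption for illustration; scikit-learn chooses its own default.

```python
import math


def linear_kernel(a, b):
    """Dot product: similarity used by a linear SVM."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, gamma=1.0):
    """Radial basis function: similarity decays with squared distance."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


point = [1.0, 2.0]
near = [1.1, 2.1]
far = [4.0, 6.0]

print(rbf_kernel(point, point))  # 1.0, identical points are maximally similar
print(rbf_kernel(point, near) > rbf_kernel(point, far))  # True
```

Because the RBF similarity depends only on distance, the resulting decision boundary can curve around clusters, which is what makes Kernel SVM suitable for nonlinear problems.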
    

    Kernel SVM in R


    
    #The only difference in code from Logistic Regression (above) is Training the model step
    #Fitting Kernel SVM to the Training set
    # install.packages('e1071')
    library(e1071)
    classifier = svm(formula = Target ~ .,
             data = training_set,
             type = 'C-classification',
             kernel = 'radial')
    

  9. Naive Bayes classification model


  10. Naive Bayes in Python


    
    #The only difference in code from Logistic Regression (above) is Training the model step
    #Training the Naive Bayes model on the Training set
    from sklearn.naive_bayes import GaussianNB
    classifier = GaussianNB()
    classifier.fit(X_train, y_train)
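
For intuition about what a Gaussian Naive Bayes classifier estimates, here is a simplified sketch in plain Python: per class it stores a prior, plus a mean and variance for each feature, and it predicts the class with the highest log-probability. The function names and data are hypothetical, and the small constant added to the variance is a simplification of the smoothing that real libraries apply.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate prior, feature means and feature variances for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small constant avoids division by zero for constant features
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log prior plus log likelihood."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for value, mean, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data with two features
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]))  # 0
print(predict_gaussian_nb(model, [3.1, 0.5]))  # 1
```

Each feature contributes its own term to the score, which is the "naive" independence assumption in code form.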
    

    Naive Bayes in R


    
    #The differences in code from Logistic Regression (above) are Encoding the target and Training the model steps
    #Encoding the target as factor (do this before splitting the dataset)
    dataset$Target = factor(dataset$Target, levels = c(0, 1))
    #Fitting Naive Bayes to the Training set
    # install.packages('e1071')
    library(e1071)
    classifier = naiveBayes(x = training_set[-3],
                    y = training_set$Target)
    

  11. Decision Tree classification model


  12. Decision Tree in Python


    
    #The only difference in code from Logistic Regression (above) is Training the model step
    #Training the Decision Tree model on the Training set
    from sklearn.tree import DecisionTreeClassifier
    classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
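
With `criterion = 'entropy'`, candidate splits are scored by how much they reduce the entropy of the class labels. The sketch below computes entropy and information gain for one hypothetical split in plain Python. The helper names and label lists are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical node with 4 examples of each class
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A mediocre split and a perfect split
mediocre_gain = information_gain(parent, [0, 0, 0, 1], [0, 1, 1, 1])
perfect_gain = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])

print(round(mediocre_gain, 3))  # 0.189
print(perfect_gain)             # 1.0
```

The tree-building procedure prefers the split with the larger gain, which here is the one that separates the two classes completely.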
    

    Decision Tree in R


    
    #The differences in code from Logistic Regression (above) are Training the model and predicting steps
    #Fitting Decision Tree to the Training set
    # install.packages('rpart')
    library(rpart)
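    # method = 'class' requests a classification tree even if Target is stored as a number
    classifier = rpart(formula = Target ~ .,
               data = training_set,
               method = 'class')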
    #Predicting the Test set results
    y_pred = predict(classifier, newdata = test_set[-3], type = 'class')
    

  13. Random Forest classification model


  14. Random Forest in Python


    
    #The only difference in code from Logistic Regression (above) is Training the model step
    #Training the Random Forest model on the Training set
    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
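
The `n_estimators = 10` setting means ten trees are trained, each on a random resample of the training data, and their outputs are combined. The sketch below shows the two ingredients in plain Python: bootstrap sampling and a majority vote. The tree votes are hypothetical values rather than output from real trees, and hard voting is the simplified textbook version; scikit-learn averages the trees' class probabilities instead.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, one sample per tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most frequent class among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))  # stand-ins for 10 training examples
sample = bootstrap_sample(rows, rng)
print(sorted(sample))  # some rows repeat, others are left out

# Hypothetical votes from 10 trees for a single test example
tree_votes = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
print(majority_vote(tree_votes))  # 1
```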
    

    Random Forest in R


    
    #The only difference in code from Logistic Regression (above) is Training the model step
    #Fitting Random Forest to the Training set
    # install.packages('randomForest')
    library(randomForest)
    set.seed(123)
    # y must be a factor for classification; a numeric y would fit a regression forest
    classifier = randomForest(x = training_set[-3],
                      y = factor(training_set$Target),
                      ntree = 100)
    

  15. Classification models tips and features.


    Model: Logistic Regression.
    Pros: Probabilistic approach, gives information about the statistical significance of features.
    Cons: Relies on the assumptions of Logistic Regression, such as a linear relationship between the features and the log-odds of the class.

    Model: K-NN.
    Pros: Simple to understand, fast and efficient on small to medium datasets.
    Cons: Need to choose the number of neighbours k, prediction gets slower as the training set grows.

    Model: SVM.
    Pros: Performant, not biased by outliers, not sensitive to overfitting.
    Cons: Not appropriate for nonlinear problems, not the best choice for a large number of features.

    Model: Kernel SVM.
    Pros: High performance on nonlinear problems, not biased by outliers, not sensitive to overfitting.
    Cons: Not the best choice for a large number of features, more complex.

    Model: Naive Bayes.
    Pros: Efficient, not biased by outliers, works on nonlinear problems, probabilistic approach.
    Cons: Based on the assumption that features are independent of each other given the class, which rarely holds exactly in practice.

    Model: Decision Tree Classification.
    Pros: Interpretability, no need for feature scaling, works on both linear / nonlinear problems.
    Cons: Poor results on too small datasets, overfitting can easily occur.

    Model: Random Forest Classification.
    Pros: Powerful and accurate, good performance on many problems, including nonlinear ones.
    Cons: No interpretability, overfitting can easily occur, need to choose the number of trees.



