Data Science Classification use case.
Python and its libraries, together with the R language, are powerful tools for solving classification tasks.
Classification is one of the most fundamental concepts in data science.
Python Knowledge Base: Make coding great again.
Updated: 2024-09-12 by Andrey BRATUS, Senior Data Analyst.
Logistic Regression classification model
K-Nearest Neighbors (K-NN) classification model
Support Vector Machine (SVM) classification model
Kernel SVM classification model
Naive Bayes classification model
Decision Tree classification model
Random Forest classification model
Classification models tips and features.
Classification is a machine learning task in which an algorithm learns to assign a class label to examples from the problem domain. From a modeling perspective, classification requires a training dataset with many examples of inputs and their known outputs. The model uses this training dataset to work out how best to map input data to specific class labels. The training dataset must therefore be sufficiently representative of the problem and contain many examples of each class label.
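To make the point about having enough examples of each class concrete, here is a minimal sketch that counts the examples per label and then splits the data so that both parts keep the same class proportions. The labels and the single placeholder feature are made up for illustration; `stratify` is a standard parameter of scikit-learn's `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 examples of class 0 and 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# Count how many examples each class label has
labels, counts = np.unique(y, return_counts=True)
class_counts = dict(zip(labels.tolist(), counts.tolist()))
print(class_counts)

# stratify=y asks the split to keep the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Share of class 1 in each part (the mean of 0/1 labels)
train_share = y_train.mean()
test_share = y_test.mean()
print(train_share, test_share)
```

Without `stratify`, a rare class can end up under-represented in the training or test part purely by chance.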
Logistic Regression in Python
#Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('my_dataset.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Training the Logistic Regression model on the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
#Predicting a new result from 2 features (feature1_value and feature2_value are placeholders for your own values)
print(classifier.predict(sc.transform([[feature1_value, feature2_value]])))
#Predicting the Test set results
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
#Displaying the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
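The Python steps above can be tried end to end without a CSV file. The sketch below is one possible arrangement, not the only one: it replaces `my_dataset.csv` with synthetic data from `make_classification` (an assumption for illustration) and wraps scaling and training in a scikit-learn pipeline, so the scaler is fitted on the training data only. It also prints class probabilities, which is what makes logistic regression a probabilistic model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for my_dataset.csv: 200 rows, 2 features, binary target
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The pipeline fits the scaler on the training data and reuses it when predicting
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
model.fit(X_train, y_train)

# One row per test example, one column per class; each row sums to 1
proba = model.predict_proba(X_test)
print(proba[:5])

# Accuracy on the test set
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```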
Logistic Regression in R
#Importing the dataset
dataset = read.csv('my_dataset.csv')
dataset = dataset[3:5]
# Splitting the dataset into the Training set and Test set in R
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Target, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
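# Feature Scaling: column 3 is the target, so only the feature columns are scaled.
# The test set reuses the mean and standard deviation of the training set.
train_scaled = scale(training_set[-3])
training_set[-3] = train_scaled
test_set[-3] = scale(test_set[-3],
center = attr(train_scaled, 'scaled:center'),
scale = attr(train_scaled, 'scaled:scale'))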
# Fitting Logistic Regression to the Training set in R
classifier = glm(formula = Target ~ .,
family = binomial,
data = training_set)
# Predicting the Test set results
prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
# Making the Confusion Matrix in R (y_pred already holds 0/1 labels)
cm = table(test_set[, 3], y_pred)
K-Nearest Neighbors (K-NN) in Python
#Only the model-training step differs from the Logistic Regression code above
#Training the K-NN model on the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
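The value `n_neighbors = 5` above is a common starting point rather than a rule. One hedged way to choose k is a cross-validated grid search, sketched here on synthetic data (the dataset and the candidate values are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Scale inside the pipeline so each cross-validation fold is scaled on its own training part
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate values of k; odd numbers avoid ties in binary problems
candidate_k = [1, 3, 5, 7, 9, 11]
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": candidate_k},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, search.best_score_)
```

The parameter name `kneighborsclassifier__n_neighbors` follows the naming that `make_pipeline` derives from the class name of each step.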
K-Nearest Neighbors (K-NN) in R
#The training and prediction steps differ from the Logistic Regression code above: knn() fits the model and predicts the test set in a single call
#Fitting K-NN to the Training set and Predicting the Test set results
# install.packages('class')
library(class)
y_pred = knn(train = training_set[, -3],
test = test_set[, -3],
cl = training_set[, 3],
k = 5,
prob = TRUE)
Support Vector Machine (SVM) in Python
#Only the model-training step differs from the Logistic Regression code above
#Training the SVM model on the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)
Support Vector Machine (SVM) in R
#The model-training step differs from the Logistic Regression code above; predict(classifier, newdata = test_set[-3]) then returns class labels directly, so no 0.5 threshold is needed
#Fitting SVM to the Training set
# install.packages('e1071')
library(e1071)
classifier = svm(formula = Target ~ .,
data = training_set,
type = 'C-classification',
kernel = 'linear')
Kernel SVM in Python
#Only the model-training step differs from the Logistic Regression code above
#Training the Kernel SVM model on the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)
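To illustrate why the kernel matters, the following sketch compares the linear and RBF kernels on a dataset that is not linearly separable. The two-moons data from `make_moons` is an assumption chosen for illustration, and feature scaling is skipped for brevity because both features already have a similar range.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half circles: a classic nonlinear toy problem
X, y = make_moons(n_samples=400, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train one SVM per kernel and record its test accuracy
scores = {}
for kernel in ("linear", "rbf"):
    classifier = SVC(kernel=kernel, random_state=0)
    classifier.fit(X_train, y_train)
    scores[kernel] = classifier.score(X_test, y_test)

print(scores)
```

On data like this the RBF kernel can bend the decision boundary around the two shapes, while the linear kernel is limited to a straight line.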
Kernel SVM in R
#The model-training step differs from the Logistic Regression code above; predict(classifier, newdata = test_set[-3]) then returns class labels directly, so no 0.5 threshold is needed
#Fitting Kernel SVM to the Training set
# install.packages('e1071')
library(e1071)
classifier = svm(formula = Target ~ .,
data = training_set,
type = 'C-classification',
kernel = 'radial')
Naive Bayes in Python
#Only the model-training step differs from the Logistic Regression code above
#Training the Naive Bayes model on the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
Naive Bayes in R
#The differences from the Logistic Regression code above are the target-encoding and model-training steps
#Encoding the target as factor (do this before splitting the dataset)
dataset$Target = factor(dataset$Target, levels = c(0, 1))
#Fitting Naive Bayes to the Training set
# install.packages('e1071')
library(e1071)
classifier = naiveBayes(x = training_set[-3],
y = training_set$Target)
Decision Tree in Python
#Only the model-training step differs from the Logistic Regression code above
#Training the Decision Tree model on the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
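Decision trees overfit easily when allowed to grow without limit. As a rough illustration, the sketch below trains an unrestricted tree and a depth-limited tree on synthetic, deliberately noisy data (the dataset and `max_depth=3` are assumptions for illustration) and prints training and test accuracy for both.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% of labels flipped at random to simulate noise
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Unrestricted tree: grows until every training leaf is pure
deep = DecisionTreeClassifier(criterion="entropy", random_state=0)
deep.fit(X_train, y_train)

# Depth-limited tree: a simple way to restrain overfitting
shallow = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
shallow.fit(X_train, y_train)

for name, tree in (("deep", deep), ("shallow", shallow)):
    print(name, tree.get_depth(),
          tree.score(X_train, y_train), tree.score(X_test, y_test))
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign that it has memorized noise in the training set.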
Decision Tree in R
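#The differences from the Logistic Regression code above are the model-training and prediction steps; the target must also be encoded as a factor (as in the Naive Bayes section) so that rpart builds a classification tree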
#Fitting Decision Tree to the Training set
# install.packages('rpart')
library(rpart)
classifier = rpart(formula = Target ~ .,
data = training_set)
#Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3], type = 'class')
Random Forest in Python
#Only the model-training step differs from the Logistic Regression code above
#Training the Random Forest model on the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
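The number of trees (`n_estimators`) has to be chosen by the user. A simple, hedged way to compare a few values is cross-validation, sketched here on synthetic data (the dataset and the candidate tree counts are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# Mean 5-fold cross-validated accuracy for each candidate number of trees
tree_counts = [10, 50, 100]
mean_scores = {}
for n_trees in tree_counts:
    forest = RandomForestClassifier(n_estimators=n_trees, criterion="entropy",
                                    random_state=0)
    mean_scores[n_trees] = cross_val_score(forest, X, y, cv=5).mean()
    print(n_trees, round(mean_scores[n_trees], 3))
```

More trees generally make the result more stable at the cost of longer training, so the smallest count after which the score stops improving is often a reasonable choice.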
Random Forest in R
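#The model-training step differs from the Logistic Regression code above; the target must also be encoded as a factor (as in the Naive Bayes section), otherwise randomForest fits a regression model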
#Fitting Random Forest to the Training set
# install.packages('randomForest')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[-3],
y = training_set$Target,
ntree = 100)
Model: Logistic Regression.
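Pros: Probabilistic approach, gives information about the statistical significance of features.
Cons: Relies on the logistic regression assumptions (for example, a linear relationship between the features and the log-odds of the target).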
Model: K-NN.
Pros: Simple to understand, fast and efficient.
Cons: Need to choose the number of neighbours k.
Model: SVM.
Pros: Performant, not biased by outliers, not sensitive to overfitting.
Cons: Not appropriate for nonlinear problems, not the best choice for a large number of features.
Model: Kernel SVM.
Pros: High performance on nonlinear problems, not biased by outliers, not sensitive to overfitting.
Cons: Not the best choice for a large number of features, more complex.
Model: Naive Bayes.
Pros: Efficient, not biased by outliers, works on nonlinear problems, probabilistic approach.
Cons: Based on the assumption that features are conditionally independent given the class, which rarely holds exactly in practice.
Model: Decision Tree Classification.
Pros: Interpretability, no need for feature scaling, works on both linear and nonlinear problems.
Cons: Poor results on too small datasets, overfitting can easily occur.
Model: Random Forest Classification.
Pros: Powerful and accurate, good performance on many problems, including nonlinear ones.
Cons: No interpretability, overfitting can easily occur, need to choose the number of trees.
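Since each model has different strengths, it is usually worth comparing them on your own data. The following is a minimal sketch, assuming synthetic data in place of a real dataset, that evaluates the seven models above with 5-fold cross-validation. Every model is wrapped in a pipeline with `StandardScaler` for uniformity, even though the tree-based models do not need scaling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

# The same seven models used in the sections above
models = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "K-NN": KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
    "SVM": SVC(kernel="linear", random_state=0),
    "Kernel SVM": SVC(kernel="rbf", random_state=0),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=10, criterion="entropy",
                                            random_state=0),
}

# Mean cross-validated accuracy per model
results = {}
for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    results[name] = cross_val_score(pipeline, X, y, cv=5).mean()

# Print the models from best to worst mean accuracy
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The ranking depends heavily on the dataset, so the output of this sketch says nothing about which model is best in general.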