Classification models with Python and R.

Data Science Classification use case.


Python and R, together with their libraries, are powerful tools for solving classification tasks.

Classification is one of the most fundamental concepts in data science.

Classification models in Python and R.


Classification is a task in which a machine learning algorithm learns to assign a class label to examples from the problem domain. From a modeling perspective, classification requires a training dataset with many examples of inputs and outputs from which to learn. A model uses the training dataset to work out how best to map examples of input data to specific class labels. As such, the training dataset must be sufficiently representative of the problem and contain many examples of each class label.
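
One way to check that a training set is representative is to compare the share of each class label before and after the split. The sketch below is only an illustration: it uses synthetic data from scikit-learn's make_classification instead of a real dataset, and the helper name class_share is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class data (roughly 80% class 0 and 20% class 1).
X, y = make_classification(n_samples=1000, n_features=4,
                           weights=[0.8, 0.2], random_state=0)

# stratify=y keeps the class proportions close to identical in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)


def class_share(labels):
    """Return the fraction of examples that belong to each class."""
    counts = np.bincount(labels)
    return counts / counts.sum()


print("full :", class_share(y))
print("train:", class_share(y_train))
print("test :", class_share(y_test))
```

Without stratify, a small or imbalanced dataset can end up with very few examples of the minority class in one of the subsets.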



Logistic Regression classification model


Logistic Regression in Python



#Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('my_dataset.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Training the Logistic Regression model on the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
#Predicting a new result - 2 features (replace the two placeholders with actual feature values)
print(classifier.predict(sc.transform([[feature1_value, feature2_value]])))
#Predicting the Test set results
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
#Displaying the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
#Displaying the accuracy (share of correctly classified Test set examples)
print(accuracy_score(y_test, y_pred))
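
The scaling and training steps above can also be bundled into a single scikit-learn pipeline, so that the scaler is fitted on the training data only and new observations are scaled automatically. This is a minimal sketch, not a replacement for the code above: it uses synthetic data as a stand-in for my_dataset.csv, and the values in new_obs are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for my_dataset.csv: two features and a binary target.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The pipeline fits the scaler on the training data and reuses it at predict time.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy: {acc:.3f}")

# Class probabilities for one new, unscaled observation (illustrative values).
new_obs = [[0.5, -1.2]]
proba = model.predict_proba(new_obs)
print(proba)
```

predict_proba returns one probability per class, which is the probabilistic output that Logistic Regression is usually chosen for.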


Logistic Regression in R



#Importing the dataset
dataset = read.csv('my_dataset.csv')
dataset = dataset[3:5]
# Splitting the dataset into the Training set and Test set in R
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Target, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
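# Feature Scaling (features are columns 1 and 2; column 3 is the Target and is left unscaled)
# The test set is scaled with the mean and standard deviation of the training set
scaled_train = scale(training_set[, 1:2])
training_set[, 1:2] = scaled_train
test_set[, 1:2] = scale(test_set[, 1:2],
                        center = attr(scaled_train, 'scaled:center'),
                        scale = attr(scaled_train, 'scaled:scale'))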
# Fitting Logistic Regression to the Training set in R
classifier = glm(formula = Target ~ .,
                 family = binomial,
                 data = training_set)
# Predicting the Test set results
prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
# Making the Confusion Matrix in R (rows: actual classes, columns: predicted classes)
cm = table(test_set[, 3], y_pred)

K-Nearest Neighbors (K-NN) classification model


K-Nearest Neighbors (K-NN) in Python



#The only difference in code from Logistic Regression (above) is the model Training step
#Training the K-NN model on the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
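
The value n_neighbors = 5 used above is only a common default. One way to pick the number of neighbors is a cross-validated grid search, sketched below on synthetic data; the candidate values in param_grid are an arbitrary assumption and should be adapted to the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for my_dataset.csv: two features and a binary target.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Scaling sits inside the pipeline so it is refitted on every training fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name.
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11, 15]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("best k:", best_k)
print("mean cross-validated accuracy:", round(search.best_score_, 3))
```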

K-Nearest Neighbors (K-NN) in R



#The differences in code from Logistic Regression (above) are the Training and Predicting steps (knn does both in one call)
#Fitting K-NN to the Training set and Predicting the Test set results
# install.packages('class')
library(class)
y_pred = knn(train = training_set[, -3],
     test = test_set[, -3],
     cl = training_set[, 3],
     k = 5,
     prob = TRUE)


Support Vector Machine (SVM) classification model


Support Vector Machine (SVM) in Python



#The only difference in code from Logistic Regression (above) is the model Training step
#Training the SVM model on the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

Support Vector Machine (SVM) in R



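#The differences in code from Logistic Regression (above) are the Training and Predicting steps
#Fitting SVM to the Training set
# install.packages('e1071')
library(e1071)
classifier = svm(formula = Target ~ .,
         data = training_set,
         type = 'C-classification',
         kernel = 'linear')
#Predicting the Test set results (predict returns class labels directly, so no 0.5 threshold is needed)
y_pred = predict(classifier, newdata = test_set[-3])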

Kernel SVM classification model


Kernel SVM in Python



#The only difference in code from Logistic Regression (above) is the model Training step
#Training the Kernel SVM model on the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)
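
To see what the choice of kernel changes, the linear and rbf kernels can be compared on data that a straight line cannot separate. The sketch below uses scikit-learn's synthetic make_moons data, which is an assumption made for illustration only; on a real dataset the gap between the two kernels may be smaller or absent.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half circles: a classic nonlinearly separable problem.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    # Same preprocessing and settings; only the kernel differs.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=0))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(f"{kernel:<6} test accuracy: {scores[kernel]:.3f}")
```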

Kernel SVM in R



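#The differences in code from Logistic Regression (above) are the Training and Predicting steps
#Fitting Kernel SVM to the Training set
# install.packages('e1071')
library(e1071)
classifier = svm(formula = Target ~ .,
         data = training_set,
         type = 'C-classification',
         kernel = 'radial')
#Predicting the Test set results (predict returns class labels directly, so no 0.5 threshold is needed)
y_pred = predict(classifier, newdata = test_set[-3])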

Naive Bayes classification model


Naive Bayes in Python



#The only difference in code from Logistic Regression (above) is the model Training step
#Training the Naive Bayes model on the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

Naive Bayes in R



#The differences in code from Logistic Regression (above) are the Encoding the target, Training and Predicting steps
#Encoding the target as factor (do this before splitting the dataset)
dataset$Target = factor(dataset$Target, levels = c(0, 1))
#Fitting Naive Bayes to the Training set
# install.packages('e1071')
library(e1071)
classifier = naiveBayes(x = training_set[-3],
                y = training_set$Target)
#Predicting the Test set results (predict returns class labels directly)
y_pred = predict(classifier, newdata = test_set[-3])

Decision Tree classification model


Decision Tree in Python



#The only difference in code from Logistic Regression (above) is the model Training step
#Training the Decision Tree model on the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
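
A fitted tree can be printed as a set of if/else rules, which is what makes this model easy to interpret. The sketch below is an illustration under two assumptions: it uses scikit-learn's bundled iris dataset rather than my_dataset.csv, and it limits the depth to 2 to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree is easier to read and less prone to overfitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as indented, human-readable rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```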

Decision Tree in R



#The differences in code from Logistic Regression (above) are the Training and Predicting steps
#Fitting Decision Tree to the Training set
# install.packages('rpart')
library(rpart)
# method = 'class' builds a classification tree even when Target is stored as a number
classifier = rpart(formula = Target ~ .,
           data = training_set,
           method = 'class')
#Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3], type = 'class')

Random Forest classification model


Random Forest in Python



#The only difference in code from Logistic Regression (above) is the model Training step
#Training the Random Forest model on the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

Random Forest in R



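#The differences in code from Logistic Regression (above) are the Training and Predicting steps
#Fitting Random Forest to the Training set
# install.packages('randomForest')
library(randomForest)
set.seed(123)
# The target is passed as a factor so that randomForest performs classification rather than regression
classifier = randomForest(x = training_set[-3],
                  y = factor(training_set$Target),
                  ntree = 100)
#Predicting the Test set results (predict returns class labels directly)
y_pred = predict(classifier, newdata = test_set[-3])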

Classification models tips and features.

Model: Logistic Regression.
Pros: Probabilistic approach, gives information about the statistical significance of features.
Cons: Relies on the Logistic Regression assumptions, such as a linear relationship between the features and the log-odds of the class.

Model: K-NN.
Pros: Simple to understand, fast and efficient.
Cons: Need to choose the number of neighbours k.

Model: SVM.
Pros: Performant, not biased by outliers, not sensitive to overfitting.
Cons: Not appropriate for nonlinear problems, not the best choice for a large number of features.

Model: Kernel SVM.
Pros: High performance on nonlinear problems, not biased by outliers, not sensitive to overfitting.
Cons: Not the best choice for a large number of features, more complex.

Model: Naive Bayes.
Pros: Efficient, not biased by outliers, works on nonlinear problems, probabilistic approach.
Cons: Based on the assumption that features are independent of one another given the class and have the same statistical relevance.

Model: Decision Tree Classification.
Pros: Interpretability, no need for feature scaling, works on both linear / nonlinear problems.
Cons: Poor results on very small datasets, overfitting can easily occur.

Model: Random Forest Classification.
Pros: Powerful and accurate, good performance on many problems, including nonlinear ones.
Cons: No interpretability, overfitting can easily occur, need to choose the number of trees.
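
The trade-offs listed above are general tendencies, so it usually pays to compare the models on the data at hand. The sketch below scores the seven classifiers from this article with 5-fold cross-validation; the synthetic make_moons dataset is an assumption made for illustration, and the ranking it produces will not carry over to other datasets.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic nonlinear two-class problem used as a stand-in for real data.
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)

# Same settings as in the Python snippets of this article.
models = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "K-NN": KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
    "SVM (linear)": SVC(kernel="linear", random_state=0),
    "Kernel SVM (rbf)": SVC(kernel="rbf", random_state=0),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=10, criterion="entropy",
                                            random_state=0),
}

results = {}
for name, estimator in models.items():
    # Scaling inside the pipeline is refitted on each training fold.
    pipeline = make_pipeline(StandardScaler(), estimator)
    fold_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    results[name] = fold_scores.mean()

# Print the models from highest to lowest mean accuracy.
for name, mean_acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {mean_acc:.3f}")
```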



