R for Regression.
The R programming language and its libraries together form a powerful tool for regression analysis tasks.
Regression analysis is a predictive modelling method that analyzes the relationship between the target (dependent) variable and the features (independent variables) in a dataset.
Updated: 2024-12-20 by Andrey BRATUS, Senior Data Analyst.
Data Preprocessing Template in RStudio
Support Vector Regression (SVR) in R
Decision Tree Regression in R
Random Forest Regression model in R
Regression models tips and features.
Regression methods apply when the target variable contains continuous values and its relationship with the independent features is linear or nonlinear. They are mainly used to measure the strength of predictors, to forecast trends and time series, and sometimes to study cause-and-effect relationships.
Regression analysis is the basic technique for solving regression problems in machine learning (ML) with data models. For linear regression it consists of finding the best-fit line: a line placed so that its overall distance from the data points is minimized. The line does not have to pass through every point; in ordinary least squares it minimizes the sum of squared vertical distances between the points and the line.
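To make the best-fit idea concrete, here is a minimal sketch that computes the least-squares slope and intercept by hand for a single feature and compares them with what lm() returns. The vectors x and y are toy values invented for illustration; they are not part of the tutorial's dataset.

```r
# Toy data, invented for illustration only
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form ordinary least squares estimates for one feature
slope = cov(x, y) / var(x)
intercept = mean(y) - slope * mean(x)

# The same line fitted by lm(); coefficients come back as (Intercept), x
fit = lm(y ~ x)
coef(fit)

# Sum of squared residuals: the quantity the best-fit line minimizes
sum(residuals(fit)^2)
```

Any other slope or intercept would give a larger sum of squared residuals on these points.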
Importing the libraries
library(caTools)
Importing the dataset
dataset = read.csv('Data.csv')
Splitting the dataset into the Training set and Test set
set.seed(123)
split = sample.split(dataset$DependentVariable, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
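If caTools is not available, a similar 80/20 split can be sketched with plain random sampling in base R. This is an alternative, not a description of what sample.split does internally, and the toy data frame below is a hypothetical stand-in for Data.csv.

```r
# Hypothetical toy data frame standing in for Data.csv
set.seed(123)
toy = data.frame(feature = 1:10, target = (1:10) * 2)

# Draw 80% of the row indices at random for training
train_idx = sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
toy_train = toy[train_idx, ]
toy_test = toy[-train_idx, ]

# Every row ends up in exactly one of the two subsets
nrow(toy_train)
nrow(toy_test)
```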
Taking care of missing data
# Replace missing values with the column mean
dataset$feature = ifelse(is.na(dataset$feature),
                         ave(dataset$feature, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$feature)
dataset$target = ifelse(is.na(dataset$target),
                        ave(dataset$target, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$target)
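The effect of mean imputation is easiest to see on a small vector. The sketch below uses invented values and calls mean() directly instead of ave(), which gives the same fill value when no grouping is involved.

```r
# Toy vector with two missing entries, invented for illustration
x = c(10, NA, 30, NA, 50)

# Mean of the observed values only
observed_mean = mean(x, na.rm = TRUE)

# Keep observed values, fill the gaps with the mean
x_filled = ifelse(is.na(x), observed_mean, x)
x_filled
```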
Encoding categorical data
dataset$feature2 = factor(dataset$feature2,
                          levels = c('Option1', 'Option2', 'Option3'),
                          labels = c(1, 2, 3))
dataset$feature3 = factor(dataset$feature3,
                          levels = c('No', 'Yes'),
                          labels = c(0, 1))
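A short sketch of what this encoding produces, using a hypothetical toy column: the result is still a factor whose labels happen to look like numbers, and modelling functions such as lm() expand it into indicator columns, which model.matrix() makes visible.

```r
# Hypothetical toy column, invented for illustration
toy = data.frame(feature2 = c('Option1', 'Option3', 'Option2', 'Option1'))
toy$feature2 = factor(toy$feature2,
                      levels = c('Option1', 'Option2', 'Option3'),
                      labels = c(1, 2, 3))

# The labels are stored as character levels, not as numbers
is.factor(toy$feature2)
levels(toy$feature2)

# Design matrix: an intercept plus one indicator column per non-reference level
design = model.matrix(~ feature2, data = toy)
design
```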
Feature Scaling
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])
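The template scales each set with its own mean and standard deviation. A common alternative, sketched here as an assumption rather than as the author's method, is to reuse the training-set statistics for the test set so that both sets share one scale. The toy data frames are hypothetical; scale() stores the statistics it used in the 'scaled:center' and 'scaled:scale' attributes of its result.

```r
# Hypothetical toy sets; columns 2 and 3 are the numeric features
training_toy = data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
test_toy = data.frame(id = 5:6, a = c(2, 5), b = c(15, 50))

# Scale the training features and keep the matrix to read its attributes
scaled_train = scale(training_toy[, 2:3])
train_center = attr(scaled_train, 'scaled:center')
train_scale = attr(scaled_train, 'scaled:scale')
training_toy[, 2:3] = scaled_train

# Apply the training mean and standard deviation to the test features
test_toy[, 2:3] = scale(test_toy[, 2:3], center = train_center, scale = train_scale)
test_toy
```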
Fitting Simple Linear Regression to the Training set in R
# Column names follow the generic feature/target naming used in the rest of this template
regressor = lm(formula = target ~ feature,
               data = training_set)
Getting information about our model in R
summary(regressor)
Predicting the Test set results in R
y_pred = predict(regressor, newdata = test_set)
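Once predictions exist, they can be compared with the true test values. The sketch below computes two common error measures, RMSE and MAE, on simulated data; the data frame, the split by row position and all variable names are assumptions made for the example.

```r
# Simulated data, invented for illustration: a noisy linear relationship
set.seed(123)
toy = data.frame(feature = 1:20)
toy$target = 3 + 2 * toy$feature + rnorm(20)
toy_train = toy[1:16, ]
toy_test = toy[17:20, ]

toy_fit = lm(target ~ feature, data = toy_train)
toy_pred = predict(toy_fit, newdata = toy_test)

# Root mean squared error: penalizes large misses more strongly
rmse = sqrt(mean((toy_test$target - toy_pred)^2))
# Mean absolute error: average size of a miss, in the units of the target
mae = mean(abs(toy_test$target - toy_pred))
c(rmse = rmse, mae = mae)
```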
Installing visualization library in RStudio
install.packages('ggplot2')
Visualising the Training set results in RStudio
library(ggplot2)
ggplot() +
  geom_point(aes(x = training_set$feature, y = training_set$target),
             colour = 'red') +
  geom_line(aes(x = training_set$feature, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Target vs feature (Training set)') +
  xlab('feature') +
  ylab('target')
Visualising the Test set results in RStudio
library(ggplot2)
ggplot() +
  # Test observations as points
  geom_point(aes(x = test_set$feature, y = test_set$target),
             colour = 'red') +
  # The regression line is the one learned on the training set
  geom_line(aes(x = training_set$feature, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Target vs feature (Test set)') +
  xlab('feature') +
  ylab('target')
Fitting Multiple Linear Regression to the Training set
regressor = lm(formula = Target ~ .,
               data = training_set)
Features filtering using Backward Elimination technique
regressor = lm(formula = Target ~ Feature1 + Feature2 + Feature3 + Feature4,
               data = dataset)
summary(regressor)
# If the highest p-value is above the significance level (e.g. 5%), remove that feature from the formula and fit the regressor again.
# Repeat until every remaining feature has a p-value below the significance level.
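The manual procedure above can be automated with a small loop. The following is a minimal sketch under stated assumptions: backward_elimination is a hypothetical helper written for this example, it handles numeric features only (so coefficient names match column names), and the simulated data are invented so that only Feature1 and Feature3 truly influence the target.

```r
# Simulated data, invented for illustration
set.seed(42)
n = 200
toy = data.frame(Feature1 = rnorm(n), Feature2 = rnorm(n),
                 Feature3 = rnorm(n), Feature4 = rnorm(n))
toy$Target = 1 + 3 * toy$Feature1 - 2 * toy$Feature3 + rnorm(n)

# Hypothetical helper: drop the least significant feature until all pass
backward_elimination = function(data, target, sl = 0.05) {
  features = setdiff(names(data), target)
  repeat {
    if (length(features) == 0) {
      # Nothing left to test: fall back to an intercept-only model
      return(lm(as.formula(paste(target, '~ 1')), data = data))
    }
    fit = lm(reformulate(features, response = target), data = data)
    coefs = summary(fit)$coefficients
    # p-values of the features, without the intercept row
    p_values = setNames(coefs[-1, 4], rownames(coefs)[-1])
    if (max(p_values) <= sl) {
      return(fit)
    }
    # Remove the feature with the highest p-value and refit
    worst = names(p_values)[which.max(p_values)]
    features = setdiff(features, worst)
  }
}

final_fit = backward_elimination(toy, target = 'Target', sl = 0.05)
summary(final_fit)
```

Repeated significance testing of this kind can keep a noise feature by chance, so the selected model is worth checking on held-out data.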
Training the Polynomial Regression model on the whole dataset in RStudio
# First create the new polynomial features
dataset$feature2 = dataset$feature^2
dataset$feature3 = dataset$feature^3
dataset$feature4 = dataset$feature^4
# Then train the model and display the results
poly_reg = lm(formula = Target ~ .,
              data = dataset)
summary(poly_reg)
Predicting a new single result with Linear Regression- single feature
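# Fit on the original feature only; Target ~ . would also pick up any polynomial columns present in dataset
lin_reg = lm(formula = Target ~ feature,
             data = dataset)
predict(lin_reg, data.frame(feature = put_here_any_value))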
Predicting a new single result with Polynomial Regression- single feature
# Each polynomial column must receive the matching power of the new value
predict(poly_reg, data.frame(feature = put_here_any_value,
                             feature2 = put_here_any_value^2,
                             feature3 = put_here_any_value^3,
                             feature4 = put_here_any_value^4))
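Typing every power by hand at prediction time is easy to get wrong. As an alternative sketch (not the approach used in this template), the powers can be declared inside the formula with I(), so that predict() only needs the raw feature value. The simulated data and the value 6.5 are assumptions chosen for illustration.

```r
# Simulated data, invented for illustration
set.seed(1)
toy = data.frame(feature = seq(1, 10, by = 0.5))
toy$Target = 2 + 0.5 * toy$feature^3 - toy$feature^2 + rnorm(nrow(toy))

# Powers are declared in the formula, so no extra columns are needed
poly_fit = lm(Target ~ feature + I(feature^2) + I(feature^3) + I(feature^4),
              data = toy)

# predict() rebuilds the powers from the single raw value
new_prediction = predict(poly_fit, data.frame(feature = 6.5))
new_prediction
```

Both formulations fit the same degree-4 polynomial; the formula-based one just moves the bookkeeping into the model.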
Support Vector Regression (SVR) in R
# Importing the dataset
dataset = read.csv('my_dataset.csv')
dataset = dataset[2:3]
# Fitting SVR to the dataset
# install.packages('e1071')
library(e1071)
regressor = svm(formula = Target ~ .,
                data = dataset,
                type = 'eps-regression',
                kernel = 'radial')
# Predicting a new result with SVR - single feature
y_pred = predict(regressor, data.frame(Feature = put_here_any_value))
# Visualising the SVR results
# install.packages('ggplot2')
library(ggplot2)
ggplot() +
  geom_point(aes(x = dataset$Feature, y = dataset$Target),
             colour = 'red') +
  geom_line(aes(x = dataset$Feature, y = predict(regressor, newdata = dataset)),
            colour = 'blue') +
  ggtitle('Target vs Feature (SVR)') +
  xlab('Feature') +
  ylab('Target')
Decision Tree Regression in R
# Importing the dataset
dataset = read.csv('my_dataset.csv')
dataset = dataset[2:3]
# Fitting Decision Tree Regression to the dataset
# install.packages('rpart')
library(rpart)
regressor = rpart(formula = Target ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1))
# Predicting a new result with Decision Tree Regression - single feature
y_pred = predict(regressor, data.frame(Feature = put_here_any_value))
# Visualising the Decision Tree Regression results with high resolution
# install.packages('ggplot2')
library(ggplot2)
x_grid = seq(min(dataset$Feature), max(dataset$Feature), 0.01)
ggplot() +
  geom_point(aes(x = dataset$Feature, y = dataset$Target),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Feature = x_grid))),
            colour = 'blue') +
  ggtitle('Target vs Feature (Decision Tree Regression)') +
  xlab('Feature') +
  ylab('Target')
# Plotting the tree
plot(regressor)
text(regressor)
Random Forest Regression model in R
# The only difference from the Decision Tree Regression code above is the training step.
# In this example the Random Forest Regression model is trained on the whole dataset (without a split).
# install.packages('randomForest')
library(randomForest)
set.seed(1234)
regressor = randomForest(x = dataset[1],
                         y = dataset$Target,
                         ntree = 500)
Model: Linear Regression.
Pros: Works on any size of dataset, gives information about the relevance of features.
Cons: Relies on the linear regression assumptions.
Model: Polynomial Regression.
Pros: Works on any size of dataset, works very well on nonlinear problems.
Cons: The right polynomial degree must be chosen for a good bias/variance tradeoff.
Model: SVR.
Pros: Easily adaptable, works very well on nonlinear problems, not biased by outliers.
Cons: Feature scaling is compulsory, results are difficult to interpret.
Model: Decision Tree Regression.
Pros: Interpretable, no need for feature scaling, works on both linear and nonlinear problems.
Cons: Poor results on very small datasets, overfitting can easily occur.
Model: Random Forest Regression.
Pros: Powerful and accurate, good performance on many problems, including nonlinear ones.
Cons: No interpretability, overfitting can easily occur, the number of trees must be chosen.