ANOVA in statistics.


Analysis of variance (ANOVA) is a statistical method used to test whether the means of two or more groups differ significantly from each other. ANOVA assesses the impact of one or more factors by comparing the means of different samples.

One-way ANOVA with Python.


TSS = SSE + RSS = Within_group_variability + Between_group_variability

Total Variation → TSS = Total Sum of Squares
Unexplained Variation → SSE = Sum of Squared Errors
Explained Variation → RSS = Regression Sum of Squares
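The decomposition above can be checked numerically. The following is a minimal sketch on a small made-up dataset (the three groups and all variable names are assumptions for illustration, not data from this article); it computes each sum of squares directly and confirms that the total equals the within-group part plus the between-group part.

```python
import numpy as np

# Three small hypothetical groups (illustrative values only)
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: squared distance of every observation from the grand mean
tss = np.sum((all_values - grand_mean) ** 2)

# Unexplained (within-group) variation: distance of each observation from its own group mean
sse = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Explained (between-group) variation: distance of each group mean from the grand mean,
# weighted by the group size
rss = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

print(f"TSS = {tss:.2f}, SSE = {sse:.2f}, RSS = {rss:.2f}")
print("TSS == SSE + RSS:", np.isclose(tss, sse + rss))
```

For these values the within-group part is small relative to the between-group part, which is the pattern that produces a large F-ratio.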

The statistic that measures whether the means of different samples differ significantly is called the F-ratio. The lower the F-ratio, the more similar the sample means are, and in that case we cannot reject the null hypothesis. If there is no true difference between the group means, the F-ratio should be close to 1.

F = Between_group_variability / Within_group_variability
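In practice, each "variability" in this ratio is a mean square: a sum of squares divided by its degrees of freedom (k - 1 between groups and N - k within groups, for k groups and N observations in total). The sketch below computes the F-ratio by hand under that convention; the helper name `f_ratio` and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np


def f_ratio(groups):
    """Return (F, df_between, df_within) for a list of 1-D sample arrays."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)          # number of groups
    n = all_values.size      # total number of observations

    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

    df_between = k - 1
    df_within = n - k

    # Mean squares: sums of squares scaled by their degrees of freedom
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within


# Illustrative values only
example_groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]
F, df_b, df_w = f_ratio(example_groups)
print(f"F({df_b}, {df_w}) = {F:.2f}")
```

The degrees of freedom returned alongside F are the ones needed to look up a p-value in the F distribution.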

A significant one-way ANOVA tells us that at least two groups differ from each other, but it won’t tell us which ones. If the test returns a significant F-statistic, we may need to run a post-hoc test to find out exactly which groups differ in their means.
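To illustrate what a post-hoc step does, here is a rough sketch that uses Bonferroni-corrected pairwise t-tests. This is a simpler stand-in and not the Tukey test used later in this article. It assumes SciPy is installed; the simulated samples, the seed, and the variable names are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical samples, loosely mirroring the group means used in this article
samples = {
    '1': 4.0 + 2.0 * rng.standard_normal(30),
    '2': 3.8 + 2.0 * rng.standard_normal(35),
    '3': 7.0 + 2.0 * rng.standard_normal(29),
}

pairs = list(combinations(samples, 2))  # every pair of group labels
n_comparisons = len(pairs)

adjusted_p = {}
for a, b in pairs:
    # Student's t-test for two independent samples (equal variances assumed)
    raw_p = stats.ttest_ind(samples[a], samples[b]).pvalue
    # Bonferroni correction: multiply by the number of comparisons, cap at 1
    adjusted_p[(a, b)] = min(1.0, raw_p * n_comparisons)

for pair, p in adjusted_p.items():
    print(pair, f"adjusted p = {p:.4f}")
```

Pairs whose adjusted p-value falls below the chosen significance level are the ones flagged as different. Bonferroni tends to be more conservative than Tukey's procedure, which is one reason a dedicated Tukey test is often preferred after ANOVA.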



One-way ANOVA - simulating data:



import numpy as np
import matplotlib.pyplot as plt
import pingouin as pg
import pandas as pd

## data parameters

# group means
mean1 = 4
mean2 = 3.8
mean3 = 7

# samples per group
N1 = 30
N2 = 35
N3 = 29

# standard deviation (assume common across groups)
stdev = 2

## now to simulate the data
data1 = mean1 + np.random.randn(N1)*stdev
data2 = mean2 + np.random.randn(N2)*stdev
data3 = mean3 + np.random.randn(N3)*stdev

datacolumn = np.hstack((data1,data2,data3))

# group labels
groups = ['1']*N1 + ['2']*N2 + ['3']*N3

# convert to a pandas dataframe
df = pd.DataFrame({'TheData':datacolumn,'Group':groups})
df



Generating the ANOVA table:



pg.anova(data=df,dv='TheData',between='Group')



Performing the Tukey test:



pg.pairwise_tukey(data=df,dv='TheData',between='Group')



One-way ANOVA - visualizing the means differences:



df.boxplot('TheData',by='Group');




