Outlier cleaning in statistics.

Another statistical outlier removal method, in addition to those already described above, is multivariate analysis based on Euclidean distance. There are various distance metrics, scores and techniques for detecting outliers.

Euclidean distance outlier removal.

Euclidean distance is one of the best-known distance metrics for identifying outliers based on their distance to the center point (the multivariate mean). Once the Euclidean distances for the variables (usually 2) are calculated, the classical z-score method is applied to those distances.
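
For two variables, the calculation can be written out as follows. This is only a sketch of the idea in formula form; the symbols (x_i, y_i for the two variables, d_i for the distance, s_d for the standard deviation of the distances, and theta for the threshold) are notation assumed here, not taken from the original text.

```latex
d_i = \sqrt{(x_i - \bar{x})^2 + (y_i - \bar{y})^2},
\qquad
z_i = \frac{d_i - \bar{d}}{s_d},
\qquad
\text{point } i \text{ is flagged as an outlier if } z_i > \theta
```

Only large positive z values matter here, because a point far from the center always has a large distance.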

Creating data, calculating distances and plotting:

import numpy as np
import matplotlib.pyplot as plt

## creating data

N = 40

# two-dimensional data
d1 = np.exp(-abs(np.random.randn(N)*3))
d2 = np.exp(-abs(np.random.randn(N)*5))
datamean = [ np.mean(d1), np.mean(d2) ]

# compute distance of each point to the mean
ds = np.zeros(N)
for i in range(N):
    ds[i] = np.sqrt( (d1[i]-datamean[0])**2 + (d2[i]-datamean[1])**2 )

# convert to z (don't need the original data)
ds = (ds-np.mean(ds)) / np.std(ds)

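# plot the data
fig,ax = plt.subplots(1,2,figsize=(8,6))

# left panel: the raw two-dimensional data
ax[0].plot(d1,d2,'ko',markerfacecolor='k')
ax[0].set_xlabel('Variable x')
ax[0].set_ylabel('Variable y')

# plot the multivariate mean
ax[0].plot(datamean[0],datamean[1],'kp',markerfacecolor='g',markersize=15)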

# then plot those distances
ax[1].plot(ds,'ko',markerfacecolor=[.7, .5, .3],markersize=12)
ax[1].set_xlabel('Data index')
ax[1].set_ylabel('Z distance')

Creating data, calculating distances and plotting
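
The distance loop above can also be written without explicit iteration. The following is a minimal sketch, assuming two equal-length one-dimensional arrays; the function name `zscored_distances` and the fixed random seed are hypothetical choices made for illustration and are not part of the original code.

```python
import numpy as np

def zscored_distances(x, y):
    """Return z-scored Euclidean distances of the (x, y) points to their mean.

    Hypothetical helper for illustration; assumes x and y are 1-D
    sequences of equal length with at least two distinct points.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # distance of each point to the multivariate mean
    dist = np.hypot(x - x.mean(), y - y.mean())
    # standardize the distances (np.std defaults to the population formula)
    return (dist - dist.mean()) / dist.std()

# same kind of data as above, with an assumed seed for reproducibility
rng = np.random.default_rng(0)
x = np.exp(-np.abs(rng.standard_normal(40) * 3))
y = np.exp(-np.abs(rng.standard_normal(40) * 5))
z = zscored_distances(x, y)
```

By construction the returned values have mean 0 and standard deviation 1, so they can be compared directly against a threshold expressed in standard deviation units.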

Identifying the outliers:

IMPORTANT: choose the threshold carefully, depending on the data (usually between 2 and 3 standard deviations).

# threshold in standard deviation units
distanceThresh = 2.5

# find the offending points
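oidx = np.where(ds>distanceThresh)[0]

# and cross those out
ax[1].plot(oidx,ds[oidx],'x',color='r',markersize=20)
ax[0].plot(d1[oidx],d2[oidx],'x',color='r',markersize=20)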


Identifying the outliers using Euclidean distance
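
To get a feeling for how much the threshold matters, it can help to count the flagged points for several candidate values. The sketch below does this for thresholds of 2, 2.5 and 3; the seed and the variable names (`rng`, `counts`) are assumptions made for illustration, and the actual counts will differ for other data.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumed seed, for reproducibility only
x = np.exp(-np.abs(rng.standard_normal(40) * 3))
y = np.exp(-np.abs(rng.standard_normal(40) * 5))

# z-scored Euclidean distance of each point to the multivariate mean
dist = np.hypot(x - x.mean(), y - y.mean())
z = (dist - dist.mean()) / dist.std()

# count the flagged points for a few candidate thresholds
counts = {}
for thresh in (2.0, 2.5, 3.0):
    counts[thresh] = int(np.sum(z > thresh))
    print(f"threshold {thresh}: {counts[thresh]} point(s) flagged")
```

A stricter (higher) threshold can never flag more points than a looser one, so the counts are non-increasing as the threshold grows.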
