Data distribution in statistics.


In statistics, a data distribution is a function that lists all the possible values of the data and shows how often each value occurs.
From this distribution it is possible to calculate the probability of any particular observation in the sample space, or the probability that an observation will have a value less than (or greater than) a point of interest.
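As a small illustration of reading probabilities off a distribution, the sketch below uses the standard normal distribution from scipy.stats. The choice of distribution and the point of interest are assumptions made only for this example.

```python
import scipy.stats as stats

# Standard normal distribution (mean 0, standard deviation 1), chosen for illustration
dist = stats.norm(loc=0, scale=1)

point = 1.0  # hypothetical point of interest

p_less = dist.cdf(point)                 # P(X < point)
p_greater = dist.sf(point)               # P(X > point), the same as 1 - cdf
p_between = dist.cdf(1) - dist.cdf(-1)   # P(-1 < X < 1)

print(f'P(X < {point}) = {p_less:.4f}')
print(f'P(X > {point}) = {p_greater:.4f}')
print(f'P(-1 < X < 1) = {p_between:.4f}')
```

The cumulative distribution function (cdf) gives the probability of falling below a point, and the survival function (sf) gives the probability of falling above it.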

Visualizing statistical distributions.


The function that describes the density of the values of our data is called a probability density function, or simply PDF. For discrete data the analogous function is the probability mass function (PMF).




Gaussian (normal) distribution:


The Gaussian distribution (also known as the normal distribution) is a bell-shaped curve that is symmetric about its mean. Many measurements are assumed to follow it, with values equally likely to fall above or below the mean.



import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# number of discretizations
N = 1001

x = np.linspace(-4,4,N)
gausdist = stats.norm.pdf(x)

plt.plot(x,gausdist)
plt.title('Analytic Gaussian (normal) distribution')
plt.show()

# the raw sum is not 1; multiply it by the grid spacing (x[1]-x[0]) to approximate the area
print(sum(gausdist))

Figure: Gaussian (normal) distribution
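The sum printed by the code above is much larger than 1, because a density has to be multiplied by the grid spacing before it approximates a probability. The sketch below, which reuses the same grid, shows one way to check this. The variable names dx, raw_sum and area are introduced only for this example.

```python
import numpy as np
import scipy.stats as stats

N = 1001
x = np.linspace(-4, 4, N)
dx = x[1] - x[0]                 # spacing between neighboring grid points

gausdist = stats.norm.pdf(x)

raw_sum = np.sum(gausdist)       # roughly 1/dx, not 1
area = np.sum(gausdist) * dx     # Riemann-sum approximation of the area under the curve

print(f'raw sum: {raw_sum:.2f}')
print(f'approximate area: {area:.4f}')
```

The area comes out slightly below 1 because the grid stops at -4 and 4 and leaves out the far tails.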


Uniform distribution:


In statistics, a uniform distribution is a probability distribution in which all outcomes are equally likely. A deck of cards is a good example: drawing a heart, a club, a diamond, or a spade is equally likely.



import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# parameters
stretch = 2  # width of the range (not the variance)
shift   = .5 # center of the range
n       = 10000

# create data
data = stretch*np.random.rand(n) + shift-stretch/2

# plot data
fig,ax = plt.subplots(2,1,figsize=(5,6))

ax[0].plot(data,'.',markersize=1)
ax[0].set_title('Uniform data values')

ax[1].hist(data,25)
ax[1].set_title('Uniform data histogram')

plt.show()

Figure: Uniform distribution
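To see what the stretch and shift parameters do, the sketch below compares the generated data with the values expected for a uniform distribution on the interval from shift - stretch/2 to shift + stretch/2. The seeded generator and the names lower and upper are assumptions added for reproducibility and readability.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, an assumption for reproducibility

stretch = 2   # width of the range
shift = .5    # center of the range
n = 10000

data = stretch * rng.random(n) + shift - stretch / 2

lower = shift - stretch / 2   # expected lower bound
upper = shift + stretch / 2   # expected upper bound

print(f'range: {data.min():.3f} to {data.max():.3f} (expected {lower} to {upper})')
print(f'mean: {data.mean():.3f} (expected {shift})')
print(f'variance: {data.var():.3f} (expected {stretch**2 / 12:.3f})')
```

The variance of a uniform distribution is the squared width divided by 12, which is why stretch itself is not the variance.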



Log-normal distribution:


A random variable is log-normally distributed if its logarithm is normally distributed. The distribution takes only positive values, which makes it important in probabilistic design, where negative values of engineering quantities are often physically impossible.



import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

N = 1001
x = np.linspace(0,10,N)
lognormdist = stats.lognorm.pdf(x,1)

plt.plot(x,lognormdist)
plt.title('Analytic log-normal distribution')
plt.show()

Figure: Log-normal distribution
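The definition can be checked numerically: if samples are log-normal, their logarithms should look normal. The sketch below draws samples with the same shape parameter as above (1). The sample size and the seeded generator are assumptions for this example.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # seeded generator, an assumption for reproducibility

sigma = 1  # shape parameter, the same value used in the plot above
samples = stats.lognorm.rvs(sigma, size=10000, random_state=rng)
logs = np.log(samples)

print('all samples positive:', samples.min() > 0)
print(f'mean and std of log(samples): {logs.mean():.3f}, {logs.std():.3f}')
print(f'skewness of samples: {stats.skew(samples):.2f}')
print(f'skewness of log(samples): {stats.skew(logs):.2f}')
```

The raw samples are strongly right-skewed, while their logarithms have a skewness close to zero, a mean close to 0 and a standard deviation close to 1.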


Binomial distribution:


A good example of a binomial distribution is the probability of getting K heads in N coin tosses, given a probability p of heads on each toss (e.g., p = .5 for a fair coin).



import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

n = 10 # number of coin tosses
p = .5 # probability of heads

x = range(n+1) # possible numbers of heads: 0 to n
bindist = stats.binom.pmf(x,n,p)

plt.bar(x,bindist)
plt.title('Binomial distribution (n=%s, p=%g)'%(n,p))
plt.show()

Figure: Binomial distribution
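The binomial probabilities follow the formula P(K = k) = C(n, k) p^k (1 - p)^(n - k). As a quick check, the sketch below computes the formula by hand and compares it with stats.binom.pmf. The names manual and library are used only in this example.

```python
from math import comb

import numpy as np
import scipy.stats as stats

n = 10  # number of coin tosses
p = .5  # probability of heads

k = np.arange(n + 1)  # possible numbers of heads: 0 to n

# binomial formula written out by hand
manual = np.array([comb(n, int(i)) * p**int(i) * (1 - p)**(n - int(i)) for i in k])

# the same probabilities from scipy
library = stats.binom.pmf(k, n, p)

print('formula matches scipy:', np.allclose(manual, library))
print('probabilities sum to:', library.sum())
print('P(5 heads) =', library[5])
```

For a fair coin, P(5 heads) is 252/1024, since there are 252 ways to choose which 5 of the 10 tosses come up heads.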


F distribution:


The F-distribution is the distribution of the ratio of two independent chi-squared variables, each divided by its degrees of freedom. The F-statistic is often used, for example in ANOVA and regression, to evaluate whether statistical models differ significantly.



import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# parameters
num_df = 5   # numerator degrees of freedom
den_df = 100 # denominator df

# values to evaluate 
x = np.linspace(0,10,10001)

# the distribution
fdist = stats.f.pdf(x,num_df,den_df)

plt.plot(x,fdist)
plt.title(f'F({num_df},{den_df}) distribution')
plt.xlabel('F value')
plt.show()

Figure: F distribution
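In practice the F-distribution is mostly used to find critical values and tail probabilities. The sketch below does this for the same degrees of freedom as above and then checks the result with a simulation that builds F values as a ratio of scaled chi-squared variables. The significance level of 0.05, the number of simulations and the seed are assumptions for this example.

```python
import numpy as np
import scipy.stats as stats

num_df = 5    # numerator degrees of freedom
den_df = 100  # denominator degrees of freedom
alpha = 0.05  # hypothetical significance level

# critical value: F values above it occur with probability alpha
crit = stats.f.ppf(1 - alpha, num_df, den_df)
tail = stats.f.sf(crit, num_df, den_df)

# simulate F values as a ratio of two scaled chi-squared variables
rng = np.random.default_rng(0)
n_sim = 100000
numerator = rng.chisquare(num_df, n_sim) / num_df
denominator = rng.chisquare(den_df, n_sim) / den_df
f_sim = numerator / denominator
frac_above = np.mean(f_sim > crit)

print(f'critical F value: {crit:.3f}')
print(f'analytic tail probability: {tail:.4f}')
print(f'simulated tail probability: {frac_above:.4f}')
```

The simulated fraction should land close to alpha, which illustrates where the shape of the curve comes from.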


T distribution:


The t-distribution, also known as Student's t-distribution, is a probability distribution that resembles the normal distribution with its bell shape but has heavier tails.



import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

x  = np.linspace(-4,4,1001)
df = 200
t  = stats.t.pdf(x,df)

plt.plot(x,t)
plt.xlabel('t-value')
plt.ylabel('P(t | H$_0$)')
plt.title('t(%g) distribution'%df)
plt.show()

Figure: T distribution
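With df = 200 the curve above is hard to tell apart from a normal curve, so the heavier tails are easier to see in numbers. The sketch below compares the two-sided probability of exceeding a value of 3 for several degrees of freedom. The value 3 and the list of degrees of freedom are arbitrary choices for this example.

```python
import scipy.stats as stats

t_value = 3.0  # hypothetical cutoff

# two-sided tail probability under the standard normal distribution
normal_tail = 2 * stats.norm.sf(t_value)

# the same probability under t-distributions with increasing degrees of freedom
t_tails = {}
for df in (2, 5, 30, 200):
    t_tails[df] = 2 * stats.t.sf(t_value, df)
    print(f't({df}): P(|t| > {t_value}) = {t_tails[df]:.4f}')

print(f'normal: P(|z| > {t_value}) = {normal_tail:.4f}')
```

The tail probability shrinks as the degrees of freedom grow and approaches the normal value from above.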


Checking distribution type:


You can check the distribution type using the Fitter library (don't forget to pip install fitter) and the simple code below. In this example we generate normally distributed random data using numpy. Fitter checks your data against different distributions, about 100 at the time I tested my code. If the test takes a long time, you can reduce the number of tested distributions: f = Fitter(data, distributions=['gamma', 'rayleigh', 'uniform', 'norm', 't', 'gennorm']). The names are the ones used by scipy.stats.



from fitter import Fitter
import numpy as np

data = np.random.normal(size=(1000))

f=Fitter(data)
f.fit()
f.summary()

Figure: checking distribution
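To get a feel for what such a check involves, here is a rough scipy-only sketch of the general idea: fit a few candidate distributions to the data and compare each fitted density with the histogram. This is not Fitter's actual implementation. The candidate list, the number of bins and the squared-error score are assumptions chosen for this example.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # seeded generator, an assumption for reproducibility
data = rng.normal(size=1000)

# histogram of the data, normalized so it can be compared with a density
density, edges = np.histogram(data, bins=30, density=True)
centers = (edges[:-1] + edges[1:]) / 2

candidates = ['norm', 'uniform', 'expon']  # scipy.stats distribution names
errors = {}
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)                 # maximum likelihood fit
    fitted = dist.pdf(centers, *params)     # fitted density at the bin centers
    errors[name] = np.sum((fitted - density) ** 2)

best = min(errors, key=errors.get)

for name, err in sorted(errors.items(), key=lambda item: item[1]):
    print(f'{name}: sum of squared errors = {err:.4f}')
print('best fit:', best)
```

The distribution with the smallest error is the best candidate among those tested, which says nothing about distributions that were not on the list.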


