Sampling variability in statistics.
In practice it is rarely feasible to measure every member of a population.
Instead, we work with a smaller group (a sample), use it to estimate the quantity of interest, and hope that the answer is not too far from the truth.
Sampling variability describes how much a value measured on a sample differs from the true population value, called the parameter, simply because of which items happened to be sampled.
In other words sampling variability is the extent to which the measures of a sample differ from the measure of the population.
A measure that refers to a sample is called a statistic.
The parameter of a population is fixed, but a statistic changes from sample to sample because each sample contains different items. With a large enough sample, the statistic generally gets close to the population parameter. This variability arises because the items in a population are not all the same, so different samples capture different values.
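As a minimal sketch of this idea, the snippet below draws five samples from one simulated population and compares each sample mean with the fixed population mean. The population (normal, mean 50, standard deviation 10), the sample size of 25, and the seed are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily, for reproducibility

# hypothetical population: 100,000 values with mean near 50 and SD near 10
population = rng.normal(loc=50.0, scale=10.0, size=100_000)
population_mean = population.mean()  # the parameter: computed once, never changes

# draw five independent samples of 25 items and record each sample mean
sample_means = [
    rng.choice(population, size=25, replace=False).mean() for _ in range(5)
]

print(f"population mean: {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean:   {m:.2f}")
```

Each run of the loop gives a different statistic, while the parameter stays the same.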
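Sampling variability is usually quantified by the variance or standard deviation of a statistic across repeated samples; the standard deviation of a statistic is commonly called its standard error. These measures are used in many statistical tests for data analysis.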
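One common way to put a number on sampling variability is the standard deviation of a statistic across many repeated samples, often called the standard error. The sketch below estimates it by simulation for the sample mean and compares it with the textbook value sigma / sqrt(n). The function name `empirical_standard_error`, the number of repeats, and the standard normal population are assumptions made for this example.

```python
import numpy as np

def empirical_standard_error(sample_size, num_repeats=5000, seed=0):
    """Standard deviation of the sample mean across many simulated samples."""
    rng = np.random.default_rng(seed)
    # each row is one sample drawn from a standard normal population (sigma = 1)
    samples = rng.standard_normal((num_repeats, sample_size))
    sample_means = samples.mean(axis=1)
    return sample_means.std(ddof=1)

sample_size = 30
observed_se = empirical_standard_error(sample_size)
theoretical_se = 1.0 / np.sqrt(sample_size)  # sigma / sqrt(n) with sigma = 1

print(f"observed: {observed_se:.4f}, theoretical: {theoretical_se:.4f}")
```

The two values should agree closely, and the agreement tightens as `num_repeats` grows.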
Theoretical distribution (population) and experiment data (sample):
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

## a theoretical normal distribution
x = np.linspace(-5,5,10101)
theoNormDist = stats.norm.pdf(x)
# (normalize to pdf)
# theoNormDist = theoNormDist*np.mean(np.diff(x))

# now for our experiment
numSamples = 40

# initialize
sampledata = np.zeros(numSamples)

# run the experiment!
for expi in range(numSamples):
    sampledata[expi] = np.random.randn()

# show the results (density=True scales the histogram to a probability density)
plt.hist(sampledata,density=True)
plt.plot(x,theoNormDist,'r',linewidth=3)
plt.xlabel('Data values')
plt.ylabel('Probability density')
plt.show()
Show the mean of samples of a known distribution:
# generate population data with known mean
populationN = 1000000
population = np.random.randn(populationN)
population = population - np.mean(population) # demean

# now we draw a random sample from that population
samplesize = 30

# the random indices to select from the population
sampleidx = np.random.randint(0,populationN,samplesize)
samplemean = np.mean(population[ sampleidx ])

### how does the sample mean compare to the population mean?
print(samplemean)
Sample means vs. sample sizes:
samplesizes = np.arange(30,1000)
samplemeans = np.zeros(len(samplesizes))

for sampi in range(len(samplesizes)):
    # nearly the same code as above
    sampleidx = np.random.randint(0,populationN,samplesizes[sampi])
    samplemeans[sampi] = np.mean(population[ sampleidx ])

# show the results!
plt.plot(samplesizes,samplemeans,'s-')
plt.plot(samplesizes[[0,-1]],[np.mean(population),np.mean(population)],'r',linewidth=3)
plt.xlabel('sample size')
plt.ylabel('mean value')
plt.legend(('Sample means','Population mean'))
plt.show()
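The plot above uses a single sample per sample size, so it can be hard to judge how much the scatter shrinks. As a rough complement, the sketch below repeats the draw many times at a few sample sizes and reports the spread of the resulting sample means. The sizes (30, 300, 3000), the 1000 repeats, and the standard normal population are assumptions for illustration, not part of the code above.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
num_repeats = 1000              # simulated samples per sample size
sample_sizes = [30, 300, 3000]

spreads = []
for n in sample_sizes:
    # draw num_repeats samples of size n and take the mean of each
    sample_means = rng.standard_normal((num_repeats, n)).mean(axis=1)
    # the standard deviation of those means measures sampling variability
    spreads.append(sample_means.std(ddof=1))

for n, spread in zip(sample_sizes, spreads):
    print(f"n = {n:5d}: spread of sample means = {spread:.4f}")
```

Under these assumptions the spread falls roughly with the square root of the sample size, so a tenfold larger sample narrows it by a factor of about three.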