Two-sample testing in statistics.
The two-sample t-test (or independent samples t-test) is one of the most commonly used hypothesis tests. It is applied to test whether the difference between the means of two groups is statistically significant.
Two-sample means that we have two sets of samples.
The exact formula, as implemented for example in Python's scipy.stats library, depends on whether the two groups are paired or unpaired, have equal or unequal variance, and have equal or unequal sample sizes.
So choosing the correct formula for the nature of the tested data is important. What every variant has in common is the numerator, which is the difference between the means of the two groups.
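To make the shared numerator concrete, here are the standard textbook forms of the three variants. The symbols are notation assumed for this sketch, not taken from the original text: \bar{x}_i is a sample mean, s_i^2 a sample variance, n_i a sample size, and for the paired case \bar{d} and s_d are the mean and standard deviation of the n within-subject differences.

```latex
% unpaired, equal variances (pooled standard deviation s_p)
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}

% unpaired, unequal variances (Welch)
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

% paired (note that \bar{d} = \bar{x}_1 - \bar{x}_2)
t = \frac{\bar{d}}{s_d / \sqrt{n}}
```

Only the denominator changes between the variants; the numerator is the difference of the group means in all three.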
Paired means that both samples consist of the same test subjects, e.g. testing a group of students before and after taking a drug. Unpaired means that the samples consist of distinct test subjects, e.g. testing a group of students taking a drug and a reference group of students. A common rule of thumb is that if the ratio of the larger variance to the smaller variance is less than 4, the variances can be treated as approximately equal.
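As a rough sketch of how these distinctions could map onto scipy.stats calls: the helper name choose_ttest and its max_var_ratio parameter are hypothetical, and the example data are invented purely for illustration. It uses stats.ttest_rel for paired data and stats.ttest_ind with equal_var set from the variance-ratio rule of thumb for unpaired data.

```python
import numpy as np
import scipy.stats as stats

def choose_ttest(x, y, paired=False, max_var_ratio=4.0):
    """Hypothetical helper: pick a t-test variant from the study design."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if paired:
        # same subjects measured twice, so the samples must line up one-to-one
        if x.shape != y.shape:
            raise ValueError("paired samples must have the same length")
        result = stats.ttest_rel(x, y)
        return "paired", result.statistic, result.pvalue
    # unpaired: compare the sample variances (ddof=1 gives the unbiased estimate);
    # this assumes both variances are nonzero
    v1, v2 = np.var(x, ddof=1), np.var(y, ddof=1)
    ratio = max(v1, v2) / min(v1, v2)
    equal_var = ratio < max_var_ratio
    result = stats.ttest_ind(x, y, equal_var=equal_var)
    name = "pooled (equal variances)" if equal_var else "Welch (unequal variances)"
    return name, result.statistic, result.pvalue

# illustrative data only
rng = np.random.default_rng(0)
before = rng.normal(1.0, 1.0, size=30)
after = before + rng.normal(0.2, 0.5, size=30)   # same subjects, second measurement
group_a = rng.normal(1.0, 1.0, size=30)
group_b = rng.normal(1.2, 3.0, size=40)          # distinct subjects, larger spread

print(choose_ttest(before, after, paired=True))
print(choose_ttest(group_a, group_b))
```

The variance-ratio threshold is only a heuristic. Passing equal_var=False (Welch's test) is a reasonable default whenever the equal-variance assumption is in doubt.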
Generate the data for a two-sample t-test:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# parameters
n1 = 30    # samples in dataset 1
n2 = 40    # ...and 2
mu1 = 1    # population mean in dataset 1
mu2 = 1.2  # population mean in dataset 2

# generate the data
data1 = mu1 + np.random.randn(n1)
data2 = mu2 + np.random.randn(n2)

# show their histograms
plt.hist(data1,bins='fd',color=[1,0,0,.5],label='Data 1')
plt.hist(data2,bins='fd',color=[0,0,1,.5],label='Data 2')
plt.xlabel('Data value')
plt.ylabel('Count')
plt.legend()
plt.show()
T-test using the Python scipy library:
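# independent-samples t-test assuming equal variances (pooled)
t,p = stats.ttest_ind(data1,data2,equal_var=True)

# degrees of freedom for the pooled test
df = n1+n2-2

print('t(%g) = %g, p=%g'%(df,t,p))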
OUT: t(68) = 0.0974228, p=0.922677
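To see what stats.ttest_ind does with equal_var=True, the pooled statistic can be recomputed by hand and compared with SciPy's result. This is a minimal sketch: the seed and the variable names t_manual and p_manual are assumptions made for reproducibility, so the numbers will not match the output shown above.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)   # seed chosen only so the sketch is reproducible
n1, n2 = 30, 40
data1 = 1.0 + rng.standard_normal(n1)
data2 = 1.2 + rng.standard_normal(n2)

# numerator: difference of the sample means
mean_diff = data1.mean() - data2.mean()

# denominator: pooled standard deviation scaled by the sample sizes
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * data1.var(ddof=1) + (n2 - 1) * data2.var(ddof=1)) / df
t_manual = mean_diff / (np.sqrt(pooled_var) * np.sqrt(1 / n1 + 1 / n2))

# two-sided p-value from the t distribution with df degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# reference values from SciPy
t_scipy, p_scipy = stats.ttest_ind(data1, data2, equal_var=True)

print(f"manual: t({df}) = {t_manual:g}, p = {p_manual:g}")
print(f"scipy:  t({df}) = {t_scipy:g}, p = {p_scipy:g}")
```

The two printed lines should agree up to floating-point rounding.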
T-values as a function of mean difference and variance:
# ranges for t-value parameters
meandiffs = np.linspace(-3,3,80)
pooledvar = np.linspace(.5,4,100)

# group sample size
n1 = 40
n2 = 30

# initialize output matrix
allTvals = np.zeros((len(meandiffs),len(pooledvar)))

# loop over the parameters...
for meani in range(len(meandiffs)):
    for vari in range(len(pooledvar)):

        # t-value denominator (both groups share the same variance here)
        df = n1 + n2 - 2
        s = np.sqrt(( (n1-1)*pooledvar[vari] + (n2-1)*pooledvar[vari]) / df)
        t_den = s * np.sqrt(1/n1 + 1/n2)

        # t-value in the matrix
        allTvals[meani,vari] = meandiffs[meani] / t_den

# origin='lower' puts row 0 (the most negative mean difference) at the bottom,
# so the image matches the y-axis labels given by extent
plt.imshow(allTvals,vmin=-4,vmax=4,origin='lower',
           extent=[pooledvar[0],pooledvar[-1],meandiffs[0],meandiffs[-1]],aspect='auto')
plt.xlabel('Variance')
plt.ylabel('Mean differences')
plt.colorbar()
plt.title('t-values as a function of difference and variance')
plt.show()
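Because both groups are given the same variance in this simulation, the pooled standard deviation reduces to the square root of that variance, and the double loop can be replaced by NumPy broadcasting. The following is one possible sketch, where the name allTvals_vec is an assumption, and is not the only way to write it.

```python
import numpy as np

# same parameter ranges and group sizes as in the loop version
meandiffs = np.linspace(-3, 3, 80)
pooledvar = np.linspace(0.5, 4, 100)
n1, n2 = 40, 30

# with equal variances in both groups, the pooled standard deviation is sqrt(variance)
t_den = np.sqrt(pooledvar) * np.sqrt(1 / n1 + 1 / n2)   # one denominator per variance

# broadcasting: a column of mean differences divided by a row of denominators
allTvals_vec = meandiffs[:, np.newaxis] / t_den[np.newaxis, :]

print(allTvals_vec.shape)   # rows: mean differences, columns: variances
```

The result shows the same pattern as the image: the t-value grows linearly with the mean difference and shrinks with the square root of the variance.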