Permutation testing in statistics.
Permutation tests (also known as re-randomization tests) are non-parametric tests used when we don't know much about the nature of the data. The theoretical difference between permutation tests and inferential tests is that with permutation tests we build the sampling distribution from the observed data, rather than inferring or assuming that a particular sampling distribution exists.
The distribution of the test statistic under the null hypothesis is constructed by computing the test statistic under all possible rearrangements of the observed data points, so a permutation test can be considered a form of resampling.
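For very small samples, "all possible rearrangements" can be enumerated exhaustively instead of sampled by random shuffling. A minimal sketch of an exact permutation test on two tiny made-up groups (the values in `a` and `b` are illustrative, not from the example below):

```python
import itertools
import numpy as np

# two tiny made-up groups
a = np.array([2.1, 3.4, 1.8])
b = np.array([4.0, 5.2, 3.9])
pooled = np.concatenate((a, b))
obs = b.mean() - a.mean()

# enumerate every way of assigning 3 of the 6 pooled values to group "a"
diffs = []
for idx in itertools.combinations(range(len(pooled)), len(a)):
    mask = np.zeros(len(pooled), dtype=bool)
    mask[list(idx)] = True
    diffs.append(pooled[~mask].mean() - pooled[mask].mean())

# exact p-value: fraction of rearrangements at least as extreme as observed
p_exact = np.mean(np.abs(diffs) >= abs(obs))
print(len(diffs), p_exact)  # C(6,3) = 20 rearrangements
```

With n choose k growing quickly, exhaustive enumeration only works for very small samples; the random-shuffling approach below approximates the same distribution for larger data.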
The method simply generates the distribution of mean differences under the assumption that the two groups do not differ on the measured variable.
Permutation tests are considered an alternative to the classical two-sample t-test.
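As a quick sanity check of that claim, a permutation p-value usually lands close to the classical t-test p-value when the data are roughly normal. A minimal sketch (variable names and the simulated data are my own, not from the example below):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 50)
y = rng.normal(0.6, 1.0, 50)

# classical two-sample t-test
t_p = stats.ttest_ind(x, y).pvalue

# permutation test on the difference of means (two-tailed)
pooled = np.concatenate((x, y))
obs = y.mean() - x.mean()
nperm = 5000
count = 0
for _ in range(nperm):
    perm = rng.permutation(pooled)           # shuffled copy of the pooled data
    diff = perm[len(x):].mean() - perm[:len(x)].mean()
    if abs(diff) >= abs(obs):
        count += 1
p_perm = count / nperm
print(t_p, p_perm)  # the two p-values should be of the same order
```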
Generating initial data for permutation testing:
```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# number of trials
N = 100

# dataset "A"
r = np.random.randn(N)
r[r>0] = np.log(1+r[r>0])
dataA = 26 - r*10

# get histogram values for later comparison
yA,xA = np.histogram(dataA,20)
xA = (xA[:-1]+xA[1:])/2

# dataset "B"
r = np.random.randn(N)
r[r>0] = np.log(1+r[r>0])
dataB = 30 - r*10

# get histogram values for later comparison
yB,xB = np.histogram(dataB,20)
xB = (xB[:-1]+xB[1:])/2

plt.stem(xA,yA,'b',markerfmt='bo',basefmt=' ',label='Data "A"')
plt.stem(xB,yB,'r',markerfmt='ro',basefmt=' ',label='Data "B"')
plt.legend()
plt.show()
```
Generating null hypothesis scenario:
```python
## mix datasets together

# concatenate trials
alldata = np.hstack((dataA,dataB))

# condition labels
conds = np.hstack((np.ones(N),2*np.ones(N)))

# random permutation of the indices 0..2N-1
fakeconds = np.random.permutation(N*2)

# shuffled condition labels: first N indices -> 1, the rest -> 2
fakeconds[fakeconds<N] = 1
fakeconds[fakeconds>=N] = 2

# these two means should be different.
print([np.mean(alldata[conds==1]), np.mean(alldata[conds==2])])

# should these two be different?
print([np.mean(alldata[fakeconds==1]), np.mean(alldata[fakeconds==2])])
```
OUT: [27.54863852696174, 31.75392717918671]
Distribution of null hypothesis values:
```python
nPerms = 1000
permdiffs = np.zeros(nPerms)

for permi in range(nPerms):
    fconds = np.random.permutation(N*2)
    fconds[fconds<N] = 1
    fconds[fconds>=N] = 2
    permdiffs[permi] = np.mean(alldata[fconds==2]) - np.mean(alldata[fconds==1])

# plot the distribution of H0 values
plt.hist(permdiffs,50)

# and plot the observed value on top
obsval = np.mean(alldata[conds==2]) - np.mean(alldata[conds==1])
plt.plot([obsval, obsval],[0, 50],'m',linewidth=10)

plt.xlabel('Value')
plt.ylabel('Count')
plt.show()
```
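Because the null distribution is built from a finite number of random shuffles, the resulting p-value carries Monte Carlo error of its own. A rough sketch of its standard error under a binomial approximation (the helper name `mc_se` is mine, not from the text):

```python
import numpy as np

def mc_se(p_hat, n_perms):
    """Standard error of a Monte Carlo p-value estimate (binomial approximation)."""
    return np.sqrt(p_hat * (1.0 - p_hat) / n_perms)

# e.g. a p-value of 0.05 estimated from 1000 shuffles
print(mc_se(0.05, 1000))  # ~0.0069, i.e. roughly 0.05 +/- 0.007
```

This is one way to decide whether nPerms = 1000 is enough: increase it until the standard error is small relative to the significance threshold you care about.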
Two methods of evaluating statistical significance for permutation testing:
```python
# Z-value relative to the null distribution
zVal = ( obsval - np.mean(permdiffs) ) / np.std(permdiffs,ddof=1)
p = 1 - stats.norm.cdf(abs(zVal))

# p-value from the permutation count (one-tailed)
pCount = np.sum(permdiffs > obsval) / nPerms

print(p, pCount)
```
OUT: 0.0008127353958568007 0.001
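Two caveats with the count-based p-value above: `permdiffs > obsval` is a one-tailed comparison, and a raw count can come out exactly 0, which overstates the evidence. A common remedy (I believe this is the add-one correction of Phipson and Smyth, though that attribution is mine, not the text's) is to count two-tailed exceedances and add one to numerator and denominator. A sketch with a stand-in null distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
permdiffs = rng.normal(0, 1, 1000)  # stand-in null distribution
obsval = 3.2                        # stand-in observed statistic

# two-tailed count with the +1 correction: p can never be exactly 0
p_two = (np.sum(np.abs(permdiffs) >= abs(obsval)) + 1) / (len(permdiffs) + 1)
print(p_two)
```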