## Bootstrapping in statistics

Bootstrapping is a statistical method or procedure that resamples a single dataset to create many new simulated samples.

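Bootstrapping is sometimes called a resampling-with-replacement method: each simulated sample is drawn from the original data with replacement, so the same observation can appear more than once.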
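To make "resampling with replacement" concrete, here is a minimal sketch. The toy array `data` and the seed are made up for illustration. The simulated sample has the same size as the original, and individual observations may appear several times or not at all.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this illustration

data = np.array([2.1, 3.4, 5.0, 7.7, 9.2])  # hypothetical toy dataset

# one bootstrap resample: same size as the data, drawn WITH replacement
resample = rng.choice(data, size=data.size, replace=True)

print(resample)         # every value comes from `data`; repeats are allowed
print(resample.mean())  # the statistic recomputed on the simulated sample
```

Repeating this step many times and collecting the recomputed statistic gives the bootstrap distribution used in the rest of this section.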

Bootstrapping vs. the analytical method for calculating confidence intervals:

PROS/Advantages:

- Can be calculated for any kind of parameter: mean, variance, correlation, median, etc.

- Very handy when data are limited, for example with a small number of experiments.

- Does not rely on an assumption of normality.
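As a rough illustration of the first point, the same percentile procedure can be wrapped around any statistic. The helper `bootstrap_ci` below is a hypothetical name introduced here, not a library function. The data are squared, scaled normal draws, chosen only to get a clearly non-normal sample.

```python
import numpy as np

def bootstrap_ci(data, statistic, confidence=95, num_boots=1000, rng=None):
    """Percentile bootstrap CI for an arbitrary statistic (illustrative helper)."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    # recompute the statistic on many resamples drawn with replacement
    boot_stats = np.array([
        statistic(rng.choice(data, size=data.size, replace=True))
        for _ in range(num_boots)
    ])
    tail = (100 - confidence) / 2
    return np.percentile(boot_stats, [tail, 100 - tail])

rng = np.random.default_rng(1)                # arbitrary seed
sample = (4 * rng.standard_normal(40)) ** 2   # skewed, non-normal sample

median_ci = bootstrap_ci(sample, np.median, rng=rng)
var_ci = bootstrap_ci(sample, lambda x: np.var(x, ddof=1), rng=rng)
print('Median CI:  ', median_ci)
print('Variance CI:', var_ci)
```

Only the `statistic` argument changes between the two calls. No separate formula is needed for the median or the variance.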

CONS/Disadvantages:

- Gives slightly different results on each run, because the resampling is random.

- Time-consuming for large datasets.

- Assumes that the given sample is a good representation of the true population.
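The run-to-run variation mentioned above comes from the random number generator, so fixing its seed makes a result reproducible. A small sketch follows, assuming NumPy's `default_rng`. The helper `boot_mean_ci` and the seeds are hypothetical, picked only for this demonstration.

```python
import numpy as np

def boot_mean_ci(data, seed, num_boots=1000, confidence=95):
    """Percentile bootstrap CI of the mean for a given seed (illustrative helper)."""
    rng = np.random.default_rng(seed)
    boot_means = np.array([
        rng.choice(data, size=data.size, replace=True).mean()
        for _ in range(num_boots)
    ])
    tail = (100 - confidence) / 2
    return np.percentile(boot_means, [tail, 100 - tail])

# a fixed, skewed toy sample so that only the resampling changes between runs
data = (4 * np.random.default_rng(42).standard_normal(40)) ** 2

ci_a = boot_mean_ci(data, seed=1)
ci_b = boot_mean_ci(data, seed=2)
ci_a_again = boot_mean_ci(data, seed=1)

print(ci_a, ci_b)                        # different seeds: similar but not identical intervals
print(np.array_equal(ci_a, ci_a_again))  # same seed: the interval is reproduced exactly
```

Increasing `num_boots` shrinks the difference between runs, at the cost of more computation.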

The code below gives an example of calculating confidence intervals using bootstrapping and compares the results with the analytical method.

## Generating initial data for CI calculation:

```
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
from matplotlib.patches import Polygon
## simulate data
popN = int(1e7) # lots and LOTS of data!!
# the data (note: non-normal!)
population = (4*np.random.randn(popN))**2
# we can calculate the exact population mean
popMean = np.mean(population)
# let's see it
fig,ax = plt.subplots(2,1,figsize=(6,4))
# only plot every 1000th sample
ax[0].plot(population[::1000],'k.')
ax[0].set_xlabel('Data index')
ax[0].set_ylabel('Data value')
ax[1].hist(population,bins='fd')
ax[1].set_ylabel('Count')
ax[1].set_xlabel('Data value')
plt.show()
```

## Drawing a random sample with confidence intervals - bootstrapping:

```
# parameters
samplesize = 40
confidence = 95 # in percent
# draw a random sample from the population and compute its mean
randSamples = np.random.randint(0,popN,samplesize)
sampledata = population[randSamples]
samplemean = np.mean(sampledata)
samplestd = np.std(sampledata,ddof=1) # sample std (ddof=1), used later for analytic solution
### now for bootstrapping
numBoots = 1000
bootmeans = np.zeros(numBoots)
# resample with replacement
for booti in range(numBoots):
    # np.random.choice samples with replacement by default
    bootmeans[booti] = np.mean( np.random.choice(sampledata,samplesize) )
# find confidence intervals
confint = [0,0] # initialize
confint[0] = np.percentile(bootmeans,(100-confidence)/2)
confint[1] = np.percentile(bootmeans,100-(100-confidence)/2)
## graph everything
fig,ax = plt.subplots(1,1)
# start with histogram of resampled means
y,x = np.histogram(bootmeans,40)
y = y/max(y)
x = (x[:-1]+x[1:])/2
ax.bar(x,y,label='Empirical dist.')
# shaded rectangle spanning the confidence interval
y = np.array([ [confint[0],0],[confint[1],0],[confint[1],1],[confint[0],1] ])
p = Polygon(y,facecolor='g',alpha=.3,label='%g%% CI region'%confidence)
ax.add_patch(p)
# now add the lines
ax.plot([popMean,popMean],[0, 1.5],'k:',linewidth=2,label='True mean')
ax.plot([samplemean,samplemean],[0, 1],'r--',linewidth=3,label='Sample mean')
ax.set_xlim([popMean-30, popMean+30])
ax.set_yticks([])
ax.set_xlabel('Data values')
# labels are attached to each artist above, so the legend order cannot get mixed up
ax.legend()
plt.show()
```
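The resampling loop above can also be written without an explicit Python loop, which helps when `numBoots` is large. This is a sketch, not the original author's code. It draws all resampling indices at once, and the `sampledata` here is a hypothetical stand-in so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# stand-in for `sampledata` from the code above (hypothetical values)
sampledata = (4 * rng.standard_normal(40)) ** 2
samplesize = sampledata.size
numBoots = 1000
confidence = 95  # in percent

# draw all resampling indices at once: one row per bootstrap sample
idx = rng.integers(0, samplesize, size=(numBoots, samplesize))
bootmeans = sampledata[idx].mean(axis=1)

# percentile confidence interval, as in the loop version
tail = (100 - confidence) / 2
confint = np.percentile(bootmeans, [tail, 100 - tail])
print('Empirical: %g - %g' % (confint[0], confint[1]))
```

The trade-off is memory: the index array holds `numBoots * samplesize` integers, so for very large samples a loop or batching may still be preferable.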

## Confidence intervals - bootstrapping vs. analytic formula:

```
## compare against the analytic confidence interval
# compute confidence intervals
citmp = (1-confidence/100)/2
confint2 = samplemean + stats.t.ppf([citmp, 1-citmp],samplesize-1) * samplestd/np.sqrt(samplesize)
print('Empirical: %g - %g'%(confint[0],confint[1]))
print('Analytic: %g - %g'%(confint2[0],confint2[1]))
```

Example output (exact values vary from run to run, because the sample and the resamples are random):

Empirical: 10.6507 - 23.3089

Analytic: 9.96088 - 22.5929
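For reference, the analytic interval in the comparison code corresponds to the standard t-based confidence interval for a mean. The symbols are introduced here for illustration: $\bar{x}$ is the sample mean (`samplemean`), $s$ the sample standard deviation (`samplestd`), $n$ the sample size (`samplesize`), and $\alpha = 1 - \text{confidence}/100$.

$$
\mathrm{CI} = \bar{x} \pm t_{1-\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}
$$

With $n = 40$ and 95% confidence, the critical value $t_{0.975,\,39}$ is roughly 2.02. This interval is symmetric around the sample mean by construction, whereas the bootstrap percentile interval does not have to be. That is one reason the two results differ for skewed data such as these.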