Bootstrapping in statistics.

Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples.
Because each simulated sample is drawn from the original data with replacement, bootstrapping is also called resampling with replacement.
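
To make "resampling with replacement" concrete, here is a minimal sketch using NumPy. The five-value array `data` and the seed are illustrative assumptions, not part of the example further down.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an arbitrary choice) for repeatability

data = np.array([2.0, 4.0, 4.5, 7.0, 9.0])  # hypothetical observed sample

# one bootstrap resample: same size as the data, drawn WITH replacement,
# so some values can appear more than once and others not at all
resample = rng.choice(data, size=data.size, replace=True)

print('original:', data, 'mean =', data.mean())
print('resample:', resample, 'mean =', resample.mean())
```

Repeating this draw many times and recording the statistic of interest each time gives the bootstrap distribution of that statistic.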
Bootstrapping vs. the analytical method for calculating confidence intervals.

Confidence intervals via bootstrapping.

Advantages:

- Can be calculated for any kind of parameter: mean, variance, correlation, median, etc.
- Very handy for limited data or a small number of experiments.
- Not based on an assumption of normality.

Disadvantages:

- Gives slightly different results each time it is run, because the resampling is random.
- Time-consuming for large datasets.
- Assumes that the given sample is a good representation of the true population.
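
As a rough illustration of the points above, the sketch below wraps the percentile bootstrap in a helper that accepts any statistic. The function name `bootstrap_ci`, its parameters, the exponential sample, and the seeds are all assumptions made for this example.

```python
import numpy as np

def bootstrap_ci(data, statistic, confidence=95, num_boots=1000, rng=None):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    boot_stats = np.empty(num_boots)
    for i in range(num_boots):
        # resample with replacement, keeping the original sample size
        resample = rng.choice(data, size=data.size, replace=True)
        boot_stats[i] = statistic(resample)
    tail = (100 - confidence) / 2
    return np.percentile(boot_stats, [tail, 100 - tail])

# hypothetical small, skewed (non-normal) sample
sample = np.random.default_rng(1).exponential(scale=5.0, size=40)

# the same helper works for different statistics
ci_median_a = bootstrap_ci(sample, np.median, rng=np.random.default_rng(10))
ci_var = bootstrap_ci(sample, np.var, rng=np.random.default_rng(10))

# a second run with a different seed can give slightly different bounds
ci_median_b = bootstrap_ci(sample, np.median, rng=np.random.default_rng(11))

print('median CI (run a):', ci_median_a)
print('median CI (run b):', ci_median_b)
print('variance CI:      ', ci_var)
```

Passing the statistic as a function is what makes the method general: nothing in the resampling step depends on which parameter is being estimated.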

The code below gives an example of calculating confidence intervals using bootstrapping and compares the results with the analytical method.

Generating initial data for CI calculation:

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
from matplotlib.patches import Polygon

## simulate data

popN = int(1e7)  # lots and LOTS of data!!

# the data (note: non-normal!)
population = (4*np.random.randn(popN))**2

# we can calculate the exact population mean
popMean = np.mean(population)

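# let's see it
fig,ax = plt.subplots(2,1,figsize=(6,4))

# only plot every 1000th sample
ax[0].plot(population[::1000],'k.')
ax[0].set_xlabel('Data index')
ax[0].set_ylabel('Data value')

# distribution of the data values (histogram call assumed from the axis label;
# the same subsample is used to keep plotting fast)
ax[1].hist(population[::1000],bins='fd')
ax[1].set_xlabel('Data value')
ax[1].set_ylabel('Count')
plt.tight_layout()
plt.show()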
Figure: Generating initial data for CI calculation.

Drawing a random sample with confidence intervals - bootstrapping:

# parameters
samplesize = 40
confidence = 95 # in percent

# draw a random sample and compute its mean and standard deviation
randSamples = np.random.randint(0,popN,samplesize)
sampledata  = population[randSamples]
samplemean  = np.mean(sampledata)
samplestd   = np.std(sampledata,ddof=1) # sample std (ddof=1), used later for analytic solution

### now for bootstrapping
numBoots  = 1000
bootmeans = np.zeros(numBoots)

# resample with replacement
for booti in range(numBoots):
    bootmeans[booti] = np.mean( np.random.choice(sampledata,samplesize) )

# find confidence intervals
confint = [0,0] # initialize
confint[0] = np.percentile(bootmeans,(100-confidence)/2)
confint[1] = np.percentile(bootmeans,100-(100-confidence)/2)

## graph everything
fig,ax = plt.subplots(1,1)

# start with histogram of resampled means
y,x = np.histogram(bootmeans,40)
y = y/max(y)           # normalize bar heights to a maximum of 1
x = (x[:-1]+x[1:])/2   # convert bin edges to bin centers
ax.bar(x,y,width=x[1]-x[0],label='Empirical dist.')

# shade the confidence interval region
y = np.array([ [confint[0],0],[confint[1],0],[confint[1],1],[confint[0],1] ])
p = Polygon(y,facecolor='g',alpha=.3,label='%g%% CI region'%confidence)
ax.add_patch(p)

# now add the lines
ax.plot([popMean,popMean],[0, 1.5],'k:',linewidth=2,label='True mean')
ax.plot([samplemean,samplemean],[0, 1],'r--',linewidth=3,label='Sample mean')
ax.set_xlim([popMean-30, popMean+30])
ax.set_xlabel('Data values')
ax.legend()
plt.show()

Figure: Drawing a random sample with confidence intervals - bootstrapping.
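
The resampling loop above can also be written without an explicit Python loop, which may help with the run time for larger samples. The following is a self-contained sketch under stated assumptions: `sampledata` here is a freshly generated stand-in for the sample drawn above, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for repeatability only

# stand-in for `sampledata`: 40 draws from the same kind of non-normal distribution
samplesize = 40
sampledata = (4 * rng.standard_normal(samplesize)) ** 2

numBoots = 1000
confidence = 95  # in percent

# draw all resamples at once: each row holds the indices of one bootstrap resample
idx = rng.integers(0, samplesize, size=(numBoots, samplesize))
bootmeans = sampledata[idx].mean(axis=1)

# percentile confidence interval, as in the loop-based version
tail = (100 - confidence) / 2
confint = np.percentile(bootmeans, [tail, 100 - tail])

print('Sample mean: %g' % sampledata.mean())
print('Bootstrap %g%% CI: %g - %g' % (confidence, confint[0], confint[1]))
```

The trade-off is memory: the index array has `numBoots * samplesize` entries, so for very large samples the loop may still be the safer choice.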

Confidence intervals - bootstrapping vs. formula:

## compare against the analytic confidence interval

# compute confidence intervals
citmp = (1-confidence/100)/2
confint2 = samplemean + stats.t.ppf([citmp, 1-citmp],samplesize-1) * samplestd/np.sqrt(samplesize)

print('Empirical: %g - %g'%(confint[0],confint[1]))
print('Analytic:  %g - %g'%(confint2[0],confint2[1]))

Empirical: 10.6507 - 23.3089
Analytic: 9.96088 - 22.5929
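
For reference, the two intervals compared above can be written out as follows. The notation is an assumption introduced here: $\bar{x}$ is `samplemean`, $s$ is `samplestd`, $n$ is `samplesize`, $\alpha = 1 - \text{confidence}/100$, $t_{p,\,n-1}$ is the $p$-quantile of Student's t distribution with $n-1$ degrees of freedom (`stats.t.ppf`), and $\hat{q}_{p}$ is the empirical $p$-quantile of the $B$ bootstrap means $\bar{x}^{*}_{1},\dots,\bar{x}^{*}_{B}$ (`np.percentile` on `bootmeans`).

```latex
\text{Analytic:}\qquad
\mathrm{CI} = \left[\; \bar{x} + t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}},\;\;
                      \bar{x} + t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \;\right]

\text{Bootstrap (percentile):}\qquad
\mathrm{CI} = \left[\; \hat{q}_{\alpha/2}\!\left(\bar{x}^{*}_{1},\dots,\bar{x}^{*}_{B}\right),\;\;
                      \hat{q}_{1-\alpha/2}\!\left(\bar{x}^{*}_{1},\dots,\bar{x}^{*}_{B}\right) \;\right]
```

Because the t distribution is symmetric, $t_{\alpha/2,\,n-1} = -t_{1-\alpha/2,\,n-1}$, so the analytic interval is always centered on $\bar{x}$. The bootstrap interval does not have to be, which is one reason the two sets of bounds can differ for skewed data.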
