Bootstrap methods are powerful techniques used in non-parametric statistics, that is, whenever we are provided with data drawn from an unknown distribution.

The underlying issue that bootstrap is meant to address is the well-known problem of statistics: we want to collect information about a population, but we are provided only with a sample of it. How can we be sure that this sample is representative of the whole population? Namely, if we compute the mean of our sample, does it approximate the true mean of the population well?

Moreover, we have to consider that, besides statistics such as means, standard deviations, Pearson coefficients and so on, in a non-parametric framework the cumulative distribution function (CDF) is itself unknown, hence it automatically becomes one of the parameters to infer from our sample.

So the idea of bootstrap is that, instead of estimating our statistic only once, on the one sample realization we obtained, we can estimate it many times, each time on a re-sampling (with replacement) of the original sample. Repeating this B times, we obtain a vector of estimates of length B, from which we can compute the expected value, variance, empirical distribution and so forth.
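To make this concrete before the worked example, here is a minimal generic sketch (my own illustration, not code from the original post): a hypothetical helper, bootstrap_statistic, that re-samples with replacement B times and collects the B estimates of an arbitrary statistic.

import numpy as np

def bootstrap_statistic(data, statistic, B=10000, seed=None):
    # Hypothetical helper (my naming): re-sample `data` with replacement
    # B times and collect the B estimates of `statistic`.
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(B)
    for b in range(B):
        resample = rng.choice(data, size=n, replace=True)  # one bootstrap re-sample
        estimates[b] = statistic(resample)
    return estimates

# usage, e.g.: bootstrap_statistic(np.arange(10), np.mean, B=1000)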

So let’s proceed to compute the bootstrap mean, starting from a sample (X1,…,Xn) of independent and identically distributed random variables (we will hold this assumption for the whole procedure), drawn from a population according to an unknown distribution function F(x).

  • We first work out the empirical distribution function, which is given by:

F̂n(t) = (1/n) · Σ 1{Xi ≤ t},  with the sum running over i = 1, …, n

where 1{Xi ≤ t} is the indicator function, which takes value 1 if Xi is less than or equal to t, and 0 otherwise. An example of how an empirical CDF might look is the following:

As you can see, each step has the same size, equal to 1/n (the short plotting sketch after this list reproduces such a staircase).

  • Then we draw, from our empirical CDF, a new sample of the same size as the original one. However, since each step of our empirical CDF has the same height (1/n), sampling from the empirical CDF is the same as re-sampling (with replacement and equal probabilities) from the original sample. We denote the re-sampled vector by (X*1, …, X*n).
  • We repeat this re-sampling B times and, for each re-sampled vector X*, we compute our statistic of interest (in our case, the sample mean). Note that this phase amounts to applying Monte Carlo methods to bootstrapping.
  • If n (the sample size) and B (the number of re-samplings) are sufficiently large, we can rely on the asymptotic properties of sums of random variables (in particular, the Central Limit Theorem) and work out the distribution of our statistic. The latter is pivotal if we want to run hypothesis tests about how close our statistic is likely to be to the true value of the parameter.
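As a quick illustration of the first step, here is a minimal sketch (my addition; the sample x and the plotting choices are arbitrary) that computes and plots an empirical CDF as a staircase with steps of height 1/n:

import numpy as np
import matplotlib.pyplot as plt

x = np.sort(np.random.randn(20))           # a small i.i.d. sample, sorted
ecdf = np.arange(1, len(x) + 1) / len(x)   # each step has height 1/n
plt.step(x, ecdf, where='post')            # the step-shaped empirical CDF
plt.xlabel('t')
plt.ylabel('empirical CDF')
plt.show()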

The following diagram might also help illustrate the steps above:

[Diagram: original sample (X1, …, Xn) → B re-samples with replacement (X*1, …, X*n) → B estimates of the statistic → their empirical distribution]

Nice, now let’s implement it with Python.

For this purpose, I will generate a random vector which will play the role of our population, treating its distribution law as unknown. Then, I will pick a sample from this population and apply the bootstrap procedure:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)
pop = np.random.randint(0, 500, size=1000)  # our "population" of 1,000 values
sample = np.random.choice(pop, size=300)    # so n=300

Now I should compute the empirical CDF, so that I can sample from it. However, as we said above, sampling from the empirical CDF is the same as re-sampling with replacement from our original sample, hence:

sample_mean = []
for _ in range(10000):  # so B=10000
    sample_n = np.random.choice(sample, size=300)  # re-sample with replacement (the default)
    sample_mean.append(sample_n.mean())  # store this re-sample's mean

I basically created an empty list and, for each re-sampling of my initial sample, appended its mean to that list. Now let's have a look at the distribution and expected value of our vector of means (which is nothing but a random variable itself):

plt.hist(sample_mean)
plt.show()

The distribution seems normal, and this is exactly what we were expecting: by the Central Limit Theorem, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.
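Given this approximately normal shape, we can also read interval estimates straight off the bootstrap distribution. A minimal sketch, assuming the sample_mean list from above is still in scope (the percentile method used here is one common choice among several):

boot = np.array(sample_mean)
print(boot.std(ddof=1))                  # bootstrap standard error of the mean
print(np.percentile(boot, [2.5, 97.5]))  # 95% percentile confidence interval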

Then, if we compare the true mean with the bootstrapped one, we obtain:

np.mean(sample_mean)  # bootstrapped mean
255.73952966666664
pop.mean()            # true population mean
253.241

As you can see, it is pretty accurate. Plus, if we retrieve the mean of the original sample:

sample.mean()
241.07333333333332

We see that the latter is noticeably less accurate. Probably, the original sample drawn from the population was not that representative.
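As a side note (my addition, assuming sample and sample_mean are still in scope), the gap between the mean of the bootstrap estimates and the original sample mean is the standard bootstrap estimate of the bias of our estimator:

bias_hat = np.mean(sample_mean) - sample.mean()  # bootstrap estimate of the bias
print(bias_hat)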

Bootstrap sampling is an important technique for bypassing the issues of the non-parametric approach. Indeed, even though with a non-parametric approach we are "relaxing" some strict assumptions needed in a parametric framework, we pay for this extra flexibility in terms of the difficulty of estimating the population's features.
