Tuesday, March 03, 2009

Introduction to the Percentile Bootstrap

Introduction

This article introduces the percentile bootstrap, the simplest of the bootstrap methods. The bootstrap family of techniques is used to establish confidence intervals and to calculate hypothesis tests for statistical measures.

Problem Statement and Conventional Solution

One is often required to summarize a set of data, such as the following:


>> X = [2; 10; 10; 5; 8; 1; 4; 9; 8; 10]

X =

2
10
10
5
8
1
4
9
8
10



The most commonly used summary is the mean, calculated in MATLAB thus:

>> mean(X)

ans =

6.7000




Summaries, however, discard a great deal of information. In any situation, it is helpful to know the quality of our summary. In the case above, we may wonder how far our sample mean is likely to be from the true population mean (the mean of all numbers drawn from the theoretical statistical population). Our sample mean, after all, was calculated from only 10 observations.

We may establish some idea of how far off the sample mean may be from the population mean by calculating the standard error of the sample mean, which is the standard deviation divided by the square root of the sample size. In MATLAB:

>> StandardError = std(X) / sqrt(length(X))

StandardError =

1.0858


Note that there are fancier versions of this calculation, for example for cases in which the population size is finite, or when the standard error of a proportion is being calculated. The standard error gives us an estimate of how far away our sample mean might be from the true population mean, and acts like a z-score: the population mean lies within 2 times the standard error of the sample mean about 95% of the time. In the case of our little data set, this would be from 4.5285 (= 6.7000 - 2 * 1.0858) to 8.8715 (= 6.7000 + 2 * 1.0858).
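In MATLAB, this band can be calculated directly from the quantities above (Band is just an illustrative variable name):

>> Band = mean(X) + [-2 2] * StandardError

Band =

4.5285 8.8715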

Note that as the number of observations grows, the denominator of the standard error fraction becomes larger and the standard error shrinks. This seems natural enough: with more data, our confidence in our statistic increases.
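A quick, hypothetical illustration of this effect (the samples here are fabricated with randn purely for demonstration):

SmallSample = randn(100,1); % 100 standard normal observations
LargeSample = randn(400,1); % 400 standard normal observations
std(SmallSample) / sqrt(100) % roughly 0.10
std(LargeSample) / sqrt(400) % roughly 0.05, since sqrt(4 * n) = 2 * sqrt(n)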


Complications of the Problem Statement

So far, so good: we may have had to look up the standard error formula in a book, but we have established some sense of how certain our summary is. What if we didn't have such a reference, though? The median, for example, has no such simple formula to establish its certainty. (Actually, I believe there is a formula for the median, but it is a real bear!) In any case, there are certainly other measures we may calculate (even ones which we invent on the spot) for which there are no handy standard error formulas. What to do?


An Alternative Solution: The Bootstrap

Just as we are about to throw up our hands and consider another career, the bootstrap appears. The basic method of the bootstrap is simple: draw many samples with replacement from the original sample ("replicates"), and tabulate the summary statistic calculated on each of those replicate samples. The distribution of those replicated summaries is intended to mimic the distribution being parameterized by the standard error of the mean.

Above, I mentioned that the population mean would be found inside the band from the sample mean minus two times the standard error to the sample mean plus two times the standard error about 95% of the time. The equivalent area in our bootstrap process would be between the 2.5 and 97.5 percentiles of our replicate summaries. We use 2.5 and 97.5 because that leaves a total of 5% outside of the range, half on each end of the spectrum.

An example using the median will illustrate this process. For reference, let's calculate the sample median first:

>> median(X)

ans =

8


Drawing a single sample with replacement can be done in MATLAB by indexing using random integers:

>> RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1)))

RandomSampleWithReplacement =

5
8
1
1
10
10
9
9
5
1


This is our first bootstrap replicate. Now, we calculate our summary on this replicate:

>> median(RandomSampleWithReplacement)

ans =

6.5000


Discerning the distribution, though, will require many more replicates. Since the computer is doing all of the work, I generally like to run at least 2000 replicates to give the bootstrap distribution a chance to take shape:

rand('twister',1242)  % Seed the random number generator for repeatability
T = NaN(2000,1);      % Allocate space for the replicated summaries
for i = 1:2000        % The machine's doing the work, so why not?
    RandomSampleWithReplacement = X(ceil(length(X) * rand(length(X),1)));  % Draw a sample with replacement
    T(i) = median(RandomSampleWithReplacement);  % Calculate the replicated summary
end
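(As an aside, I believe the Statistics Toolbox, which I use below for percentiles, offers a bootstrp function that collapses this loop into a single call; a sketch, assuming that toolbox is available:

T = bootstrp(2000, @median, X); % 2000 bootstrap replicates of the median

Note, though, that bootstrp does its own random sampling, so its replicates will not match the seeded loop above exactly.)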



Now, estimating where the "real" median (the population median) is likely to be is a simple matter of checking percentiles in our replicated summaries. I have the Statistics Toolbox, so I will cheat by using a function from there:

>> prctile(T,[2.5 97.5])

ans =

3.5000 10.0000


So, our population median is likely to lie between 3.5 and 10. That is a pretty wide range, but this is the consequence of having so little data.
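For readers without the Statistics Toolbox, a rough equivalent can be improvised by sorting the replicates. A minimal sketch (it skips the interpolation between order statistics which prctile performs, so the results may differ slightly):

SortedT = sort(T); % Order the replicated summaries
n = length(SortedT);
LowerBound = SortedT(max(1, round(0.025 * n))) % Near the 2.5th percentile
UpperBound = SortedT(min(n, round(0.975 * n))) % Near the 97.5th percentile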


Wrap-Up

The fundamental trade-off of the bootstrap is that one forsakes pat statistical formulas in favor of strenuous computation. In summary:

Good:

-The bootstrap solves many problems not amenable to conventional methods.

-Even in cases where conventional solutions exist, the bootstrap requires no memorization or selection of the correct formula for a given situation.

Bad:
-The bootstrap requires considerable numerical computation. Of course, in an era of cheap and powerful computing machinery, this is much less of an issue. Still, if there are many of these to perform...

-The bootstrap presented in this article, the percentile bootstrap, is known to deviate from theoretically correct answers, though generally only in a small way. There are more sophisticated bootstrap procedures which address some of these concerns.

This process, owing to its very general nature, can be applied to tasks much more complex than estimating the uncertainty of statistical summaries, such as hypothesis testing and predictive model performance evaluation.


Further Reading

An Introduction to the Bootstrap by Efron and Tibshirani (ISBN 0-412-04231-2)
This is the seminal work in this field. It covers a lot of ground, but is a bit mathematical.

4 comments:

Anonymous said...

I've taken some statistics in my day. I've heard of, but never used, the bootstrap approach. Very interesting, and amenable to an Excel solution.

I put your data into cells B10:K10, and named this range 'Data'. In B11:K11 I put the formula
=INDEX(Data,INT(RAND()*10+1))
which essentially samples with replacement from Data. In M10:P10 I inserted formulas for the mean, stdev, stderr, and median of Data, and copied these formulas to row 11 to compute the same for my first replicate. Finally I copied my first replicate to fill down to row 2010, giving me 2000 replicates.

Every time I press F9 I get a new set of replicates. I found the 2.5th percentile of the median to range between 3 and 4 (mostly 3.5), the 50th percentile always 8, and the 97.5th percentile always 10.

Would this work to estimate the population mean as well? I found the 2.5th, 50th, and 97.5th percentiles to be "around" 4.5, 6.7 or 6.8, and around 8.5.

My 240KB workbook is posted at
http://peltiertech.com/images/2009-03/Bootstrap.zip

Anonymous said...

P.S. Thanks for reading and commenting on my blog.

Will Dwinnell said...

Would this work to estimate the population mean as well?

When calculating the bootstrap, the replicate interval (percentile 2.5 to percentile 97.5, for instance) is intended to bracket the population summary. I believe that in most cases, the original sample summary itself is the best point estimate of the population summary.

Dean Abbott said...

One reason I like the bootstrap is that it is easy to program even when your tool doesn't do it directly. Jon gives an Excel example, and I've done the comparable thing in Clementine and Insightful Miner in the past (in those cases, it is especially easy to "select" the samples through an inner join, and then select the "out of sample" data via an anti-join).