Problems that Bug Me: The n-1 Correction for Sample Variance

Let me start with the easy example everyone sees first in stats: we want to know how variable height is in the population of humans. We can't measure the height of every human, so we gather up the few around us, and here's what we get.

Janet: 5ft
Sam: 6ft
Shaq: 7ft

The average height of the three is: \frac{5+6+7}{3}=6.

The variance of height is: \frac{(5-6)^2+(6-6)^2+(7-6)^2}{(3-1)}=1

So the statistician says we now have an estimate for average height and variance of heights of the population of humans: 6ft tall with a variance of 1.

Here’s a twist, though. If there were only three humans in existence, the variance would be different: \frac{(5-6)^2+(6-6)^2+(7-6)^2}{3}=\frac{2}{3}.
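
The two divisors are easy to check numerically. Here's a minimal sketch in Python with NumPy (my own choice of tooling, not something in the original example) that computes both versions for the three heights:

```python
import numpy as np

heights = np.array([5.0, 6.0, 7.0])  # Janet, Sam, Shaq, in feet

# Treat the three as a sample: divide the squared deviations by n - 1
sample_var = heights.var(ddof=1)       # ((-1)^2 + 0^2 + 1^2) / 2 = 1.0

# Treat the three as the entire population: divide by n
population_var = heights.var(ddof=0)   # ((-1)^2 + 0^2 + 1^2) / 3 = 0.666...

print(sample_var, population_var)      # 1.0 0.6666666666666666
```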

So what’s the difference? The difference is that if Janet, Sam and Shaq are a sample of the population, we divide the sum of squared differences from the mean by n-1=2. If they are the entire population, we divide it by n=3. This makes the sample variance larger than the population variance. Why is that?

This is a question that annoyed me for some time. The key, though, is to not focus on the variance calculation. The key is the mean.

You see, when you pull out a sample of a population you aren’t measuring the true average height. Your estimate of the mean is going to be off, and it’s going to be different each time you draw a sample from the population. In other words, the mean itself varies from sample to sample, and that extra variation has to show up in the variance estimate.

So the point of the n-1 correction is to increase the variance estimate to allow for the fact that your measurement of the mean varies with each possible sample.

If the sample gets big enough, dividing by n-1 isn’t much different from dividing by n. Imagine the difference between dividing by 100,000 and dividing by 100,001. So the correction becomes negligible, because with that much data the mean we measure is almost certainly very close to the true one.
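
At a small sample size, though, the correction matters a lot. Here's a small simulation sketch (Python/NumPy again, with parameters I made up) that draws many samples of size 5 from a population whose true variance is 4 and averages the two estimates:

```python
import numpy as np

# Illustrative setup (my own values): a Normal(6, 2) population, so true variance = 4
rng = np.random.default_rng(0)
n, trials = 5, 100_000

samples = rng.normal(6.0, 2.0, size=(trials, n))
sample_means = samples.mean(axis=1, keepdims=True)
ss = ((samples - sample_means) ** 2).sum(axis=1)  # squared deviations from each sample's own mean

print(ss.mean() / n)        # ~3.2: dividing by n underestimates the true variance of 4
print(ss.mean() / (n - 1))  # ~4.0: the n-1 correction removes the bias
```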

Here’s the math:

Let’s start with the sum of squared errors, which is this: \sum\limits_{j=1}^k(Y_j-\bar{Y})^2. Here the Y_j are the k observations in the sample, \bar{Y} is the sample mean, \mu is the true population mean, and \sigma^2 is the true population variance. What we’re going to find is that the expected value of this sum is (k-1)\sigma^2,

which is the same thing as saying that E[\frac{1}{(k-1)}\sum\limits_{j=1}^k(Y_j-\bar{Y})^2] = \sigma^2.

\sum\limits_{j=1}^k(Y_j-\bar{Y})^2 = \sum\limits_{j=1}^k[(Y_j-\mu) + (\mu-\bar{Y})]^2. We start with one of the oldest tricks in the book: adding and subtracting the same quantity, here the true mean \mu.

= \sum\limits_{j=1}^k[(Y_j-\mu)^2 + 2(Y_j-\mu)(\mu-\bar{Y})+ (\mu-\bar{Y})^2]. Expand the square.

= \sum\limits_{j=1}^k(Y_j-\mu)^2 + 2(\mu-\bar{Y})\sum\limits_{j=1}^k(Y_j-\mu)+\sum\limits_{j=1}^k(\bar{Y}-\mu)^2. Split up the summations.

= \sum\limits_{j=1}^k(Y_j-\mu)^2 + 2(\mu-\bar{Y})(k\bar{Y}-k\mu)+k(\bar{Y}-\mu)^2. Recognize that \sum\limits_{j=1}^k(Y_j-\mu) = k\bar{Y}-k\mu (the observations sum to k times their mean) and that summing a term with no j in it just multiplies it by k.

\sum\limits_{j=1}^k(Y_j-\bar{Y})^2 = \sum\limits_{j=1}^k(Y_j-\mu)^2 -k(\bar{Y}-\mu)^2. Simplify a bit: the middle term equals -2k(\bar{Y}-\mu)^2, which combines with the +k(\bar{Y}-\mu)^2 to leave -k(\bar{Y}-\mu)^2. This is a pretty key step, actually, because now we see that the sum of squared errors (the left hand side, which I’ve restated here for clarity) is smaller than the sum of squared errors you’d get using the true mean \mu.
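
Before taking expectations, it's worth sanity-checking that identity numerically. Here's a quick sketch (Python/NumPy, with arbitrary made-up parameters) that draws one sample and confirms the two sides match:

```python
import numpy as np

# Quick numerical check of the identity above (illustrative values, not from the post)
rng = np.random.default_rng(1)
mu, sigma, k = 6.0, 2.0, 10
y = rng.normal(mu, sigma, size=k)
ybar = y.mean()

ss_sample_mean = ((y - ybar) ** 2).sum()   # left hand side
ss_true_mean = ((y - mu) ** 2).sum()       # first term on the right hand side

# The two sums differ by exactly k * (ybar - mu)^2, so the left side is always the smaller one
print(ss_sample_mean)
print(ss_true_mean - k * (ybar - mu) ** 2)  # same number, up to floating point
```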

E[\sum\limits_{j=1}^k(Y_j-\bar{Y})^2] = E[\sum\limits_{j=1}^k(Y_j-\mu)^2 -k(\bar{Y}-\mu)^2]. Take the expectations. Boy, don’t those look like variances?

= \sum\limits_{j=1}^kVar(Y_j) -kVar(\bar{Y}). Yep: because \mu is the true mean and E[\bar{Y}] = \mu, each of those expected squared deviations is exactly a variance.

= k\sigma^2 -k(\frac{\sigma^2}{k}) = (k-1)\sigma^2. The home stretch: each Var(Y_j) is \sigma^2, and the variance of the mean of k independent observations is \frac{\sigma^2}{k}.

E[\frac{1}{(k-1)}\sum\limits_{j=1}^k(Y_j-\bar{Y})^2] = \sigma^2. And we’re done: dividing the sum of squared errors by k-1 gives an unbiased estimate of \sigma^2.
