Population variance, sample variance and sample size

A story about estimators

What’s the relationship between population variance, sample variance and sample size? This question is pretty broad, and it opens up a number of interesting issues that have challenged my understanding of the relationship between sample size, variance and the bias of estimators.

Let’s start with a sample of $n$ observations, all generated by the same process. We assume they are iid Gaussian with some mean $\mu$ and variance $\sigma^2$ (they are $\sim N(\mu, \sigma^2)$).
Now, if we ask ourselves whether increasing the sample size ($n$) decreases the population variance, well, the answer is simple and neat: no! The population variance is a parameter, a deterministic number whose value does not depend on sample size.
We may ask ourselves whether adding more data decreases the value of the estimator of the population variance. We can work out the answer.
So, we want to get an idea of how much observations like those in our hands fluctuate around their mean (i.e., we want to estimate $\sigma^2$). We do this using an estimator for the variance which, as is usual in statistics, is built by replacing expectations with averages. Wait, what is an expectation? If you are already familiar with the concept of expectation, keep reading. Otherwise, have a look at appendix 1 at the bottom of the post. Why do we need to replace expectations with averages to get the variance? Because the true, unknown population variance is computed as the expectation of the squared differences between a random variable and its expected value. Something like the following:

$$E\left[\left(X - E[X]\right)^2\right]$$

where $X$ is a random variable. For discrete random variables, this translates to a sum over all possible values: the squared difference between each value and the mean, weighted by the probability of that value occurring (see appendix 1 for continuous random variables).
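As a quick worked example, assume the discrete random variable is a fair six-sided die; the probability-weighted sum above can then be computed directly (a minimal numpy sketch):

```python
import numpy as np

# Fair six-sided die: possible values and their probabilities (assumed example).
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

mean = np.sum(values * probs)                    # E[X] = 3.5
variance = np.sum((values - mean) ** 2 * probs)  # E[(X - E[X])^2] = 35/12 ≈ 2.92

print(mean, variance)
```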
Back to our variance estimator: we take the formula for the true, unknown variance, replace expectations with averages, and we are done. We have something like:

$$\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

where $\bar{X}$ is the average (an estimator of the expected value of the random variable $X$). A side note about why we square the differences between observations and their average: summing up negative and positive values would not be a good idea, as they offset each other. Taking squares of the distances not only solves the issue, but also has the side effect of giving more weight to large differences. We have our estimator of the variance! We compute $\frac{1}{n}\sum(X_i - \bar{X})^2$, and now we are curious to know what happens if we average over an increasing number of observations. Will the value of our estimator continuously decrease as we add more and more observations? The answer is still no! In the context of estimating $\sigma^2$, the idea that adding more observations leads to a lower variance is mistaken. Then, what happens to our estimator if we add more observations?
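Before doing any math, here is a minimal simulation sketch (numpy, assuming $\mu = 0$ and $\sigma^2 = 4$) that evaluates the estimator on growing prefixes of a single simulated sample; its value wanders around $\sigma^2$ rather than shrinking:

```python
import numpy as np

rng = np.random.default_rng(42)    # arbitrary seed
mu, sigma2 = 0.0, 4.0              # assumed population parameters
x = rng.normal(mu, np.sqrt(sigma2), size=10_000)

# s^2 = (1/n) * sum((X_i - Xbar)^2), computed on growing prefixes of the sample.
for n in (10, 100, 1_000, 10_000):
    s2 = np.var(x[:n], ddof=0)     # ddof=0 gives the 1/n version
    print(n, round(s2, 3))
```

Different seeds give different wiggles, but there is no systematic downward trend as n grows.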
To answer this question we need to play a bit with assumptions, distributions and expectations.
First, let’s call our variance estimator $s^2$. Following Cochran’s theorem:

$$\frac{n\,s^2}{\sigma^2} \sim \chi^2_{n-1}$$

What does this mean? Let’s focus on the LHS of the equation first. The two $n$'s (one multiplies $s^2$, the other is hidden in the $\frac{1}{n}$ that we use to average the sum of squared differences between observations and their average) cancel out. So, we are left with something similar to a sum of $Z^2$ random variables (remember we assumed a Gaussian generating process):

$$\frac{1}{\sigma^2}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 = \sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{\sigma}\right)^2$$

I just said “something similar to $Z^2$” because $\mu$ is replaced by an average. Nonetheless, Cochran’s theorem tells us that this sum of weird $Z^2$'s is still $\chi^2$ distributed, but with a number of degrees of freedom equal to $n-1$ instead of $n$. The $\chi^2$ distribution is nice, as its expected value is equal to its degrees of freedom and its variance is twice its degrees of freedom. Why did we go through all this? Patience, we are almost there! Now, let’s take the expectation of both sides of the above equation:

$$E\left(\frac{n\,s^2}{\sigma^2}\right) = E\left(\chi^2_{n-1}\right)$$

There’s a rule that says that constants come out of expectations as they are (recall that $\sigma^2$ is a fixed, deterministic number, so essentially a constant). So, we can re-write this last equation as:

$$\frac{n}{\sigma^2}\,E\left(s^2\right) = E\left(\chi^2_{n-1}\right)$$

Let’s move $\frac{n}{\sigma^2}$ to the RHS and take the expectation of $\chi^2_{n-1}$, which is $n-1$. We are left with:

$$E\left(s^2\right) = \frac{n-1}{n}\,\sigma^2$$

Let’s ask ourselves again what happens if we estimate $\sigma^2$ with an increasing number of observations. As $n$ increases, the ratio $\frac{n-1}{n}$ goes to 1 and the expected value of our estimator $s^2$ converges to the parameter! This means that, as we get more data, the estimator fluctuates around a value that gets closer and closer to the true, but unknown, variance (see appendix 2 at the bottom of this page to see how to compute the bias of $s^2$).
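As a minimal Monte Carlo check of this result (assuming $\sigma^2 = 4$), the average of $s^2$ over many simulated samples should sit close to $\frac{n-1}{n}\sigma^2$ and approach $\sigma^2$ as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, reps = 4.0, 100_000        # assumed variance; number of simulated samples

for n in (5, 20, 100):
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    s2 = samples.var(axis=1, ddof=0)              # 1/n estimator, one per sample
    print(n, round(s2.mean(), 3), round((n - 1) / n * sigma2, 3))
```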

Now, let’s work out the variance of $s^2$. Yes, we are dealing with the variance of the estimator of the variance. No surprise here: an estimator is essentially a random variable and, as such, has its own variance (aka sampling variance). We can derive the variance of $s^2$ as we did for its expectation:

$$\mathrm{Var}\left(\frac{n\,s^2}{\sigma^2}\right) = \mathrm{Var}\left(\chi^2_{n-1}\right)$$

We apply another rule that says that constants come out of variances squared. Also, we use the fact that the variance of a $\chi^2$ distributed random variable is equal to 2 times its degrees of freedom. We are left with:

$$\frac{n^2}{\left(\sigma^2\right)^2}\,\mathrm{Var}\left(s^2\right) = 2(n-1)$$

So, the variance of $s^2$ is:

$$\mathrm{Var}\left(s^2\right) = \frac{2(n-1)}{n^2}\,\sigma^4$$

In the end, we found something whose value decreases as $n$ increases! Indeed, as we add more observations, the variance of $s^2$ decreases! This makes intuitive sense: it means that we are able to estimate $\sigma^2$ with increasing precision. The more data we get, the more precisely we can estimate the population variance (and the less biased our estimator is, see above)!
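A companion sketch (same assumed $\sigma^2 = 4$) compares the empirical variance of $s^2$ across many simulated samples with $\frac{2(n-1)}{n^2}\sigma^4$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, reps = 4.0, 100_000        # assumed variance; number of simulated samples

for n in (5, 20, 100):
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    s2 = samples.var(axis=1, ddof=0)              # 1/n estimator, one per sample
    theory = 2 * (n - 1) / n**2 * sigma2**2       # 2(n-1)/n^2 * sigma^4
    print(n, round(s2.var(), 4), round(theory, 4))
```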

To sum up: when $n$ increases, the expected value of the estimator converges to the true parameter (so the estimator itself doesn’t systematically decrease), but the variance of the estimator does decrease!

Now that we are done, we can check what happens if we use another estimator of $\sigma^2$, one that promises to be unbiased ($s^2_{unb}$, where unb stands for unbiased)! We all know it, as we are used to relying on it:

$$\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

This formula takes into account that we had to estimate $\bar{X}$ first, so we are left with $n-1$ degrees of freedom for estimating the variance. Let’s derive the expected value and the variance of this estimator:

$$E\left(\frac{(n-1)\,s^2_{unb}}{\sigma^2}\right) = E\left(\chi^2_{n-1}\right)$$

Following what we did above:

$$E\left(s^2_{unb}\right) = \frac{n-1}{n-1}\,\sigma^2 = \sigma^2$$

We see that the expected value of this estimator does not depend on $n$: the estimator will always fluctuate around $\sigma^2$.
About its variance:

$$\mathrm{Var}\left(\frac{(n-1)\,s^2_{unb}}{\sigma^2}\right) = \mathrm{Var}\left(\chi^2_{n-1}\right)$$

Which leads to:

$$\mathrm{Var}\left(s^2_{unb}\right) = \frac{2(n-1)}{(n-1)^2}\,\sigma^4 = \frac{2}{n-1}\,\sigma^4$$

As for $s^2$, the variance of $s^2_{unb}$ decreases as $n$ increases.
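The same kind of Monte Carlo sketch (again assuming $\sigma^2 = 4$) can check both properties of $s^2_{unb}$ at once: its mean stays at $\sigma^2$ for every $n$, while its variance shrinks like $\frac{2}{n-1}\sigma^4$:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, reps = 4.0, 100_000        # assumed variance; number of simulated samples

for n in (5, 20, 100):
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    s2_unb = samples.var(axis=1, ddof=1)          # 1/(n-1) estimator
    theory_var = 2 / (n - 1) * sigma2**2          # 2/(n-1) * sigma^4
    print(n, round(s2_unb.mean(), 3), round(s2_unb.var(), 3), round(theory_var, 3))
```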

To visually represent what I have been writing about so far, just look at the gif below. On the left, we see what happens to $s^2_{unb}$ as more and more data are involved in the computation of the estimator. The value of $s^2_{unb}$ always fluctuates around the true $\sigma^2$ (black solid line), but its variance (how much it fluctuates around the parameter) decreases as $n$ increases. The right panel shows how the standard error (the standard deviation of the sampling distribution) of averages decreases as $n$ increases. This is an old gif: at the time I was curious to see how the variance of averages would decrease with increasing sample size, but the idea applies equally to $\sigma^2$. Notice that, for a given $n$, the variance of averages can be computed as:

$$\mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n^2}\,\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \frac{1}{n^2}\,n\,\sigma^2 = \frac{\sigma^2}{n}$$

Here, we use the fact that, under independence, the variance of the sum of $n$ identically distributed random variables is equal to $n$ times their common variance. So, the variance of averages depends on $n$.
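A last quick sketch (same assumed $N(0, 4)$ setup) confirms that the variance of the sample mean behaves like $\frac{\sigma^2}{n}$:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, reps = 4.0, 100_000        # assumed variance; number of simulated samples

for n in (5, 20, 100):
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    means = samples.mean(axis=1)                  # one sample mean per simulated sample
    print(n, round(means.var(), 4), round(sigma2 / n, 4))
```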

Appendix 1 - Expectation (E)

An expectation is usually denoted by the letter E and represents the mean value that a random variable would take if we were to repeat an experiment an indefinitely large number of times (this sounds a lot like the frequentist approach). In practice we repeat an experiment a limited number of times, so, unfortunately, we will never observe the expectation exactly. For the moment, it is enough to know that the expectation of a discrete random variable is computed as the sum of the products of each value of the random variable and its probability of occurrence; replace the sum by an integral and the probability by a density and you get the expectation of a continuous random variable.
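As a sketch of the continuous case, assume a Gaussian with mean 2 and variance 4; approximating the integral of $x$ times the density on a fine grid recovers the mean, and the same trick with $(x - E[X])^2$ recovers the variance:

```python
import numpy as np

mu, sigma2 = 2.0, 4.0                                  # assumed parameters
x = np.linspace(mu - 10 * np.sqrt(sigma2), mu + 10 * np.sqrt(sigma2), 200_001)
dx = x[1] - x[0]

# Gaussian density evaluated on the grid.
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

expectation = np.sum(x * pdf) * dx                     # ≈ E[X] = 2
variance = np.sum((x - expectation) ** 2 * pdf) * dx   # ≈ E[(X - E[X])^2] = 4

print(round(expectation, 4), round(variance, 4))
```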

Appendix 2 - Derive the bias of $s^2$

The bias of an estimator is computed as:

$$\mathrm{Bias} = E\left(\hat{\theta} - \theta\right)$$

Basically, it is the expected difference between the estimator $\hat{\theta}$ and the parameter $\theta$. Imagine computing a large number of estimates (e.g., the values of many sample averages) and summing up the differences between these estimates and the parameter (note that this is possible only if we know the value of the parameter, as is the case in simulations). Here we are not interested in how much the estimator fluctuates around the parameter; we want to know how far, on average, the estimator is from the parameter. Hence, this time it makes sense to sum up positive and negative differences and let them offset.
Specifically for $s^2$, we have that:

$$\mathrm{Bias} = E\left(s^2 - \sigma^2\right) = E\left(s^2\right) - \sigma^2 = \frac{n-1}{n}\,\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}$$

We used another rule for expectations here: the expectation of a sum (or difference) is equal to the sum (or difference) of the expectations.
Deriving the bias of $s^2$ confirms what we saw above. In short, $s^2$ is downwardly biased, but its bias decreases (in absolute value) as $n$ increases.
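A final sketch (assuming $\sigma^2 = 4$ once more) estimates this bias empirically and compares it with $-\frac{\sigma^2}{n}$:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, reps = 4.0, 100_000        # assumed variance; number of simulated samples

for n in (5, 20, 100):
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    s2 = samples.var(axis=1, ddof=0)              # biased 1/n estimator
    print(n, round(s2.mean() - sigma2, 3), round(-sigma2 / n, 3))
```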

By learning to do the easy things the hard way, the hard things will become easy (Royale & Dorazio).