Why use n-1 for the sample? — She asked

Simple visual demonstration to clear this notorious doubt!

Atul Sharma
Towards Data Science

Photo by OSPAN ALI on Unsplash

Before directly jumping to the topic, let us see the symbols that are used for population statistics and sample statistics respectively:

I will be referring to the same information from my previous blog for consistent interpretation:

Proceeding with the same data set of heights of 1000 males on Planet X (Total Population):

(Image by author)

In the demonstration activity of my previous blog on the Central Limit Theorem, the estimation experts took samples from the population using their pre-decided sample sizes, and each gave 10 estimates of the population mean. We had 5 self-proclaimed estimation experts, each with a different choice of sample size:

(Image by author)

Let’s take the example of the first estimation expert, who chose a sample size of two. With his first sample, he estimated the mean to be 160 cm, which is the mean of the two values he drew: 155 cm and 165 cm.

(Image by author)
(Image by author)

Notice that the estimation expert never knows the population mean during his estimation exercise. His best guess of the population mean is the mean of the two observations he drew from the population. We can see the gap because we know the population distribution and the population mean, but for him the most plausible position of the population mean is the mean of the collected observations. We also know that the sample mean is the point around which the sum of squared deviations, and hence the variance measure, is at its minimum. So when we write the variance formula with N in the denominator (as in population statistics), notice what goes wrong when we try to estimate it from sample data:

(Image by author)

The big green bars mark feasible positions of the population mean (for our data, the big red bar marks its actual position). The point of the above representation is that the population mean could be anywhere, and the variance measured around it can never be smaller than the variance the estimation expert measures around the mean of his sample data. It is strictly larger whenever the two means differ.

Notice how our actual variance has been under-estimated by the current formula.
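The claim that any other centre gives a larger spread can be written as a short identity. The notation here is an assumption on my part (the usual convention): x̄ is the sample mean, μ is the population mean, and n is the sample size.

```latex
\sum_{i=1}^{n} (x_i - \mu)^2
  = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n\,(\bar{x} - \mu)^2
```

The last term can never be negative, so the sum of squared deviations around the sample mean is the smallest possible, and it falls short of the sum around the population mean by exactly n(x̄ − μ)².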

To clarify further, the estimation expert gives his best estimate in the form of the sample mean, which, to repeat, is only one of the feasible positions of the population mean. Here the population mean (170) lies to the right of the estimated mean (160), but with some other data set it could be anywhere. Assuming that the population mean sits at the sample mean deflates the variance measure, because the variance is lowest around the sample mean. This is why we need to inflate the variance measure by changing the denominator, replacing n with n-1, to compensate for the under-estimation.
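As a quick check with the numbers from this example (the sample 155 cm and 165 cm, and the population mean of 170 cm), here is a minimal Python sketch. The variable names are my own, chosen only for illustration.

```python
# The expert's first sample and the population mean used in this article.
sample = [155, 165]
population_mean = 170  # known to us, unknown to the expert

n = len(sample)
sample_mean = sum(sample) / n  # 160.0

# Squared deviations around the sample mean (all the expert can compute).
ss_around_sample_mean = sum((x - sample_mean) ** 2 for x in sample)

# Squared deviations around the true population mean (not available to him).
ss_around_population_mean = sum((x - population_mean) ** 2 for x in sample)

var_n = ss_around_sample_mean / n                # 25.0
var_n_minus_1 = ss_around_sample_mean / (n - 1)  # 50.0
var_true_center = ss_around_population_mean / n  # 125.0

print(var_n, var_n_minus_1, var_true_center)
```

Dividing by n-1 moves the estimate from 25 to 50, in the right direction. A single sample of size two still cannot be expected to land on the spread measured around the true centre.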

If you are thinking that even the modified formula does not estimate well, recall that estimates improve as the sample size increases. (In this example the sample size is only two; a larger sample will eventually show up as better accuracy.)

*This concept is also demonstrated in detail in the Central Limit Theorem blog.
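To see the under-estimation on average rather than in one sample, here is a small simulation sketch. The population is a hypothetical stand-in, not the actual Planet X data set: 1000 values drawn from a normal distribution centred at 170 cm, with a spread of 10 cm that I have assumed. Samples are drawn with replacement.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the Planet X data: 1000 heights centred near 170 cm.
population = [random.gauss(170, 10) for _ in range(1000)]
pop_mean = sum(population) / len(population)
population_variance = sum((x - pop_mean) ** 2 for x in population) / len(population)


def average_estimates(sample_size, trials=20000):
    """Average the n and n-1 variance estimates over many random samples."""
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        # Draw with replacement so every draw comes from the same population.
        sample = random.choices(population, k=sample_size)
        sample_mean = sum(sample) / sample_size
        ss = sum((x - sample_mean) ** 2 for x in sample)
        total_n += ss / sample_size
        total_n_minus_1 += ss / (sample_size - 1)
    return total_n / trials, total_n_minus_1 / trials


print(f"population variance: {population_variance:.1f}")
for size in (2, 5, 30):
    avg_n, avg_n_minus_1 = average_estimates(size)
    print(f"n={size:>2}  divide by n: {avg_n:.1f}  divide by n-1: {avg_n_minus_1:.1f}")
```

On average, dividing by n recovers only (n-1)/n of the population variance: about half of it for a sample size of two. The gap shrinks as the sample size grows, while the n-1 version stays centred on the population variance.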

Now, why does the denominator have to be n-1? Why not n-2, n-3, or n-10?

When we take a sample from the population, its mean is fixed (a constant value):

Let’s take an arbitrary sample = 2,5,7,10,16 (5 data points)

Sample Mean = (2+5+7+10+16)/5 = 8

Now if I give someone 4 data points of this sample along with the sample mean (which is fixed), would he/she be able to solve for the 5th data point? The answer is yes!

This is the key point: in sample statistics we lose one degree of freedom because the (fixed) sample mean acts as a constraint. That’s why we deduct 1 from the sample size and use n-1.
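The constraint can be checked in a few lines of Python using the arbitrary sample above. This is only an illustrative sketch, and the variable names are my own.

```python
known_points = [2, 5, 7, 10]  # four of the five data points
sample_mean = 8               # the fixed sample mean
n = 5

# The mean fixes the total (n * mean = 40), so the last point is whatever is left.
missing_point = n * sample_mean - sum(known_points)
print(missing_point)  # 16

# Equivalently, the deviations from the sample mean must add up to zero,
# so only n-1 of them are free to vary.
deviations = [x - sample_mean for x in known_points + [missing_point]]
print(sum(deviations))  # 0
```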

This n-1 correction is originally known as Bessel’s correction and if you are interested in detailed derivation, please do refer to the below-mentioned link:

For population statistics we don’t have any such constraint, because the population mean is the actual mean of all the data points and does not depend on which points happen to be selected. That’s why N is used as the denominator for computations.
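Python's standard `statistics` module offers both denominators: `pvariance` divides by N and `variance` divides by n-1. Here is a short sketch applying them to the arbitrary sample from earlier, treated once as a whole population and once as a sample.

```python
import statistics

data = [2, 5, 7, 10, 16]  # squared deviations from the mean (8) add up to 114

biased = statistics.pvariance(data)   # divides by N:   114 / 5
unbiased = statistics.variance(data)  # divides by n-1: 114 / 4

print(biased)    # 22.8
print(unbiased)  # 28.5
```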

That’s it for this blog. I hope this simplified explanation has finally cleared the doubt about using n-1 instead of n in the denominator for sample statistics that would otherwise be underestimated. The bias comes from the estimation exercise itself: we measure spread around the sample mean because the population mean is unknown, and sampling remains the only realistic way of reaching a close approximation of population statistics. The conclusion is that Bessel's correction inflates the biased (underestimated) sample variance so that it becomes unbiased.

I will be covering the low-key yet very important concept of ‘degrees of freedom’ in upcoming blogs, since it is highly relevant to major interpretation challenges in statistics. Keep a watch!

Thanks!!!
