Biased and Anti-Biased Variance Estimates

Suppose S is a set of numbers whose mean value is X, and suppose 
x is an element of S.  We wish to define the "variance" of x with 
respect to S as a measure of the degree to which x differs from the
mean X.  It turns out to be most useful to define the variance as
the square of the difference between x and X.  We'll denote this
by V(x|S) = (x-X)^2.  Furthermore, we define the variance of any 
subset of S as the average of the variances of its elements.  
Thus, given a subset s consisting of n numbers x1, x2, ..., xn 
drawn from a set S whose mean is X, the variance of s with 
respect to S is given by

                      1    n
          V(s|S)  =  ---  SUM (xi - X)^2                (1)
                      n   i=1

It's important to note that the value of X in this equation is
the mean of ALL of set S, not just the mean of the values of s.
If, for some reason, we don't know the true mean of S we might
try to apply formula (1) using an estimated mean based just on
the values in s.  Thus, if we define X' = (x1+x2+...+xn)/n, we
could use this value in place of X in equation (1) to estimate
the variance of s.  However, this would result in a biased
estimate, because X' is biased toward the elements of s.  Thus
each difference (xi-X') is slightly self-referential, tending 
to underestimate the true variance of xi with respect to the 
full set S.
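
In fact, the underestimate is easy to see directly:  the sample 
mean X' is exactly the value that minimizes the sum of squared 
differences, so replacing X with X' in equation (1) can only 
shrink the result.  Here is a minimal Python sketch illustrating 
this (the sample values and offsets are arbitrary choices):

    # The sample mean X' minimizes SUM (x_i - c)^2 over all c, so
    # using X' in place of the true mean X can only reduce the
    # computed variance.
    s = [2.0, 3.5, 1.0, 4.2, 2.8]        # arbitrary sample values
    n = len(s)
    Xp = sum(s) / n                      # X', the sample mean
    for c in (Xp - 0.5, Xp, Xp + 0.5):   # X' and two nearby values
        print(c, sum((x - c) ** 2 for x in s) / n)
    # The middle line (c = X') gives the smallest of the three.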

What if we try to eliminate the bias by simply removing x_i from 
X'?  In other words, let's define X'\i as the average of the n-1 
measurements excluding x_i.  At first we might think this would 
lead to an unbiased estimate of the variance, but that's not 
right, because by specifically *excluding* the measurement x_i 
from the mean when evaluating each term (x_i - X'\i)^2 we are
effectively creating an ANTI-biased formula, tending to OVER- 
estimate the variance.  What we need is something in between 
the biased and anti-biased estimates.
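
Both effects are easy to exhibit numerically.  In the following 
Python sketch (the seed, sample size, and trial count are 
arbitrary choices) the samples come from a population with 
variance 1, so the estimate based on the self-referential mean 
X' averages about (n-1)/n of the true variance, while the 
leave-one-out estimate averages about n/(n-1) of it:

    import random

    # Simulation: average the biased and anti-biased estimates
    # over many samples of n values drawn from a population whose
    # true variance is 1.
    random.seed(1)
    n, trials = 5, 20000
    v_biased = v_anti = 0.0
    for _ in range(trials):
        s = [random.gauss(0, 1) for _ in range(n)]
        Xp = sum(s) / n
        v_biased += sum((x - Xp) ** 2 for x in s) / n
        # X'\i is the mean of the other n-1 values.
        v_anti += sum((x - (sum(s) - x) / (n - 1)) ** 2 for x in s) / n

    print(v_biased / trials)   # about (n-1)/n = 0.8, biased low
    print(v_anti / trials)     # about n/(n-1) = 1.25, biased high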

If we define the ordinary (biased) variance of s with respect to
S as
          V = (1/n) SUM (x_i - X')^2                 (2)

and the "anti-biased" mean variance as

         V* = (1/n) SUM (x_i - X'\i)^2               (3)

then, since x_i - X'\i = [n/(n-1)] (x_i - X') for each i, it's 
easy to see that

            (n-1)^2 V*  =  n^2 V                     (4)

and so we have

           ___      1    n
          /V*V  =  ---  SUM (xi - X')^2                (5)
                   n-1  i=1

Thus, the estimates V and V* are (sort of) duals of each other, 
and their geometric mean gives equation (5), which we recognize
as the unbiased variance estimate of the underlying set S based 
on s.  In fact, if we could think of some good simple reason why 
the unbiased estimate MUST equal sqrt(V*V), this would constitute
a simple derivation of the unbiased estimate.
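
Note that equations (4) and (5) are algebraic identities, holding 
for every particular sample rather than just on average, so they 
can be checked directly on arbitrary numbers, as in this minimal 
Python sketch:

    import math

    s = [2.0, 3.5, 1.0, 4.2, 2.8]   # any numbers will do
    n = len(s)
    Xp = sum(s) / n
    V = sum((x - Xp) ** 2 for x in s) / n                       # eq (2)
    Vs = sum((x - (sum(s) - x) / (n - 1)) ** 2 for x in s) / n  # eq (3)

    print((n - 1) ** 2 * Vs, n ** 2 * V)            # equal, per eq (4)
    print(math.sqrt(V * Vs))                        # geometric mean
    print(sum((x - Xp) ** 2 for x in s) / (n - 1))  # eq (5): same value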

Of course, the point of the "unbiased estimate" is this:  if we 
draw a sample of n items from an unknown population and compute 
the variance of that sample using equation (2), then take another 
sample of n and compute its variance, and so on, the mean of all 
these variances approaches not V(S|S) but rather [(n-1)/n] V(S|S).
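
This experiment is easy to simulate.  In the following Python 
sketch (the seed, population, and sample sizes are arbitrary 
choices) many samples of n items are drawn from a fixed set S, 
and the average of their equation-(2) variances settles near 
[(n-1)/n] V(S|S) rather than V(S|S):

    import random

    random.seed(3)
    S = [random.gauss(0, 1) for _ in range(200000)]  # the full set S
    X = sum(S) / len(S)
    VSS = sum((x - X) ** 2 for x in S) / len(S)      # V(S|S)

    n, trials = 4, 50000
    acc = 0.0
    for _ in range(trials):
        s = [random.choice(S) for _ in range(n)]     # a sample of n items
        Xp = sum(s) / n
        acc += sum((x - Xp) ** 2 for x in s) / n     # equation (2)

    print(acc / trials, (n - 1) / n * VSS)           # nearly equal
    print(VSS)                                       # noticeably larger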

As an example, recall that for samples of size n drawn from a 
normal distribution with variance sigma^2, the quantity 
SUM(x_i - X')^2 / sigma^2 has a "chi-square" distribution with 
n-1 degrees of freedom, whose expected value is n-1.  Thus, in 
order to have a measure of variance that converges precisely on 
sigma^2 for a normal distribution, we have to divide 
SUM(x_i - X')^2 by n-1 instead of n.  In other words, we have to 
use the unbiased estimate given by equation (5).
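
The degrees-of-freedom count can be checked the same way; in this 
Python sketch (again with arbitrary seed and parameters) the 
average of SUM(x_i - X')^2 / sigma^2 over many normal samples 
comes out near n-1 rather than n:

    import random

    random.seed(4)
    n, sigma, trials = 6, 2.0, 50000
    acc = 0.0
    for _ in range(trials):
        s = [random.gauss(0, sigma) for _ in range(n)]
        Xp = sum(s) / n
        acc += sum((x - Xp) ** 2 for x in s) / sigma ** 2

    print(acc / trials)   # close to n-1 = 5, the chi-square mean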

Incidentally, another way of expressing the unbiased variance
estimate is to use a "weighted" mean X"\i defined as

                      x1 + x2 + ... kxi + ... + xn
            X"\i  =   ----------------------------
                                (n-1) + k

where k is the "weight" assigned to x_i (note that X"\i is an 
unbiased estimate of the mean X for any choice of k).  If we 
substitute X"\i in place of X'\i in equation (3), the result will 
equal the unbiased estimate if and only if


  / 1  n            1            \  k^2 + 2(n-1)k - (n-1)
 (  - SUM xi^2 - ------ SUM xi xj ) ---------------------  =  0
  \ n i=1        n(n-1) i<>j     /        (n-1 + k)^2


which implies that the correct "weight" for any given n is
                             ________
          k_n  =  -(n-1) +- / n(n-1)
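
With this weight the agreement is exact for every particular 
sample, not merely on average, so it too can be checked on 
arbitrary data, as in this short Python sketch:

    import math

    s = [2.0, 3.5, 1.0, 4.2, 2.8]           # arbitrary values
    n = len(s)
    total = sum(s)
    k = -(n - 1) + math.sqrt(n * (n - 1))   # the weight k_n (upper sign)

    # X"\i : mean with x_i retained at weight k, per the definition above
    V_weighted = sum((x - (total - x + k * x) / (n - 1 + k)) ** 2
                     for x in s) / n
    V_unbiased = sum((x - total / n) ** 2 for x in s) / (n - 1)  # eq (5)
    print(V_weighted, V_unbiased)           # equal, up to rounding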

Also, the left-hand factor shows that the estimate is unbiased for
ANY weight k if the values of the n numbers are such that

             n              /  n    \2
          n SUM  xi^2   =  (  SUM xi )
            i=1             \ i=1   /

Given any set of x values (which may be complex) we can create a 
"null-bias" set just by adding one more number, since the above 
condition is a quadratic in the added value.  We can also add 
more numbers while maintaining the null-bias condition, but from 
the 2nd added number onward the additions are all identical, 
because once a set satisfies the condition the quadratic for the 
next value has a double root at the mean of the current set, and 
adding the mean leaves the mean unchanged.  For example, given 
the set {0,1} we can add (1+-sqrt(-3))/2 to create a null-bias 
set with three elements, and then (taking the + sign) we can add 
(1/2 + sqrt(-3)/6), the mean of those three elements, to create 
a null-bias set with four elements.  Thereafter we can only 
increase the number of elements by adding duplicates of this 
4th element.
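
These sets can be verified directly with complex arithmetic; the 
following Python sketch checks the null-bias condition for the 
three-, four-, and five-element sets above:

    import cmath

    r3 = cmath.sqrt(-3)
    sets = [
        [0, 1, (1 + r3) / 2],
        [0, 1, (1 + r3) / 2, 1/2 + r3/6],
        [0, 1, (1 + r3) / 2, 1/2 + r3/6, 1/2 + r3/6],  # duplicate 4th
    ]
    for s in sets:
        n = len(s)
        # null-bias condition: n * SUM x_i^2 = (SUM x_i)^2
        print(abs(n * sum(x ** 2 for x in s) - sum(s) ** 2) < 1e-12)
    # prints True for each set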
