## Monday, June 16, 2008

### On Bias

A while back, I made a comment over at Bobvis to the effect that the word bias had a very specific mathematical meaning: it was the tendency to consistently over-estimate or under-estimate a parameter.

To illustrate this concept, let's return to the Poisson distribution from my last math post. You will recall that the Poisson pdf,

$$p(n|a) = \frac{a^n \exp(-a)}{n!} \tag{1}$$
gives the probability that n events occur within a given time period, when those events are known to occur at an average rate of a per period. In my last example, we spoke of the number of buses that would drive by a bus stop. We could also use the example of the number of photons received by a telescope, given that they arrive at average rate a. We also showed that the expected value of n was a, and that the variance of n was also a.
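As a quick numerical sanity check of those two facts, here is a minimal Python sketch that sums the Poisson formula directly. The rate `a = 3.5` and the truncation at 200 terms are arbitrary choices of mine, not values from the earlier post.

```python
import math

def poisson_pmf(n: int, a: float) -> float:
    """p(n|a) = a^n * exp(-a) / n!, evaluated in log space to avoid overflow."""
    return math.exp(n * math.log(a) - a - math.lgamma(n + 1))

a = 3.5          # hypothetical average rate
ns = range(200)  # truncate the infinite sum; the tail is negligible for this a

total = sum(poisson_pmf(n, a) for n in ns)                  # should be ~1
mean = sum(n * poisson_pmf(n, a) for n in ns)               # should be ~a
var = sum((n - mean) ** 2 * poisson_pmf(n, a) for n in ns)  # should be ~a

print(total, mean, var)
```

Both the mean and the variance should come out equal to `a`, up to floating-point error.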

But: what if we don't know the parameter a? Can we calculate it? Indeed, the straightforward calculation of the expected value of a is:

$$E[a] = \int a \, \frac{p(n|a)\,p(a)}{p(n)} \, da, \tag{2}$$

where

$$p(n) = \int p(n|a)\,p(a)\,da \tag{3}$$

However, calculating E[a] in this manner requires not only our Poisson distribution but also a priori knowledge of p(a), the distribution of a. In practice, we may not have this distribution.
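To see how much the answer leans on p(a), here is a rough numerical sketch of the two integrals. The exponential prior p(a) = exp(-a) is purely an assumption for illustration (the post does not specify any prior), and the grid bounds are arbitrary. For this particular prior the integrals can also be done by hand, giving E[a] = (n + 1)/2, which the code should reproduce.

```python
import math

def poisson_pmf(n: int, a: float) -> float:
    """p(n|a) = a^n * exp(-a) / n!, evaluated in log space."""
    return math.exp(n * math.log(a) - a - math.lgamma(n + 1))

def prior(a: float) -> float:
    """ASSUMED prior p(a) = exp(-a); chosen only to make the example concrete."""
    return math.exp(-a)

def posterior_mean(n: int, a_max: float = 60.0, steps: int = 60000) -> float:
    """Approximate E[a] given n with a midpoint-rule integral over a."""
    da = a_max / steps
    grid = [(k + 0.5) * da for k in range(steps)]  # midpoints avoid log(0)
    weights = [poisson_pmf(n, a) * prior(a) for a in grid]
    p_n = sum(weights) * da  # the normalizing integral p(n)
    return sum(a * w for a, w in zip(grid, weights)) * da / p_n

print(posterior_mean(5))  # closed form for this prior: (5 + 1) / 2
```

Swap in a different `prior` and the result changes, which is exactly the difficulty: without p(a), this route is closed.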

In such a situation, we can't calculate the expected value of a directly. But if we have empirical measurements, we can estimate the parameter a from the data itself. There are, in fact, a number of estimators, but in this case, the estimator of choice would be the Maximum Likelihood estimate, defined as:

$$\hat{a}_{ML} = \arg\max_a \, p(n|a) = \arg\max_a \, \frac{a^n \exp(-a)}{n!} \tag{4}$$

The expression $\arg\max_a$ means that we vary a until we maximize the value of the pdf, given that we have made a measurement n of some number of photons (let's say) striking our telescope.

The value of the pdf will take on its maximum value with respect to a when its derivative with respect to a is equal to zero; so

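$$\frac{d}{da}\, p(n|a) = \frac{d}{da} \left( \frac{a^n \exp(-a)}{n!} \right) = \frac{\left(n\,a^{n-1} - a^n\right)\exp(-a)}{n!} = 0 \tag{5}$$

$$\Rightarrow \; n\,\hat{a}_{ML}^{\,n-1} = \hat{a}_{ML}^{\,n} \tag{6}$$

$$\ln n + (n-1)\ln \hat{a}_{ML} = n \ln \hat{a}_{ML} \tag{7}$$

$$\ln \hat{a}_{ML} = \ln n \tag{8}$$

$$\hat{a}_{ML} = n \tag{9}$$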

In other words, I can estimate the rate of photons a by counting the photons. You might expect that this would be a very poor estimate, and you would be right; it's roughly the same as estimating the mean IQ of China from the IQ of the first Chinaman I meet.
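The single-measurement result can also be checked by brute force. The sketch below evaluates p(n|a) on a grid of candidate rates and picks the largest. The count `n = 7`, the grid range, and the step size are arbitrary choices for illustration, and grid search stands in for the calculus rather than replacing it.

```python
import math

def poisson_pmf(n: int, a: float) -> float:
    """p(n|a) = a^n * exp(-a) / n!, evaluated in log space."""
    return math.exp(n * math.log(a) - a - math.lgamma(n + 1))

def ml_estimate_grid(n: int, a_max: float = 20.0, step: float = 0.01) -> float:
    """Return the grid value of a (starting just above zero) that maximizes p(n|a)."""
    grid = [k * step for k in range(1, int(round(a_max / step)) + 1)]
    return max(grid, key=lambda a: poisson_pmf(n, a))

n = 7  # hypothetical photon count from one measurement
print(ml_estimate_grid(n))  # lands on (or within one grid step of) n
```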

But what if we have a collection of measurements that we will call $N = \{N_i,\ i = 1, \ldots, M\}$? In our example, this is like having M telescopes that each make a reading of the number of photons $N_i$ in the relevant time period. Because these measurements are independent, we can write:

$$\hat{a}_{ML} = \arg\max_a \, p(N|a) = \arg\max_a \prod p(N_i|a) = \arg\max_a \prod \frac{a^{N_i} \exp(-a)}{N_i!} \tag{10}$$

$$= \arg\max_a \; a^{\sum N_i} \exp(-Ma) \prod \frac{1}{N_i!} \tag{11}$$

Note: all sums (Σ) and products (Π) run over the M samples.

As above, the value of the pdf will take on its maximum value with respect to a when its derivative with respect to a is equal to zero; so

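$$\frac{d}{da} \left[ a^{\sum N_i} \exp(-Ma) \prod \frac{1}{N_i!} \right] \tag{12}$$

$$= \left[ \sum N_i \cdot a^{\sum N_i - 1} \exp(-Ma) - M \exp(-Ma)\, a^{\sum N_i} \right] \prod \frac{1}{N_i!} = 0 \tag{13}$$

$$\Rightarrow \; \sum N_i \cdot \hat{a}_{ML}^{\,\sum N_i - 1} = M \cdot \hat{a}_{ML}^{\,\sum N_i} \tag{14}$$

$$\text{Therefore, } \; \hat{a}_{ML} = \frac{1}{M} \sum N_i \tag{15}$$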

This conclusion may seem unremarkable: the maximum likelihood estimate for, say, the number of buses visiting my bus stop in an hour, or the number of photons collected by my telescope in an hour, is simply the average of the counts observed over a succession of hours.
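Here is a small sketch that checks the multi-sample result numerically. It maximizes the log of the likelihood over a grid (the logarithm is monotonic, so the maximizing a is unchanged) and compares the winner against the plain average. The six counts are made up for the example.

```python
import math

def log_likelihood(counts, a: float) -> float:
    """ln p(N|a) for independent Poisson counts: sum of N_i*ln(a) - a - ln(N_i!)."""
    return sum(n * math.log(a) - a - math.lgamma(n + 1) for n in counts)

def ml_estimate_grid(counts, a_max: float = 20.0, step: float = 0.001) -> float:
    """Return the grid value of a that maximizes the log-likelihood."""
    grid = [k * step for k in range(1, int(round(a_max / step)) + 1)]
    return max(grid, key=lambda a: log_likelihood(counts, a))

counts = [3, 7, 4, 6, 5, 2]  # hypothetical hourly counts from M = 6 readings
a_hat = ml_estimate_grid(counts)
sample_mean = sum(counts) / len(counts)

print(a_hat, sample_mean)  # the two should agree to within one grid step
```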

The question still to be answered is whether or not this is an unbiased estimate. Not a perfect estimate, mind you; there will of course be errors. The pertinent question is whether these errors will tend to accumulate in a particular way, predicting too many photons, on average, or too few.

To answer this question, we must calculate the expected value of the estimate:

$$E[\hat{a}_{ML}] = E\left[ \frac{\sum N_i}{M} \right] = \frac{1}{M} \sum E[N_i] = \frac{M \cdot a}{M} = a \tag{16}$$

Conveniently enough, the expected value of the ML estimate of a is ... a. We say then that the estimate is unbiased; in other words, while any single estimate will have an error associated with it, these errors will not, on average, tend either higher or lower in the long run. Put another way, the average of a sufficiently large number of estimates will not be wrong.
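A simulation makes the same point empirically. The sketch below repeats the M-telescope experiment many times and averages the resulting estimates. The true rate `a = 4.0`, `M = 10`, the trial count, and the seed are all arbitrary choices of mine, and the Poisson sampler is Knuth's textbook multiplication method, used here only to keep the example in the standard library.

```python
import math
import random

def poisson_sample(a: float, rng: random.Random) -> int:
    """Draw one Poisson(a) count using Knuth's multiplication method."""
    threshold = math.exp(-a)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def ml_estimate(counts) -> float:
    """The ML estimate derived above: the sample mean of the counts."""
    return sum(counts) / len(counts)

def mean_of_estimates(a: float, M: int, trials: int, seed: int = 0) -> float:
    """Average the ML estimate over many independent repetitions of the experiment."""
    rng = random.Random(seed)
    estimates = [
        ml_estimate([poisson_sample(a, rng) for _ in range(M)])
        for _ in range(trials)
    ]
    return sum(estimates) / trials

print(mean_of_estimates(a=4.0, M=10, trials=10000))  # should sit close to 4.0
```

Individual estimates scatter on both sides of the true rate, but their average shows no systematic lean in either direction.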

But how much can I expect my ML estimate to be wrong? We can gain some insight into this question by calculating the variance of the estimate. Since the Ni's can vary, our estimate will vary while the true value stays constant. What is the extent of this variation?

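$$\operatorname{var}(\hat{a}_{ML}) = E\left[ (\hat{a}_{ML} - a)^2 \right] = E\left[ \left( \frac{\sum N_i}{M} - \frac{\sum a}{M} \right)^2 \right] \tag{17}$$

$$= E\left[ \left( \frac{1}{M} \sum (N_i - a) \right)^2 \right] = E\left[ \frac{1}{M^2} \left( \sum (N_i - a) \right)^2 \right] \tag{18}$$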

The next step requires some explanation. We have a square-of-sums term that we wish to convert to a sum of squares.

Normally this would generate all kinds of cross terms; however, because our samples are independent, we know that E[(Ni - a)(Nj - a)] = 0 whenever i ≠ j, so we can write:

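$$\operatorname{var}(\hat{a}_{ML}) = \frac{1}{M^2} \, E\left[ \sum (N_i - a)^2 \right] = \frac{1}{M^2} \sum E\left[ (N_i - a)^2 \right] \tag{19}$$

$$= \frac{1}{M^2} \sum \operatorname{var}(N_i) = \frac{M a}{M^2} = \frac{a}{M} \tag{20}$$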

Here is the significance of what we've just done: we've shown that as the number of our observations goes up, the variance associated with our estimate goes down. So the larger the sample size, the closer I will usually be to correctly estimating the parameter I am trying to find.
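The a/M behavior is easy to see in simulation. This sketch measures the mean squared error of the estimate about the true rate for a few sample sizes and prints it beside a/M. The rate `a = 4.0`, the sample sizes, the trial count, and the seed are arbitrary, and the sampler is again Knuth's standard method rather than anything from the derivation.

```python
import math
import random

def poisson_sample(a: float, rng: random.Random) -> int:
    """Draw one Poisson(a) count using Knuth's multiplication method."""
    threshold = math.exp(-a)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def estimator_variance(a: float, M: int, trials: int, seed: int = 0) -> float:
    """Simulated E[(a_hat - a)^2], where a_hat is the mean of M Poisson counts."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(trials):
        a_hat = sum(poisson_sample(a, rng) for _ in range(M)) / M
        squared_errors.append((a_hat - a) ** 2)
    return sum(squared_errors) / trials

a = 4.0  # hypothetical true rate
for M in (1, 10, 100):
    # Columns: sample size, simulated variance, predicted a/M
    print(M, estimator_variance(a, M, trials=2000), a / M)
```

Each tenfold increase in M should cut the simulated variance by roughly a factor of ten, up to simulation noise.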

Why is this? Basically, signals reinforce, noise cancels! To deploy an example that will interest the readers of this blog, if I test the IQ of a random Chinaman, my guess that this IQ is the Chinese average will probably not be a very close one, although my estimate will be unbiased. If I meet a thousand random Chinamen, my guess that their average IQ is the Chinese average will be pretty darn close.

Update: Crap!Crap!Crap!Crap!Crap! My equations look like they went through a blender!

Here's the story: Last week I wrote my first paper in LaTeX. Once you learn the syntax, formatting equations in LaTeX is mildly easier than in MS Word's Equation Editor or MathType, since it can all be done with control sequences instead of mouse-work. But I was particularly taken with the possibility of converting a LaTeX file into HTML using a little bit of freeware called TTH. Unlike, say, the HTML conversion by MS Word, the equations are formatted as real HTML, not gif files.

But there turned out to be numerous problems. First, TTH doesn't recognize all LaTeX syntax. The second problem is Blogger. When I open the TTH output in a browser, it looks fine. When I save that same output into Blogger and read it there . . . well, what you see above is the result of many hours of post-processing on a very complicated source file. And it still removed all my fraction bars. So all those over-under strings that look like they'd be fractions if they had a line between them, well . . . they once did.

Never again. If I'm writing for the Web, I'll use Word.