Friday, February 10, 2012

Great Schools LRA for Fun & Profit

I collected* the rating and demographic data for the elementary and middle schools in the greater metropolitan area of which Phi’s lily-white little burg is a satellite.

There were 68 schools on the list, of which 60 had both demographic data and a rating.  Interestingly, the schools in Phi’s lily-white little burg were included among those returned in a search of the greater metropolitan area (and indeed accounted for three of the four schools rated “10”) even though we are independently incorporated and have our own school district, while most of the other satellite cities were not included.  This may skew the results, but I’ve left them as they are for the purposes of this post.

MS Excel’s linest() function gives its output in the following format:


The m and b values in the first row are the coefficients of the regression line in the format y = m1*x1 + m2*x2 + ... + mn*xn + b, and the other values are listed below:

Se Standard error values for the coefficients m
Seb Standard error value for the constant b
R2 Coefficient of determination
Sey Standard error value for y
F F statistic
Df Degrees of freedom
Ssreg Regression sum of squares
Ssresid Residual sum of squares

While I will provide all the output from the linest() function, if I understand what I’m doing, the value that interests us the most is the coefficient of determination, R2 (R squared), which measures the share of the variation in y accounted for by the inputs.  Note that R, its square root, is the correlation coefficient (up to sign, when there is a single input).  The Se and SS values are relative to the scale of the data, and F is only useful for comparing the results of a controlled experiment.
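For readers without Excel, here is a rough Python sketch of what linest() computes in the one-input case.  The function name linest_simple and the sample numbers are my own inventions for illustration (they are not the school data), and this is just the textbook least-squares arithmetic, not Excel’s actual implementation.

```python
import math

def linest_simple(x, y):
    """Least-squares fit of y = m*x + b, returning LINEST-style statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = mean_y - m * mean_x
    ss_resid = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_reg = ss_total - ss_resid
    df = n - 2  # two fitted parameters: m and b
    se_y = math.sqrt(ss_resid / df)
    return {
        "m": m, "b": b,
        "se_m": se_y / math.sqrt(sxx),
        "se_b": se_y * math.sqrt(1 / n + mean_x ** 2 / sxx),
        "r2": ss_reg / ss_total,
        "se_y": se_y,
        "F": ss_reg / (ss_resid / df),
        "df": df,
        "ss_reg": ss_reg, "ss_resid": ss_resid,
    }

# Made-up illustration data (NOT the school data from this post).
pct = [0, 10, 20, 30, 40]
rating = [9, 8, 8, 6, 5]
stats = linest_simple(pct, rating)
for name, value in stats.items():
    print(name, round(value, 4))
```

The dictionary keys mirror the five rows of the linest() output for a single input: m and b, their standard errors, R2 and Sey, F and Df, and the two sums of squares.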

Here are my results:

GSR by percent of student body that is African-American












Note that with only one input variable, there is only one value of m.

32.8% of the variation in a school’s GreatSchools rating is accounted for by the percent of the student body that is black.  The correlation coefficient therefore has a magnitude of 0.57 (the square root of 0.328), and its sign is negative because the slope is negative.

Not every correlation coefficient is statistically significant.  It depends on the number of samples.  In the back of my text, Elementary Statistics (Triola), I have a table listing the critical values of R for two different α values.  (α is basically the probability that, assuming no relationship between x and y, I could get a correlation at least this strong in a random draw.)  For a sample size of 60, the critical values of R are .254 for α = .05 and .330 for α = .01.  Since the magnitude of R is 0.57, well above both, I can be pretty confident that my results are statistically significant.
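The table lookup is simple enough to sketch in a few lines of Python.  The two critical values are the ones quoted above for a sample size of 60; the function names are hypothetical, and the only assumption is that R takes the sign of the slope.

```python
import math

# Critical values of R for n = 60, as quoted from the Triola table above.
CRITICAL_R = {0.05: 0.254, 0.01: 0.330}

def correlation_from_r2(r2, slope):
    """Recover R from R2; with one input, R takes the sign of the slope."""
    return math.copysign(math.sqrt(r2), slope)

def significant_at(r, critical_r=CRITICAL_R):
    """For each alpha, report whether |R| exceeds the critical value."""
    return {alpha: abs(r) > crit for alpha, crit in critical_r.items()}

# R2 and slope from the percent-black regression.
r_black = correlation_from_r2(0.328, -0.042)
print(round(r_black, 2))        # -0.57
print(significant_at(r_black))  # {0.05: True, 0.01: True}
```

For a different sample size, the dictionary would need the critical values from the matching row of the table.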

Whether or not they are operationally significant is another question.  My m = –0.042 implies that in order to drop a school’s GSR by a single point, I would need to increase the black share of its student body by almost 24 percentage points (1 / 0.042 ≈ 23.8).

Here are the results for Hispanics.

GSR by percent of student body that is Hispanic

-0.177847114 4.477951118
0.110464223 0.45456173
0.042779373 2.796730734
2.59209171 58
20.27457098 453.6587624


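A much stronger m value, but the correlation is not significant: R2 = 0.043 gives a correlation of magnitude about 0.21, which is below the .254 critical value for α = .05.  In only two schools in my sample does the percent Hispanic reach double digits.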
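As a sanity check on reading the linest() grid, the numbers in the Hispanic table can be tested against each other.  This sketch uses only the values printed above; the relationships it checks (R2 from the two sums of squares, Sey from Ssresid and Df, and F from their ratio with a single input) are standard regression identities, not anything specific to Excel.

```python
import math

# Values copied from the Hispanic LINEST output above.
slope = -0.177847114
r2, se_y = 0.042779373, 2.796730734
f_stat, df = 2.59209171, 58
ss_reg, ss_resid = 20.27457098, 453.6587624

# R2 is the regression share of the total sum of squares.
r2_check = ss_reg / (ss_reg + ss_resid)
# Sey is the root of the residual sum of squares per degree of freedom.
se_y_check = math.sqrt(ss_resid / df)
# With one input, F is Ssreg over (Ssresid / Df).
f_check = ss_reg / (ss_resid / df)
# The correlation takes the sign of the slope.
r = math.copysign(math.sqrt(r2), slope)

print(round(r2_check, 6), round(se_y_check, 6), round(f_check, 5))
print(round(r, 3), abs(r) > 0.254)  # compare with the alpha = .05 critical value
```

All three recomputed values agree with the table to the digits shown, and the magnitude of R comes out near 0.21, short of the .254 threshold.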

Let’s find the results for the sum of the black and Hispanic percentages:

GSR by percent of Student Body that is NAM

-0.04541439 6.316920233
0.00789716 0.494386832
0.363133052 2.281229537
33.07082753 58
172.1008577 301.8324756


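A slightly higher coefficient of determination (36.3% versus 32.8%).  Again, though, not much in the way of operational significance: a slope of –0.045 still means about 22 percentage points of NAM share per rating point.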

Let’s look at a scatter plot of GSR by percent NAM:


(Please pardon the crudity of this graph.  I’m a Matlab guy, not an Excel guy, and I was too lazy to figure out how to label or trim the axes.)

A couple of observations:

The data appear to thin out in the 40 – 80% range.  Most of the schools are either NAM or not-NAM.

The effect of NAM percentage looks concentrated in the 0 to 20-25% range.  Once the NAM percentage reaches this threshold, the school on average seems to be as bad as it’s going to get.

I suppose the natural follow-up is to divide my data into two groups and consider the 0-25% NAM and 25-100% NAM groups separately.
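In code, the split-and-refit idea looks something like the sketch below.  The (percent NAM, rating) pairs are invented purely to show the mechanics and are not my school data; the helper names are hypothetical, and putting a school at exactly 25% into the lower group is an arbitrary choice on my part.

```python
def fit_line(points):
    """Ordinary least squares for (x, y) pairs; returns (slope, intercept, r2)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy ** 2 / (sxx * syy)

def split_at(points, threshold=25.0):
    """Split pairs into those at or below the threshold and those above it."""
    low = [p for p in points if p[0] <= threshold]
    high = [p for p in points if p[0] > threshold]
    return low, high

# Made-up (percent NAM, rating) pairs for illustration only.
schools = [(2, 10), (5, 9), (10, 7), (15, 6), (22, 4),
           (40, 3), (60, 4), (80, 3), (95, 3)]

low, high = split_at(schools)
low_slope, low_intercept, low_r2 = fit_line(low)
high_slope, high_intercept, high_r2 = fit_line(high)
print("0-25%:  ", round(low_slope, 3), round(low_r2, 3))
print("25-100%:", round(high_slope, 3), round(high_r2, 3))
```

With data shaped like my scatter plot, the lower group should show a steep slope and the upper group a nearly flat one.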

GSR by NAM percent 0 – 25

-0.29849 9.582504
0.057868 0.775821
0.558875 1.881964
26.60551 21
94.23112 74.37758


Now we’re cooking with gas!  My coefficient of determination is now almost 56%, and my correlation coefficient (magnitude about 0.75) easily meets the more stringent critical threshold for my reduced data set of 23 schools for α = .01.  My new slope m = –0.298 means that in the range 0 – 25%, my GSR will fall by one for each increase of 3.35 percentage points in my NAM percentage, a definite operational significance.
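The “operational significance” arithmetic is just the reciprocal of the slope.  Here it is for each regression in this post, as a small Python sketch; the slopes are copied from the results above, and the function name is my own.

```python
# Slopes (m) from the regressions reported in this post.
slopes = {
    "black, all schools": -0.042,
    "Hispanic, all schools": -0.177847114,
    "NAM, all schools": -0.04541439,
    "NAM, 0-25% only": -0.29849,
}

def points_per_rating_step(slope):
    """Percentage-point change in x that moves the predicted rating by one."""
    return 1.0 / abs(slope)

steps = {name: points_per_rating_step(m) for name, m in slopes.items()}
for name, step in steps.items():
    print(f"{name}: {step:.2f} percentage points per rating point")
```

Bear in mind that the Hispanic slope looks steep here, but that regression was not statistically significant, so its step size should not be read as meaningful.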

This is only one metropolitan area.  I had the advantage of choosing a city with fairly equal percentages of whites and blacks, so I didn’t suffer from range restriction.  I will test these observations against other data sets if I have time, but I encourage my readers who have MS Office to undertake their own studies.

* Getting data from Great Schools can be tedious if done manually.  I will discuss this issue in a subsequent post.
