Saturday, August 17, 2019

A Conspiracy Theory, Examined

Apparently, there were three people at the random mass murder in Las Vegas in 2017 who were also at the Gilroy Garlic Festival random mass murder a couple of weeks ago. Barbie writes:

I'm responding to Vox's OP: The odds against one person in a country of 320 million being in the vicinity of two such events are astronomical.

Las Vegas 2017 attendance: 20,000

Gilroy 2019 attendance: 80,000

I don't know how many attendees were actually physically present at each event at the time of the shootings, but I'll assume two thirds, so 14,520 (sic) and 52,800.

Proportion of US population present at LV shooting: 14,520 / 350,000,000 = .000041 or .0041%

Proportion of the population NOT at LV is the inverse or 99.9959%

Likelihood of one person being at both events is then: 1 - (.999959^52,800). Which is 88.8%. The number of times this apparently happened is 3, so it's 0.888^3, or 70%.


Vox counters:

For example, if the odds of rolling a six on a six-sided die are 1 in 6, then the odds of rolling two sixes on two different six-sided dice are (1/6) * (1/6) = 1/36.

But before we calculate the probability of these two specific independent events, let's get the base numbers right. The Gilroy Garlic Festival is a three-day event, so that 80,000 is reduced to 26,667 before being reduced another one-third as per Uncephalized's assumption to account for the timing of the event. This brings us to an estimated 17,787 people present at the time of the shootings. Note that reducing the estimated 20,000 Las Vegas attendance by the same one-third gives us 13,340, not 14,520.

It's never a good sign when they can't even get the simple division right. Now for the relevant probabilities.

  • Gilroy probability: Dividing 17,787 by 350,000,000 results in a probability of 0.00005082, or one in 19,677.

  • Las Vegas probability: Dividing 13,340 by 350,000,000 results in a probability of 0.00003811428, or one in 26,237

  • Gilroy AND Las Vegas probability: Multiplying 0.00005082 by 0.00003811428 results in a probability of 0.0000000019369677096, or one in 516,270,868.

Both Vox and Barbie are wrong, but Barbie is closer.

Vox's error is the easiest to apprehend, in that he is asking the wrong question: what is the probability of a specific American, or three specific Americans -- say Vox, Barbie and me -- being present at both the Gilroy and Vegas shootings? But the correct question is: what is the probability of overlap between the groups present at Gilroy and Las Vegas? To use Vox's dice analogy, the probability of rolling double sixes is indeed 1/36. But given that I rolled a six on the first roll, the probability of rolling a second six is still only 1/6. This is called conditional probability and is adequately described by Bayes' Theorem, on which I will not elaborate here.
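The dice point can be checked by brute force. Here is a minimal sketch that lists all 36 outcomes of two dice and counts; the variable names are mine and purely illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two six-sided dice.
rolls = list(product(range(1, 7), repeat=2))

# Unconditional: both dice show six.
p_double_six = Fraction(sum(1 for a, b in rolls if a == 6 and b == 6), len(rolls))

# Conditional: keep only the outcomes where the first die already shows six.
first_six = [(a, b) for a, b in rolls if a == 6]
p_second_given_first = Fraction(sum(1 for a, b in first_six if b == 6), len(first_six))

print(p_double_six)          # 1/36
print(p_second_given_first)  # 1/6
```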

Barbie's mistake is more subtle. The first clue is that her formula does not converge to 100%, as it must if, in the extreme case, the sum of the attendance at Gilroy and Vegas were to exceed the population of the U.S. If there were, say, 200M people at each, there would have to be an overlap of at least 50M people, but applying her formula as written, the probability of overlap would still not equal 100%, though it would be close. Her error is that each person attending Gilroy who was NOT at Vegas must be subtracted from each subsequent ratio.
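To see the convergence problem on numbers small enough to check by hand, here is a rough sketch with a made-up population of 6 and two events of 4 people each, so an overlap is forced. The function names are hypothetical, and the first one is only my reading of Barbie's formula as 1 - (1 - V/P)^G.

```python
from fractions import Fraction

def independent_draw_estimate(population, first_group, second_group):
    """Barbie-style formula: treats every second-event attendee as an independent draw."""
    p_miss = Fraction(population - first_group, population)
    return 1 - p_miss ** second_group

def without_replacement(population, first_group, second_group):
    """Shrinks the pool by one after each second-event attendee who was not at the first event."""
    p_no_overlap = Fraction(1)
    for i in range(second_group):
        remaining_outsiders = population - first_group - i
        if remaining_outsiders <= 0:
            return Fraction(1)  # nobody left who missed the first event
        p_no_overlap *= Fraction(remaining_outsiders, population - i)
    return 1 - p_no_overlap

# Hypothetical toy numbers: 6 people, two events of 4 each.
barbie_style = independent_draw_estimate(6, 4, 4)
exact = without_replacement(6, 4, 4)
print(barbie_style)  # 80/81
print(exact)         # 1
```

The independent-draw version stops just short of certainty even though an overlap is unavoidable; the version that shrinks the pool reaches exactly 1.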

Let me explain it with a toy problem so we can count the overlap ourselves. Suppose I have a bag of 6 balls, lettered A thru F. This analogizes to the population of the U.S. I randomly draw 3 balls out. They can be any balls, but let's say they are balls lettered A, B, and C. (This analogizes to the subset of the population present at Vegas in 2017.) I replace the balls, and again randomly draw 3 balls. (This analogizes to the subset of the population at Gilroy in 2019.) Those familiar with combinatorics will recognize the number of possible combinations of 3 balls selected from 6 as nchoosek(6,3) = 6!/(6-3)!/3! = 20 different possible combinations. These are:

A, B, C | A, B, D | A, B, E | A, B, F
A, C, D | A, C, E | A, C, F | A, D, E
A, D, F | A, E, F | B, C, D | B, C, E
B, C, F | B, D, E | B, D, F | B, E, F
C, D, E | C, D, F | C, E, F | D, E, F

Notice that of these combinations, only one has no overlap with the first event: combination D, E, and F. Thus the probability of overlap is not (3/6)^2 = 1/4 as Vox would have it, nor 1 - (3/6)^3 = 7/8 as Barbie would, but rather 19/20 = 95%.
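The count is easy to confirm by enumeration. This is only a sketch of the bag-of-balls toy problem, with the first draw fixed at A, B, C as in the text.

```python
from itertools import combinations

balls = "ABCDEF"
first_draw = set("ABC")  # the balls drawn in the first event

# Every possible second draw of 3 balls from the 6.
second_draws = list(combinations(balls, 3))

# Second draws that share at least one ball with the first draw.
overlapping = [c for c in second_draws if first_draw & set(c)]

print(len(second_draws))                     # 20
print(len(overlapping))                      # 19
print(len(overlapping) / len(second_draws))  # 0.95
```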

There are several ways of getting this number. The easiest is this: When I reach into the bag for the second event, the first ball I remove has a 3/6 probability of not having been drawn in the first event; in other words, of being one of balls D, E, and F. But once I draw one of those balls, and reach into the bag for the second ball, I now have only a 2/5 probability of drawing a non-first-event ball, one of those balls having already been removed. Likewise, the probability of drawing the last remaining non-first-event ball is 1/4. Thus, the probability is arrived at with 1 - (3/6)(2/5)(1/4) = 1 - 6/120 = 114/120 = 95%.

Applying this formula to the mass shooting overlap problem, using e.g. the population of the U.S. at 350M, the subset at Gilroy as 52800, and the subset at Vegas as 13340 gives us: 1 - [(350M - 52800)/350M][(350M - 52800 - 1)/(350M - 1)][(350M - 52800 - 2)/(350M - 2)] . . . [(350M - 52800 - 13339)/(350M - 13339)] = 86.6%. The reader can verify that the same result is arrived at by exchanging the 52800 and 13340 as applied in this formula.
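For anyone who wants to run that product rather than take it on faith, here is a minimal sketch. Summing logarithms instead of multiplying thousands of ratios is my choice, not something stated above, and the function name is hypothetical.

```python
import math

def prob_any_overlap(population, group_a, group_b):
    """1 - product over i < group_b of (population - group_a - i) / (population - i)."""
    log_no_overlap = 0.0
    for i in range(group_b):
        # (population - group_a - i) / (population - i) == 1 - group_a / (population - i);
        # log1p(-x) computes log(1 - x) accurately when x is tiny.
        log_no_overlap += math.log1p(-group_a / (population - i))
    # -expm1(t) is 1 - exp(t)
    return -math.expm1(log_no_overlap)

p_forward = prob_any_overlap(350_000_000, 52_800, 13_340)
p_swapped = prob_any_overlap(350_000_000, 13_340, 52_800)
print(round(p_forward, 3), round(p_swapped, 3))  # both roughly 0.866
```

Exchanging the two crowd sizes changes the number of factors in the loop but not the result, which is the symmetry the text invites the reader to verify.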

Both Vox and Barbie agree that calculating the probabilities of overlap of two or more people is merely a matter of raising the probability of one overlap to the requisite power. This is also wrong, as we can see by inspection in our bag of balls problem. Looking at the list of 20 combinations above, we see that half of them contain either 2 or 3 of balls A, B, and C. So the probability of an overlap of two or more balls is 50%, not (19/20)^2 = 90.25% as Vox and Barbie would have it.

How can we arrive at that count of 10 without listing every combination? Here we will adopt a new approach, though it will yield the same answer as in our previously solved problems.

First, look at the pattern of an overlap of exactly two balls of the three from the first event. This is nchoosek(3,2) = 3 combinations:

A, B | B, C | A, C

Second, each of these combinations fills the last remaining ball with 3 - 2 = 1 ball selected from among 6 - 3 = 3 balls (D, E, and F) that were not drawn in the first event, i.e. nchoosek(6 - 3, 3 - 2) = 3 combinations. The product of these numbers is 9. (I know this seems trivial, but these formulas will generalize to much bigger numbers.)

We repeat this process for an overlap of exactly 3 balls. There is nchoosek(3,3) = 1 possible combination of the first event balls repeated nchoosek(6 - 3, 3 - 3) = 1 time, for an additional 1 combination. 9 + 1 = 10. Dividing by the total number of combinations nchoosek(6,3) = 20 gives us the probability of an overlap of 2 or more at 50%.

The general formula for the probability of overlap greater than or equal to R between two groups of sizes V and G drawn from a population P is therefore:

$$\Pr(\text{overlap} \ge R) = \sum_{k=R}^{\min(V,G)} \frac{\binom{V}{k}\binom{P-V}{G-k}}{\binom{P}{G}}$$

where min(V,G) is the smaller of the number at the two events, and the stacked () is the standard mathematical symbol for nchoosek().
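As a sanity check on the general formula, here is a small sketch that evaluates the sum exactly for the bag-of-balls problem: for each overlap size k from R up to min(V,G), it counts nchoosek(V,k) * nchoosek(P - V, G - k) combinations and divides by nchoosek(P,G). Python's math.comb stands in for nchoosek(), and the function name is my own.

```python
from fractions import Fraction
from math import comb

def prob_overlap_at_least(r, population, v, g):
    """Sum over k = r..min(v, g) of C(v, k) * C(population - v, g - k) / C(population, g)."""
    total = comb(population, g)
    ways = sum(
        comb(v, k) * comb(population - v, g - k)
        for k in range(r, min(v, g) + 1)
    )
    return Fraction(ways, total)

print(prob_overlap_at_least(1, 6, 3, 3))  # 19/20
print(prob_overlap_at_least(2, 6, 3, 3))  # 1/2
```

Both toy results match the counts obtained by inspection of the 20 combinations.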

Note that ordinary floating-point arithmetic cannot represent factorials as high as this problem calls for; however, we can calculate the log of Stirling's estimator as modified by Gosper: log(n!) = 0.5 * log[(2 * n + 1/3) * pi] + n * log(n / e). We can therefore calculate the log of each probability by adding and subtracting the logs of the factorials, then exponentiate each term before the summation.
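A quick way to gain confidence in the Gosper estimator is to compare it with a library log-factorial. The sketch below uses Python's math.lgamma (lgamma(n + 1) equals log(n!)) purely as a reference; the function name is hypothetical.

```python
import math

def log_factorial_gosper(n):
    """Natural log of Gosper's estimate n! ~ sqrt((2n + 1/3) * pi) * (n / e)**n, for n >= 1."""
    return 0.5 * math.log((2 * n + 1 / 3) * math.pi) + n * (math.log(n) - 1)

# Compare against the standard library's log-gamma for a few sizes.
for n in (5, 50, 350_000_000):
    print(n, log_factorial_gosper(n), math.lgamma(n + 1))
```

The two columns agree more closely as n grows, which is what one wants for a problem dominated by very large factorials.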

I know you're all looking for the answer. The probability that three or more individuals (R = 3) in America (P = 350M) attending Vegas (V = 13340) would also attend Gilroy (G = 52800) is . . . 32.7%. Give or take. This algorithm is pretty complicated, so I ran it swapping the V and G values (it takes several minutes), and I arrived at the same result as I expected, so I'm actually pretty confident this is the correct answer.

Please note that the above analysis assumes that attendees at both Gilroy and Vegas were selected randomly from the U.S. general population. That assumption is almost certainly false. My intuition was that people living within driving distance of Vegas and Gilroy were far more heavily represented at these events than those who were not. But who knows? I personally would have had no interest in either event; in contrast, there may be a class of people who travel vast distances to attend fairs, festivals, conventions and the like. (I have a colleague who recently crossed state lines to attend a board game convention. I have no idea why SWPLs do what they do.)

Also note that this applies for "any three individuals". If there were something special about them, for instance that they all turned out to be the only members of a radical pro-confiscation outfit that we might suspect of shenanigans, then the probability of their randomly being at both Vegas and Gilroy starts to look a lot more like Vox's numbers.

UPDATE: I eventually realized that the additional probability added by considering higher and higher levels of overlap eventually falls below the smallest floating point value. Once a term in the probability calculation underflows to zero, the summation in the equation above can be safely terminated. That reduced the run-time from a few minutes to a few seconds.
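Here is a rough sketch of that early-terminating summation. It is not the code I ran: it swaps in Python's math.lgamma for the Stirling-Gosper log-factorial, and the function names are hypothetical.

```python
import math

def log_comb(n, k):
    """log of n-choose-k via log-gamma (lgamma(n + 1) equals log(n!))."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def prob_overlap_at_least(r, population, v, g):
    """Sum the overlap probabilities for k = r, r + 1, ... until a term underflows."""
    log_total = log_comb(population, g)
    prob = 0.0
    for k in range(r, min(v, g) + 1):
        log_term = log_comb(v, k) + log_comb(population - v, g - k) - log_total
        term = math.exp(log_term)
        if term == 0.0 and prob > 0.0:
            break  # later terms only get smaller, so they cannot change the sum
        prob += term
    return prob

# The two crowd estimates used above, 52,800 and 13,340, out of 350M.
p_three_or_more = prob_overlap_at_least(3, 350_000_000, 13_340, 52_800)
print(round(p_three_or_more, 3))
```

With these inputs the loop stops after a couple of hundred terms instead of running through all 13,340, and the result should land near the 32.7% quoted above.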

Which means I could play with some additional numbers. For instance, using Vox's estimate of 17,787 at Gilroy instead of 52800, I calculated the probability of an overlap of at least one person between Gilroy and Vegas at 49%, and of at least three people at 3%. This means:

  • 1. that attendance at festivals isn't really drawn randomly from the U.S. population;

  • 2. that mass shootings are common enough that even a 3% probability is going to hit every so often; or

  • 3. the people turning up at multiple mass shootings aren't randomly selected.

My intuition is that option 1 explains LV-Gilroy. Using Barbie's 3000 number for Parkland, the probability of an overlap of at least one person with Las Vegas is not quite 11%. I think option 2 explains LV-Parkland. Random mass murders aren't very common either as a percentage of the U.S. population or as a fraction of the murder rate, but they're common enough for 11% probabilities to occur.

I should note we know the LV-Borderline overlap isn't random according to the official story: Borderline is a popular college student hangout, and a lot of college students from that area had gone down to the Route 91 whatever in LV in 2017. They even had T-shirts made up that said "I survived Vegas" or something, and the murderer chose Borderline for that reason. But for completeness, assuming that there were 200 people at Borderline that night, the probability of a random Vegas overlap is less than 1%.