Sunday, February 05, 2017

Generation Y

Content Warning: Mathematics, Probability, Monte-Carlo Simulations

I was thinking of the academic paper Steve wrote about the other day, the one that documented the discovery that 60% of the current European male population carried the Y-chromosomes of only three men living 5000 (or so) years ago. This was represented as showing extraordinary reproductive capacity among those three men and their male descendants, but I was curious as to how much of that capacity could be explained by random variation.

For instance, imagine 1000 men have two children each. Assuming an equal probability of having a boy or a girl, the rules of probability require the expectation that 250 of those 1000 men have two girls; thus, their Y chromosomes are eliminated in the first generation. And 250 families will have two boys; thus, their Y chromosomes will be held by half the first generation. It logically follows that, given a sufficient number of generations, random chance will eventually leave only one of the original 1000 Y chromosomes.

The math gets highly iterative after this, and I wanted to see how the distribution of Y chromosomes would change over time. But I wanted to alter the model to account for a growing population. I also wanted to let the number of children per family vary randomly as well, and for this I chose a Rayleigh distribution.

So, naturally, I wrote a Monte-Carlo simulation in Matlab:

% Ygen:  Calculate remaining Y-chromosomes over generations
clear; close all;
N=10000;    % Initial Male Population
Y=1:N;      % Chromosomes numbered 1 to N
nGen = 40;  % Number of generations
mu = 2:.05:2.2;             % TFRs to try
topone = zeros(5,nGen);     % Share of Y held by top 1%
cdf = zeros(5,round(N/10)); % Cumulative Density Function of Y
tPop = zeros(5,nGen);       % Total Population
tY = zeros(5,nGen);         % Remainin Y
for i=1:5,   % Total fertility rate (TFR)
    sigma = mu(i)/sqrt(pi/2); % scale parameter for Rayleigh distribution
    M0 = Y;     % INitialize array of Y
    tM1 = N;    % Initialize total number of men
    for j = 1:nGen,
        tM0 = tM1;
        cC1 = round(random('rayl', sigma, 1, tM0)); % number of children
        cM1 = binornd(cC1, 0.5); % number of male children
        caM1 = cumsum(cM1);
        tM1 = caM1(end); % Total number of men in next generation
        M1 = zeros(1,tM1);
        % Calculate Y for next generation
        % M1 = repelem(M0, cM1); % Not available until R2015
        % Alternative for R2014 and earlier:
        M1(1:caM1(1)) = repmat(M0(1),1,cM1(1));
        p = 1;
        for k = 2:tM0,
            q = p + cM1(k) - 1; % Current stopping index
            M1(p:q) = repmat(M0(k),1,cM1(k));
            % M1((caM1(j-1)+1):caM1(j)) = repmat(M0(j),1,cM1(j));
            p = q + 1; % New starting index
        % End Alternative
        M0 = M1;
        tPop(i,j) = tM1;
        tY(i,j) = numel(unique(M0));
        disp(['TFR = ', num2str(mu(i)),', Gen ', num2str(j), ': ', ...
            num2str(numel(unique(M0))), ' Y lines among ', ...
            num2str(tM1), ' men.']);
        n = hist(M0, Y)/numel(M0);
        ns = sort(n,'descend');
        topone(i,j) = sum(ns(1:round(N/100)));
    cdf(i,:) = cumsum(ns(1:round(N/10)));
h1 = figure('Name','Top 1% Y Share');
title('Share of Y-chromosomes carried by top 1%');
ylabel('Y-share'); xlabel('Generation #');
legend('TFR = 2','2.05','2.1','2.15','2.2');
h2 = figure('Name','CDF of Y');
n = hist(M0, Y)/numel(M0);
ns = sort(n,'descend');
title('Cumulative Density Function of Y after 40 Generations');
ylabel('Y-share'); xlabel('Y');
legend('TFR = 2','2.05','2.1','2.15','2.2');
h3 = figure;
title('Total Male Population');ylabel('Population'),xlabel('Generation #');
legend('TFR = 2','2.05','2.1','2.15','2.2');
title('Remaining Y');
xlabel('Generation #');
legend('TFR = 2','2.05','2.1','2.15','2.2');

As you can see, I examined five values of total fertility ranging from a mean value of two children per family to 2.2 children per family, and measured several quantities over 40 generations from a starting population of 10000 men. Note that because the distribution of family size is (roughly) Gaussian, some families will have zero or one child, and some families will have more than two. This may not sound like much of a range, but over 40 generations, it makes a big difference in the total male population:

We can also see the elimination of Y-chromosomes over time:

We can see that the higher TFRs retain more of their initial number of Y chromosomes longer. This is likely a combination of the populations getting bigger faster, making it progressively more difficult for random chance to produce an all-girl generation for a single chromosome, and also a relatively lower probability that any given family would produce only girls, given that they get more chances for a boy.

Now let's look at the cumulative density functions for the Y-chromosome distributions:

For instance, we see that with a TFR of 2 (and therefore a constant male population around 10000), only 500 or so of the original 10000 Y-chromosomes remain after 40 generations. In contrast, for a TFR of 2.2 (and a population approaching a half-million), the top 500 Y-chromosomes are only carried by about half the male population; the other half carry one of the other 1700 Y-chromosomes still in existence.

Finally, let's look at how the share of the male population carrying the top 100 (or 1% of the original) Y-chromosomes changes over time:

In contrast to my other graphs, this data is noisy. Better graphs could be generated with more lengthy simulations. Feel free to borrow this code and try it yourself, and if you have the 2015 version of Matlab or later, you can avail yourself of the repelem() function, which makes these simulations much faster. In any case, here we see that as the population gets higher, it becomes increasingly difficult to eliminate Y chromosomes from the gene pool. We also see that in none of these models do even the top 100--let alone the top 3--chromosomes account for 60% of the male population after 40 generations.

Of course, 40 generations is only, what, 1000 years? Not 5000 as per the paper. And I will speculate that the reproductive history of Europe has been characterized by much higher TFR rates than those I have examined, punctuated by war, pestilence and plague, and thus plenty of opportunities for Darwinian struggle. But certainly, multi-generational differentials in fertility are necessary to account for the observed Y distributions among the present numbers of Europeans.

Let's rerun the above code for N = 1000 and nGen = 100 to produce this:

So, if I keep my male population low enough (1000) for long enough (100 generations), I can (at least on this particular simulation) get the top three Y chromosomes up to a 46% share of the population. But look at this graph from the same simulation:

Note that with a growing population (TFR = 2.2), between generation 50 (population 131K) and generation 100 (population 15.6M), the number of extant Y chromosomes stabilized at 216 from the original 1000. Not one single Y chromosome was eliminated in over a thousand years of procreation. So while it is theoretically possible that random chance would eventually eliminate all but one of the original 1000 Ys, my expectation is . . . well, the sun will probably have fizzled out by then.


Doug Cranmer said...

Proof by Matlab, eh?

These numerical results are only useful to validate exact analysis. The data should be presented as points with error bars and not by connected lines.

It is the exact analysis that should be plotted as lines. First thing they teach you in University.

If you're unable to do that you should not be allowed anywhere near a Matlab license.

Dr. Φ said...

Doug: thank you for visiting.

First, nowhere (I jjust checked) did I use the word "proof". Second, if you are asserting that Monte Carlo simulations are not a valid basis for statistical inference, it would have been helpful had you brought a citation.

But that said, I would very much like to see an "exact analysis" if you would like to take the project on.

Doug Cranmer said...

"Proof by Matlab," is a saying in the mathematical world.

And demanding that someone else do the real work is pretty typical with Matlab jockeys.

Do it yourself.