Why physical appearance is an imperfect individual proxy for ancestry

Kalash children

Pictured above are some Kalash children. You notice in the foreground and center a child who could easily pass as European and draw no notice on the streets of Gdansk, Poland. But look at the child right behind her: I would guess she’d draw no notice on the streets of New Delhi!

Though the Kalash are noted for their fair features, most of them look more West Asian than anything else, and from what I can tell as many have a “northwest Indian” phenotype as a “European” one. Genetically we know that they are good proxies for “Ancestral North Indians” (ANI). About 30% of their ancestry can be modeled as deriving from steppe peoples such as the Sintashta, i.e., the Indo-Aryans. The other ~70% of their ancestry is similar to that of the Indus Valley Civilization (IVC) people, which itself can be decomposed as mostly ancient Southwest Eurasian-adjacent ancestry (i.e., derived after the Last Glacial Maximum from the ancestors of Zagros farmers) and a minority of ancestry that is more like that of Andaman Islanders and pre-Neolithic Southeast Asians (“Ancient Ancestral South Indians,” or AASI).

Another thing to note about the Kalash is that they are genetically very homogeneous. This is because they live in an isolated region, and their non-Muslim religion means that they have not intermarried with nearby Muslim peoples. What does this imply? It means that the Indian-looking girl is exactly the same ancestrally as the European-looking girl. Both have the same proportion of AASI and Indo-Aryan ancestry. That being said, the Indian-looking girl exhibits features more like those of the AASI than the European-looking girl does. Why?

The simple reason is that the genes which vary and encode salient physical features are a much smaller subset than the total genome. Therefore, they are subject to much higher variance from individual to individual (lower N in the denominator).
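The denominator argument can be put in rough numbers. What follows is a minimal sketch under a simplifying assumption that is mine, not a claim about real genomes: every allele copy at every locus is treated as an independent draw from a minority ancestry with probability p. Real loci are linked and do not contribute equally to a trait. The function names, the value p = 0.30, and the choice of 12 "appearance" loci are all illustrative.

```python
import math
import random


def ancestry_fraction_sd(p, n_loci):
    """Standard deviation of the fraction of allele copies drawn from one
    ancestry, when each of the 2 * n_loci copies is an independent draw
    with probability p (binomial model, an illustrative assumption)."""
    return math.sqrt(p * (1 - p) / (2 * n_loci))


def simulate_fraction(p, n_loci, rng):
    """Simulate one individual's ancestry fraction over n_loci diploid loci."""
    copies = 2 * n_loci
    hits = sum(1 for _ in range(copies) if rng.random() < p)
    return hits / copies


if __name__ == "__main__":
    p = 0.30  # genome-wide share of the minority ancestry (illustrative)
    for n_loci in (12, 100_000):  # a dozen trait loci vs. a small SNP array
        print(n_loci, round(ancestry_fraction_sd(p, n_loci), 4))
    rng = random.Random(0)  # fixed seed so the draws are repeatable
    print([round(simulate_fraction(p, 12, rng), 2) for _ in range(5)])
```

Under these assumptions the spread between individuals is on the order of nine percentage points for 12 loci, but only about a tenth of a percentage point for 100,000 loci. That is why two people with the same genome-wide ancestry can still differ visibly at the handful of loci that shape appearance.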

Here’s a concrete example. Compare inferring eye color to inferring your total ancestry. Modern SNP-array ancestry inference relies on 100,000 to 1 million genomic positions. It is pretty good as a proxy for the 10 to 100 million SNPs out of your 3 billion base pairs that define your variable ancestry. For eye color, there are a few dozen genes at most, and more honestly a handful that really impact variation. For Europeans, 75% of the variation of blue vs. non-blue eye color is due to variation around one genetic region, the HERC2-OCA2 locus. This means that just because someone has blue eyes, one can’t be sure that they have much European ancestry at all!

In the 1000 Genomes South Asian populations the SNPs for “blue eyes” are at 2 to 10% frequency, depending on the population. Since the expression is recessive (you need both copies of the “blue eye” variant), assuming just this SNP you’d expect the trait to manifest in roughly 0.04% to 1% of people in Indian-origin populations (the square of the allele frequency). The people with blue eyes have no more or less European ancestry than anyone else in their family.
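The arithmetic behind that estimate is just the allele frequency squared. Here is a minimal sketch, assuming Hardy-Weinberg proportions (random mating) and a single fully recessive variant; both are simplifications, and the function name is my own.

```python
def recessive_phenotype_frequency(allele_freq):
    """Expected share of people carrying two copies of a recessive variant,
    assuming Hardy-Weinberg proportions (random mating, one locus)."""
    return allele_freq ** 2


if __name__ == "__main__":
    for q in (0.02, 0.10):  # the 2% and 10% allele frequencies quoted above
        share = recessive_phenotype_frequency(q)
        print(f"allele frequency {q:.0%} -> phenotype frequency {share:.2%}")
```

Squaring 2% gives about 0.04% and squaring 10% gives 1%, in line with the range quoted above.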

Where does this leave us? You should understand from this that within a given family or ethnic group there is going to be a range of appearances, and a range is normal within many groups without exotic ancestry. Most Bengalis have 5-20% East Asian ancestry (closer to 5 in West Bengal, closer to 20 in Comilla and Chittagong). This means most of their ancestry is South Asian, and most Bengalis look just like other Indian-origin people. But a substantial minority look somewhat East Asian, to varying degrees. This is exactly what you expect when you have a minority quantum of ancestry.

Finally, many of the commenters here made a lot of assumptions about vloggers talking about their ancestry and were quite rude. I wish you wouldn’t do that. As a matter of fact, many of the inferences may actually be correct, but you don’t know for sure, and you don’t know the whole story. I’m pretty liberal on the comments of this weblog, but if you exhibit a serial pattern of rudeness I’m going to start randomly deleting your comments (if you complain about this I will immediately ban your IP).


Most Bangladeshis are 10% to 20% East Asian

I wish consumer genetic tests did a better job of communicating the madness to the methods. The vlogger above is a bit confused because one of her grandmothers looks rather East Asian, but her DNA results clearly indicate her Bengali ancestry. What the Ancestry DNA test does not make clear is that Bengali ancestry includes within it 10-20% East Asian ancestry.


Indus Valley, Sintashta, and Andamanese ancestry in select groups


I ran qpAdmin on some populations. In the table below, an empty cell means that the model isn’t very good with that population. In other cases, the model doesn’t work without a particular population. So, if you put East Asians into the model for most South Asians it kind of goes crazy…but without East Asians, Bengalis and Munda are not modeled too well.

I used the exact left and right populations as outlined in the Narasimhan et al. paper when possible. You can see that East Asians are part of the model for Bengalis, so they are removed from the “right” set of populations in that model.

My results are very close to Narasimhan et al. (the main difference is that my reference set is slightly different from the Reich lab’s). Additionally, please note my intuition is that this overestimates Sintashta ancestry by a few percent. That being said, take a look at the Ror (Jatt), Kamboj, and Brahmins from Uttar Pradesh. The Ror have more Indo-Aryan and more Andamanese than the Kamboj. The Uttar Pradesh Brahmin is about the same fraction Indo-Aryan as the Kamboj but has about ten times as much Andamanese ancestry.

Using my own data to test some stuff, I notice:

1) My parents are both “outliers” from the Bangladeshis collected in Dhaka. Not too surprising, as my family is from low country Comilla, and more “East Asian” than usual.

2) My father is more “steppe shifted.” This always shows up in various analyses. And, it is not surprising. His maternal grandfather was from a Bengali Brahmin family (they all converted the previous generation).

3) Weirdly, I am quite near my father on this plot. Mendelian segregation I assume. I have a 23andMe and a SNP file generated from 30x WGS, and they land on the same spot. So it’s not some artifact.


Please read Who We Are and How We Got Here

Many questions on this weblog would be answered if the individuals just read Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past. Not all questions would be answered. The book is dated in some ways, and there are certain lacunae. There are also things we still don’t know to any great satisfaction (e.g., Eastern Eurasia is under-understood). But to a first approximation, this book answers most big questions, at least from a scientific perspective.

Though the American price on Kindle is $4.99, this may not be feasible for some readers. There are free preprints of almost all of the Reich lab’s publications on the lab’s website.

This post seems relevant since new readers may not be aware of the resources out there.


Genetic odds & ends

At my other weblog I report on evidence that a sample from Cambodia dated to 100 to 300 AD seems to have considerable Indian ancestry. This is not a result in isolation. Lots of evidence points to non-trivial Indian gene flow. The devil is now in the details of when/who.

Second, there is lots of talk about “person X looks like population Y, so perhaps they have ancestry from population Y.” This is almost certainly wrong in most cases.

Looking at Indian populations there tends to be far more variation in physical appearance within a population than the variation of total ancestry. In other words, some Tamil Brahmins look like South Indian Tribal people and other Tamil Brahmins look like West Asians. But in terms of total ancestral components, there’s no difference.

The theoretical explanation for what’s going on is that the genetic loci which control “physical appearance” are much smaller in number than the whole genome (on the order of dozens of loci). As such, the sample variance is rather large (the N denominator is small).

South Asian populations differ from one another, but there is usually quite large within-population variation in the genetic variants implicated in physical characteristics. This means that there is a wide range of appearances within each population.

Though a lot of the discussion involves Muslims, I have heard from multiple non-Muslim people of Northwest Indian stock (e.g., Pandits) that they must have “Persian ancestry” because they look so Persian. The genetics refutes this rather strongly. Rather, modern Persians and many Northwest Indians share deep ancestry which diverged after the Last Glacial Maximum 20,000 years ago.


The Unravelling of the AMT

The thought of writing this article came as I recalled a recent interview of Vagheesh Narasimhan with the Caravan magazine, where he explains how in his view, the Indo-Aryans must have spread across South Asia.

Before coming to what Vagheesh said in the interview, let us take a brief detour so that his comments can be understood in their proper context.

The Textual Evidence for AMT

Except for the truly ignorant on the subject, it is clear as daylight to all scholars, whether Indian or Western, that the Rigvedic geography is centred in North India, more specifically around Punjab, Haryana & Western UP.  The westernmost lands mentioned in Rigveda are the eastern regions of Afghanistan and these were certainly peripheral in the scheme of things of Rigvedic Aryans.

Yet, through the last two centuries several attempts have been made to parse out some sort of evidence from the Rigveda or any of the early Vedic texts, in the form of memory or otherwise, that could support the argument of an extra-Indian homeland of the Rigvedic Indo-Aryans. However, all such attempts have come to naught.

Let us go through the opinions of mainstream Western Indologists so that there remains no room for doubt on the matter.

Edwin Bryant notes in his seminal book,

The first prominent note of discord between traditional exegesis and Western scholarship was sounded because of the lack of explicit mention, in the Vedic texts, of a foreign homeland of the Aryan people. As mentioned previously, this conspicuous silence had been noted even by nineteenth-century Western scholars (e.g., Elphinstone 1841). The absence of any mention of external Aryan origins in traditional Sanskrit sources is, to this day, perhaps the single most prominent objection raised by much of the scholarship claiming indigenous origins for the Aryan culture. (pg 59)

Already in the middle of the 19th century we have scholars such as Curzon (1855), who argues, “Is it legitimate … to infer that because the Aryans early spread to the South . . . and extended themselves over the peninsula, they also originally invaded, from some unknown region and conquered India itself?” (pg 65) and Muir (1860), who notes that “none of the Sanskrit books, not even the most ancient, contain any distinct reference or allusion to the foreign origin of the Indians” (pg 63).

Bryant quotes Srinivas Iyengar, who in 1914 quite pertinently said,

The Aryas do not refer to any foreign country as their original home, do not refer to themselves as coming from beyond India, do not name any place in India after the names of places in their original land as conquerors and colonizers always do, but speak of themselves exactly as sons of the soil would do. If they had been foreign invaders, it would have been humanly impossible for all memory of such invasion to have been utterly obliterated from memory in such a short time as represents the differences between the Vedic and Avestan dialects. (pg 59)

Bryant refers to Indian scholars as early as the latter half of the 19th century who object to the external origins of the Indo-Aryans, which should clear the doubts of those who think that opposition to AIT/AMT is a modern Hindutva invention.

As per Bryant, “… the fact that the Vedas themselves make no mention of any Aryan invasion or immigration reveals a major epistemological concern in this debate.” (pg 59)

Bryant concludes the chapter thus, “The sequence of texts does seem to suggest a movement of the Brahmanic geographical horizons from the Northwest to other parts of India. Nonetheless, the Indigenous response needs to be considered: the texts give no obvious indication of a movement into India itself. Indigenous Aryanists, on the whole, are prepared to accept a shift of population from the Sarasvati region eastward toward the Gangetic plain…But they do not feel compelled to then project this into preconceived hypothetical movements into the subcontinent itself in the pre- and protohistoric period.”

Hans Henrich Hock, a well-known linguist and Sanskritist, in his contribution to this major volume, The Indo-Aryan Controversy, also observes,

Some publications claim that the Rig-Veda contains actual textual evidence for an Aryan in-migration…suffice it to state that none of them provide unambiguous clues that the point of origin for these travels was further (north-)west or outside of India/South Asia, or that the direction of travel was to the east or further into India/South Asia. (pg 290)

Hock rather candidly tells us that “…the passages cited by Biswas and Witzel do not provide cogent evidence for Aryan in-migration and thus cannot be used to counter the claim of opponents of the so-called “Aryan Invasion Theory” (e.g. Rajaram and Frawley 1997: 233) that there is no indigenous tradition of an outside origin.” (pg 291)

Another major linguist, George Cardona, concurs that “… there is no textual evidence in the early literary traditions unambiguously showing a trace of such migration.” (pg 38)

Cardona goes one step further and analyses a particular passage Michael Witzel, an ardent proponent of the AMT, cites from the Baudhayana Srauta Sutra, to support his argument of textual evidence.

American Caste (b)

America has a national crisis in math capacity, competence and merit. American students sharply underperform students in many countries all over the world, including Vietnam, which is a poorer country than India per capita. We will refer heavily to the 2018 OECD PISA report in the paragraphs below, but the chart below is from the 2015 OECD PISA report because math scores are reported for more countries in the 2015 report. Perhaps the 2018 report will be revised to add more countries in the future:

In my view, a level 5 PISA score is the minimum requirement for a person to be considered a high school graduate who is literate in math, able to function in the modern global economy, or qualified to attend college. The PISA report defines a fifteen-year-old scoring at level 5 or better as one who “can model complex situations mathematically, and can select, compare and evaluate appropriate problem-solving strategies for dealing with them.” How does America perform in the 2018 PISA report?:

  • United States: 8% of students scored at Level 5 or higher in mathematics
  • OECD average: 11%
  • Six Asian countries and economies had the largest shares of students who did so:
    • Beijing, Shanghai, Jiangsu and Zhejiang (China): 44%
    • Singapore: 37%
    • Hong Kong (China): 29%
    • Macao (China): 28%
    • Chinese Taipei: 23%
    • Korea: 21%

Note that these six countries were among the poorest countries in the world in the 1950s, far poorer than poor Americans or poor Europeans or poor Chileans can even imagine. In 1979 China was unbelievably poor. Much of the population of China–perhaps as many as 100 million–had starved to death because of extreme poverty in the 1970s. Poor children around the world are outperforming American children in mathematics despite extremely low education spending per student and very low socio-economic level of their legal guardians, where socio-economic level is defined as:

  • income
  • wealth
  • formal education of parents

Do any American high school student subgroups perform well in Mathematics? Yes, “people of color” or “minority” Americans perform well in Mathematics. America’s “people of color” or “minority” students are orders of magnitude more likely to get an 800 on the mathematics SAT than European Americans. If we assume this is an extreme tail end distribution issue related to European Americans having a lower standard deviation and non standard distribution in mathematics performance relative to “people of color” or “minority” Americans, we can explore the breakdown of Americans who score between 750 and 800 on the Mathematics SAT. Here European Americans perform far better relative to “people of color” or “minority” Americans.  In 2015 16,000 European Americans scored 750 or higher. 33,000 “people of color” and “minority” Americans scored 750 or higher. We further know that 51% of SAT test takers were European Americans and 49% were “people of color” or “minority” Americans.  “People of color” or “minority” Americans are [33,000/16,000]*[51%/49%] or 2.15 times as likely to score 750 or higher on the mathematics SAT compared to European Americans.  If we examine the 107,900 test takers who got SAT math scores of 700 or higher; 59,900 are “people of color” or “minority” Americans, versus 48,000 European Americans. “People of color” or “minority” Americans are [59,900/48,000]*[51%/49%] or 1.30 times as likely to score 700 or higher on the mathematics SAT compared to European Americans. For data junkie geeks like me there is a lot more data on SAT math score distributions here and here. The Greta Anderson article’s comment section in particular has some very intelligent commentators who have studied the American SAT score distribution. This is likely to be the subject of many future blog posts and Brown Pundits Podcasts.
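For readers who want to check the ratios, here is a minimal sketch of the arithmetic. It takes the counts and test-taker shares exactly as quoted in the paragraph above; the function and argument names are my own.

```python
def relative_rate(count_a, count_b, share_a, share_b):
    """How many times as likely group A is to reach a score threshold as
    group B, given each group's count of scorers and share of test takers."""
    return (count_a / count_b) * (share_b / share_a)


if __name__ == "__main__":
    # Group A: "people of color" or "minority" test takers (49% of takers).
    # Group B: European American test takers (51% of takers).
    print(round(relative_rate(33_000, 16_000, 0.49, 0.51), 2))  # 750 or higher
    print(round(relative_rate(59_900, 48_000, 0.49, 0.51), 2))  # 700 or higher
```

Dividing the counts and then correcting for the 51%/49% split reproduces the 2.15 and 1.30 figures quoted above.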

What about this is worrying?:

  1. European Americans in particular are sharply under-performing both very poor children around the world and “people of color” and “minority” Americans in mathematics.
  2. American mathematics SAT scores have fallen between 1972 and 2016. 1972 is the earliest year for which I could find comparable SAT mathematics scores. In 2017, 2018 and 2019 the SAT mathematics exam was completely restructured to make scores no longer comparable to SAT mathematics scores between 1972 and 2016.
  3. 90% or more of current jobs and businesses are likely to be replaced by artificial intelligence (AI), brain electro-therapy (meditation . . . practiced by civilizations around the world for over 5,000 years), brain sound therapy (naad or mantra yoga and their equivalents in Native American, Egyptian, Sumerian, Taoist and other civilizations around the world for over 5,000 years), bio-engineering tissue, genetic editing, and fused AI-brain interface synthesis intelligence. Almost all of these future disciplines are complementary to mathematics.

Future articles and podcasts are planned on all six of these future disciplines. If you are curious about fused AI-brain interface synthesis intelligence, please watch my main man Elon Musk:

Some say that the tension and relationship challenges between America’s four big castes (European Americans, European “Latino” Americans, Black Americans and Asian Americans) are driving low math scores for European Americans “AND” other Americans. One example is where thought leader Mark J Perry explores the possibility that tension between the European American caste and the Asian American caste is lowering American mathematics performance. Excerpts of his article are reproduced below:

Some admixture coefficients for South Asian Genotype Project members

I decided to run qpAdmin on a large number of the South Asian Genotype Project members. The codes should be self-evident for the individuals. The Indus Periphery samples are from the Reich dataset. The steppe is all Sintashta samples from the recent publication (I removed outliers). The Andamanese hunter-gatherers are from the Andamans.

Some of the populations are not good fits on the India cline. Adding Dai as East Asian improves the fit for the Bengali Kayastha. But it messes it up for most of the others.

Please note that these are individuals. There is going to be variance within populations.

A model runs through it

Recently I made a comment that I appreciate what 23andMe and Ancestry have done with their South Asian ancestry updates. My own results came into sharper focus. The algorithms did what they were supposed to do.

Both of the companies found that I’m probably Bengali. 23andMe, with its massive database, and SVM framework, even narrowed down where in Bangladesh my family is from.

Both my parents are from Comilla. More specifically, my mother’s family is from Homna (though her maternal grandfather was from Noakhali by origin). When I was small I was sent to stay with my mother’s relatives in Sreemudi village, which I can now find on Google maps! My father’s family is from just outside of Chandpur. Basically, my family hails from the lower reaches of the Meghna river. And more precisely, the eastern shore of the Meghna.

And yet this analysis is missing something. The term and category “Bengali” has implicit within it other phenomena. I generated a PCA which illustrates this well:

You can see I’m pretty clearly shifted toward East Asians. That’s because that’s common in Bengalis. That seems like it’s interesting information people would like to know. But simply creating a “Bengali” category masks all that.

Speaking of genetics, I finally got around to playing around with qpAdmin. People keep asking me for the Bengali percentages of the various ancestral components in the recent Reich lab India paper. Well, I ran the same model (mostly, not exactly sure of all the samples….), and got some results.

Population         | Indus Valley | Steppe | AHG/AASI | East Asian | Birhor (Munda)
Punjabi – Lahore   | 0.58         | 0.2    | 0.192    | 0.03       |
Tamil – Sri Lanka  | 0.57         | 0.07   | 0.38     | -0.025     |
Bengali            | 0.264        | 0.136  | -0.075   |            | 0.675

The “Bengali” sample is from the 1000 Genomes. You can see that 12.5% of the ancestry is “East Asian”. These are Dai. The AHG are modeled as being related to the Andamanese as per the Reich lab paper, and Indus Valley are the pooled IndPe samples. Steppe are Sintashta.

I ran the other 1000 Genomes samples with the same model. The -0.025 (i.e., -2.5%) East Asian estimate for Tamils indicates that this component is really not necessary for them. I kept the East Asian in there to compare apples to apples with the Bengalis.

I also looked at a Munda population, the Birhor. The results align perfectly with what we know. The Munda have no steppe ancestry. But, they have a lot of East Asian ancestry. One hypothesis for Bengalis is that they have Munda ancestry. But when I add them to the model you can see the results are crazy. If I swap out the East Asians with the Munda the results make some sense, but standard errors are way higher than in the model with the Dai/East Asians.

Basically, Bengali (Dhaka) samples have East Asian ancestry that’s more like populations to their east, and not like the Munda to their south and west.