Tuesday, May 14, 2013

Issues with Ice Age linguistics

Last week I had a few friends ask me about a recently published study titled "Ultraconserved words point to deep language ancestry across Eurasia" by Mark Pagel, Quentin D. Atkinson, Andreea S. Calude and Andrew Meade. It's been making headlines all over the globe in articles with titles like "English May Have Retained Words From an Ice Age Language" (Wired.com), "Ice Age language may share words with modern tongues" (News.com.au and various sites) and "15000-year-old 'fossil' words reveal ancestral Ice Age language" (LA Times).

You can download their report here. Also, the data for the study comes from the Languages of the World Etymological Database, which can be accessed at this site.

As always, Language Log has a great post by Sally Thomason that highlights many of the issues about the study here, including issues with both the data and methodology. Similarly, another post at GeoCurrents by Asya Pereltsvaig rubbishes the study.

Now, before you go and cry 'Academics marking territory!', there are very good reasons to take the study by Pagel et al. with a sea-ful of salt. But let me start with a short personal anecdote and a brief introduction to the world of historical linguistics. Also, if you're a believer in Nostratic, you should probably just ignore this post altogether.

Nagaland and the Yucatán Peninsula?

A few years ago, a friend of mine from Nagaland in North-East India saw Mel Gibson's Apocalypto and was astounded that her language and Mayan (technically, Yucatec Maya) shared a number of words in common. She thought the two languages might be related and asked me about it. I told her this was highly unlikely given (a) the geographic distance between the two and (b) the lack of any recent contact between the people of Nagaland and the Mayans. Of course, I could tell she was still sceptical of my response even some time after.

Now my dismissal of her theory wasn't just because I found the geographic distance and lack of recent contact problematic (or the fact that she was basing her observations on translations given in the subtitles). It was the fact that given the geographic distance and the lack of recent contact, the words she cited were just too similar in both pronunciation and meaning. Such similarity between cognates, that is, words in related languages that are descended from the same etymological source (and not through borrowing), is actually highly unlikely. Such words rarely keep both their original form and meaning as time goes by, and the languages they belong to drift apart. As an example, let's look at the Italian word for 'dog': cane (pronounced /ka.ne/, like 'car-nay' with a [k] sound at the start). The French equivalent is chien (pronounced /ʃjɛ̃/ with a sound usually written in English as sh). Despite both words deriving from Latin canis, the modern equivalents in Italian and French sound quite different.

Historical Linguistics 101

(Image by Koryakov Yuri, taken from Wikimedia Commons)

To address this problem of sound change, most historical linguists apply what is known as the Comparative Method. The idea is to look for sound correspondences across a number of words in two languages, and not just individual words in each language that sound identical and mean the same thing. Applying this method reveals that the /ʃ/ 'sh' sound in French (written as ch) regularly corresponds to a 'k' sound in Italian (written as c): compare French chanter with Italian cantare 'to sing', French bouche with Italian bocca 'mouth'. It is these regular sound correspondences that form the basis for genetic groupings of languages, not similarities in the actual forms of the words themselves. Historical linguists will then use these sound correspondences to attempt to reconstruct a 'proto-language' from the forms in the modern languages. Such proto-languages are always theoretical - even 'proto-Romance', a proto-language reconstructed based on modern Romance languages like Spanish, Sardinian and Romanian, is not identical to Vulgar Latin, which had many varieties spoken across the Roman Empire.
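To make the idea of a 'regular' correspondence a little more concrete, here is a minimal Python sketch. It only tallies how often a French sound lines up with an Italian sound across the word pairs mentioned in this post. The alignments are done by hand and the transcriptions are deliberately broad (the double consonant in bocca is treated as a plain 'k'), so the data structure, the function name and the threshold are all assumptions made for illustration, not a real implementation of the Comparative Method.

```python
from collections import Counter

# Hand-aligned sound pairs (French, Italian) for word pairs cited in this post.
# The alignments and broad transcriptions are illustrative assumptions;
# the geminate /kk/ in Italian "bocca" is simplified to "k".
COGNATE_PAIRS = [
    ("chanter", "cantare", [("ʃ", "k")]),
    ("bouche", "bocca", [("b", "b"), ("ʃ", "k")]),
    ("chien", "cane", [("ʃ", "k")]),
]


def regular_correspondences(pairs, min_words=2):
    """Return the sound pairs that recur in at least `min_words` word pairs."""
    counts = Counter()
    for _french, _italian, aligned in pairs:
        # Count each correspondence once per word pair, so a single word
        # cannot make a pattern look regular on its own.
        counts.update(set(aligned))
    return {pair: n for pair, n in counts.items() if n >= min_words}


if __name__ == "__main__":
    for (fr, it), n in regular_correspondences(COGNATE_PAIRS).items():
        print(f"French /{fr}/ ~ Italian /{it}/ in {n} word pairs")
```

The point of the threshold is that a single look-alike pair counts for nothing; it is the recurrence of the same pairing across many words that a historical linguist treats as evidence. Real work would of course need far more than three word pairs, and would also have to account for the phonetic environments in which the correspondence holds.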

However, even before historical linguists can begin to establish sound correspondences, they first need to identify cognates in various languages. This process of identification is complicated by the fact that words don't just change in pronunciation, they also change in meaning. For example, English dog and Swedish hund /hɵnd/ 'dog' sound nothing alike, even though they share the same meaning. On the other hand, English hound /haʊnd/ and Swedish hund share many similarities in pronunciation, with similar consonants both at the start and end of each word. However, Swedish hund refers to any kind of dog, while English hound refers to only a specific type of dog. Which word in English would we say is cognate with Swedish hund then? Given the similarity in pronunciation and the somewhat related meaning, hound is the more likely answer.

Now this may not look like a huge semantic leap that could cause much confusion, but a combination of both sound drift and semantic drift can make it difficult to locate cognates. Take for instance, the Swedish word for 'animal', pronounced /jʉːr/, almost like English you're. Based on this spoken form, can you think of a word in English that might be cognate with this?

Unless you know something about proto-Germanic linguistics, I'm guessing that you probably weren't able to work out that the Swedish word for 'animal', written as djur, is actually cognate with English deer. (Yes, the spelling might have helped, but imagine you're working with languages that have no written records.) The word deer in English does not refer to animals in general, but to a specific kind of animal, somewhat analogous to English hound. Speakers of German may have seen the connection, since German Tier means 'animal (in general)' and still sounds similar to English deer. However, the point here is that as languages diverge more over time, the task of identifying cognates between them gets increasingly difficult.

Certain types of sound and semantic change are quite common, and follow well-established patterns. For example, in a number of languages, the word for 'five' is historically derived from the word for 'hand': compare Malay lima 'five' with Hawaiian lima 'hand' (see here for more words for 'hand' in Austronesian languages). However, the rules governing such changes are not necessarily predictive, and at best can only give a probability that a word developed from a particular source. This is when historical linguists can get rather creative in deciding whether two words are cognates or not - disagreements over what words should be used as cognates can lead to rather different reconstructions of what is supposed to be the same hypothetical proto-language.

Swooning over Swadesh lists

To help identify cognates, many linguists start by comparing items from Swadesh lists in various languages. The list was first developed by Morris Swadesh in the 1940s and 50s and contains words that are viewed as belonging to the 'core vocabulary' of all languages, as opposed to culturally-specific vocabulary. Depending on the version of the list, there may be 100 or sometimes up to more than 200 items on the list. The items include nouns referring to body parts like 'heart' and 'tooth', personal pronouns like 'I' and 'we', kinship terms like 'father' and 'mother', some verbs of motion, the numerals 1-5, etc. It was originally assumed that such 'core vocabulary' was more stable over time and underwent replacement by other words in the language at a slow but constant rate, analogous to the process of radioactive decay. Furthermore, there was the implicit belief that words for such 'basic' concepts were not likely to be borrowed from other languages.

Based on such assumptions, Swadesh applied a method called glottochronology to these word lists, which then allowed him to propose dates for when various languages / language families split from each other. Today, this method has been largely discredited, mainly for its flawed assumption that word replacement happens at a steady rate across languages and across all words in a language - although there do remain proponents of this type of research. Furthermore, 'core vocabulary' is not always resistant to replacement by borrowed words. One notable example of this is the adoption of the Chinese numeral system in the genetically unrelated Japanese, Thai and Vietnamese languages.
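For the curious, the 'radioactive decay' assumption behind glottochronology boils down to one small formula. If each language independently keeps a fixed fraction r of its core vocabulary per thousand years, then two languages that split t millennia ago should still share roughly r^(2t) of the list, which gives t = ln(c) / (2 ln(r)) for a shared fraction c. The Python sketch below just computes that. The function name is made up for this post, and the default retention rate of 0.86 is a figure commonly quoted for the 100-item list, treated here as an assumption rather than an established constant.

```python
import math


def divergence_time(shared_fraction, retention_rate=0.86):
    """Estimate millennia since two languages split, glottochronology-style.

    shared_fraction: proportion of list items judged cognate (0 < c <= 1).
    retention_rate:  assumed fraction of the list one language keeps per
                     millennium (0.86 is a commonly quoted value for the
                     100-item list, used here as an assumption).
    """
    if not 0 < shared_fraction <= 1:
        raise ValueError("shared_fraction must be in (0, 1]")
    if not 0 < retention_rate < 1:
        raise ValueError("retention_rate must be in (0, 1)")
    # Both languages lose vocabulary independently, hence the factor of 2.
    return math.log(shared_fraction) / (2 * math.log(retention_rate))


if __name__ == "__main__":
    # Two languages judged to share 70% of the list.
    print(round(divergence_time(0.70), 2), "millennia")
```

Seeing it written out makes the weakness obvious: the whole estimate hangs on a single constant rate that is supposed to hold for every word in every language, and on the cognate judgements that produce the shared fraction in the first place.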

Despite all these limitations, many field linguists and historical linguists see the Swadesh list as a useful starting point, myself included. But any decent fieldworker or historical linguist would also know that you need to move beyond a Swadesh list consisting of some 200 items (at the maximum) if you want to get any real insight into a language and its past. One needs to go beyond studying the etymology of only 'core vocabulary' and look at other areas like morphology (e.g. prefixes and suffixes), syntax, as well as sociolinguistic variation. Some linguists would also argue for the need to look at vocabulary associated with agriculture and material culture, words that the Swadesh list deliberately omits. In a sense, Swadesh lists are the 'standardised testing' of historical linguistics, designed to make quick and 'consistent' comparisons by omitting large amounts of information and disregarding any subtle nuances in the data. A study that uses data drawn solely from Swadesh lists is inevitably going to be woefully inadequate, just like education policies based entirely on the results of standardised testing.

Words frozen in time?

Coming back to Pagel et al.'s work, which I now have the overwhelming desire to call the 'Ice Age language study', I hope you can start to see some of the problems with their methodology. Now I'm certainly not saying that their methodology is as basic as my friend's casual linguistic comparison of what are essentially false cognates (pairs of words with similar pronunciation and meaning but very different sources) in her language and Yucatec Maya.

Nevertheless there are issues with their study, as listed here:

(1) They only use Swadesh list data.
(2) There are a number of inaccuracies in the data used to reconstruct certain proto-words, as noted by Thomason.
(3) They apply the Comparative Method to reconstructed proto-words, which are themselves hypothetical and disputable, to reconstruct even older proto-words. (Note: this is acceptable, but only if your first reconstructions are solid.)
(4) There are some questionable judgements about which words to treat as cognates, although this is always going to be a subject of debate in any historical linguistic research. Some linguists simply err on the side of caution, while others are more liberal in their judgements.

It should be obvious by now that this is not an exact science - you can apply all the statistics you want, but if the initial data is based on somewhat subjective judgements, the results of the statistical analysis are not going to be very convincing. To their credit though, they try to show that the rate of word replacement can be correlated with frequency of use, and provide a more empirically-based study than what Swadesh did, even if this study is based on just 200 items on the Swadesh list.

Personally, I find questions about the origins of language families fascinating because they are intimately linked to human migration in prehistoric times, and going back deep enough, to our origins as a species. Judging by the amount of media coverage, this also seems to be an issue that media outlets believe people are interested in reading about. All that I've said doesn't mean that I don't believe that a super 'Eurasiatic' / 'Ice Age' language could have ever existed - I'm certainly in no position to say if one did or did not. I just don't think the evidence provided is compelling enough to suggest that one did. And given the time depth we are talking about, it's doubtful that we'll be able to recognise true cognates using the Comparative Method.

I don't think linguistics by itself will be able to give any satisfying conclusions about our origins, or about prehistoric human migration. But this doesn't mean that we should abandon the collection of linguistic data altogether. Comparative work like this calls for a lot more subtle attention to detail than lists of 200 words. Linguists such as Roger Blench and George van Driem have also increasingly started to collaborate with anthropologists, archaeologists and geneticists to try and corroborate findings from each field in order to provide a better picture of our prehistoric movements. More sophisticated statistical, genetic and geography-based computer models are also being developed and some are being applied to linguistic data. With any luck, some of these will bring promising results in the future.


  1. So are the similarities in sound and meaning between words in Yukatek Maya and your Nagaland friend's native language a complete coincidence? Or are you saying that they are much more recent borrowings that don't indicate a historical, genetic relationship between the languages?

    1. Yes, the similarities in sound and meaning are completely coincidental. A very common example linguists like to cite is Mbabaram, a language that used to be spoken in Queensland, where the word for 'dog' is 'dog'. It's not a word borrowed from English, but has cognates in other related languages, e.g. Yidiny 'gudaga', which show that it derived from another source and just coincidentally ended up looking like the English word.

      They're examples of 'false cognates', which are sometimes interpreted as (a) some evidence that two languages are related or (b) evidence that one language borrowed the word from the other, e.g. the widespread (but incorrect) belief that the Japanese didn't have a word for 'thank you' until the Portuguese came - owing to the similarity in form between Japanese 'arigato' and Portuguese 'obrigado'.

  2. Good points about false cognates!
    I have also criticized that study of Pagel et al: