Consonant Aspirations: historical linguistics

Tuesday, May 14, 2013

Issues with Ice Age linguistics

Last week I had a few friends ask me about a recently published study titled "Ultraconserved words point to deep language ancestry across Eurasia" by Mark Pagel, Quentin D. Atkinson, Andreea S. Calude and Andrew Meade. It's been making headlines all over the globe in articles with titles like "English May Have Retained Words From an Ice Age Language" (Wired.com), "Ice Age language may share words with modern tongues" (News.com.au and various sites) and "15000-year-old 'fossil' words reveal ancestral Ice Age language" (LA Times).

You can download their report here. Also, the data for the study comes from the Languages of the World Etymological Database, which can be accessed at this site.

As always, Language Log has a great post by Sally Thomason that highlights many of the issues about the study here, including issues with both the data and methodology. Similarly, another post at GeoCurrents by Asya Pereltsvaig rubbishes the study.

Now, before you go and cry 'Academics marking territory!', there are very good reasons to take the study by Pagel et al. with a sea-ful of salt. But let me start with a short personal anecdote and brief introduction into the world of historical linguistics. Also, if you're a believer in Nostratic, you should probably just ignore this post altogether.

Nagaland and the Yucatán Peninsula?

A few years ago, a friend of mine from Nagaland in North-East India saw Mel Gibson's Apocalypto and was astounded that her language and Mayan (technically, Yucatec Maya) shared a number of words in common. She thought the two languages might be related and asked me about it. I told her this was highly unlikely given (a) the geographic distance between the two and (b) the lack of any recent contact between the people of Nagaland and the Mayans. Of course, I could tell she was still sceptical of my response even some time after.

Now my dismissal of her theory wasn't just because I found the geographic distance and lack of recent contact problematic (or the fact that she was basing her observations on translations given in the subtitles). It was the fact that given the geographic distance and the lack of recent contact, the words she cited were just too similar in both pronunciation and meaning. Such similarity between cognates, that is, words in related languages that are descended from the same etymological source (and not through borrowing), is actually highly unlikely. Such words rarely keep both their original form and meaning as time goes by, and the languages they belong to drift apart. As an example, let's look at the Italian word for 'dog': cane (pronounced /ka.ne/, like 'car-nay' with a [k] sound at the start). The French equivalent is chien (pronounced /ʃjɛ̃/ with a sound usually written in English as sh). Despite both words deriving from Latin canis, the modern equivalents in Italian and French sound quite different.

Historical Linguistics 101

(Image by Koryakov Yuri, taken from Wikimedia Commons)

To address this problem of sound change, most historical linguists apply what is known as the Comparative Method. The idea is to look for sound correspondences across a number of words in two languages, and not just individual words in each language that sound identical and mean the same thing. Applying this method reveals that the /ʃ/ 'sh' sound in French (written as ch) regularly corresponds to a 'k' sound in Italian (written as c): compare French chanter with Italian cantare 'to sing', French bouche with Italian bocca 'mouth'. It is these regular sound correspondences that form the basis for genetic groupings of languages, not similarities in the actual forms of the words themselves. Historical linguists will then use these sound correspondences to attempt to reconstruct a 'proto-language' from the forms in the modern languages. Such proto-languages are always theoretical - even 'proto-Romance', a proto-language reconstructed based on modern Romance languages like Spanish, Sardinian and Romanian, is not identical to Vulgar Latin, which had many varieties spoken across the Roman Empire.
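For the programmers among you, here is a toy sketch of what 'looking for recurring correspondences' means. It is only an illustration, not how historical linguists actually work: the segmentations and alignments below are my own simplified assumptions (the double 'kk' in bocca is flattened to a single 'k', for instance), and the names ALIGNED_PAIRS and count_correspondences are made up for this example.

```python
from collections import Counter

# Hand-aligned French / Italian segments for word pairs from this post.
# These alignments are simplified assumptions; "-" marks a gap.
ALIGNED_PAIRS = {
    "chien / cane": (
        ["ʃ", "jɛ̃", "-", "-"],
        ["k", "a", "n", "e"],
    ),
    "chanter / cantare": (
        ["ʃ", "ɑ̃", "-", "t", "e", "-", "-"],
        ["k", "a", "n", "t", "a", "r", "e"],
    ),
    "bouche / bocca": (
        ["b", "u", "ʃ", "-"],
        ["b", "o", "k", "a"],  # geminate 'kk' flattened to 'k' for simplicity
    ),
}


def count_correspondences(aligned_pairs):
    """Count how many word pairs attest each (French, Italian) segment pair."""
    counts = Counter()
    for french, italian in aligned_pairs.values():
        # A set ensures each correspondence is counted once per word pair.
        counts.update(set(zip(french, italian)))
    return counts


counts = count_correspondences(ALIGNED_PAIRS)

# Keep only correspondences seen in at least two word pairs, ignoring gaps.
recurring = {
    pair: n for pair, n in counts.items() if n >= 2 and "-" not in pair
}

for (fr, it), n in sorted(recurring.items(), key=lambda item: -item[1]):
    print(f"French {fr} : Italian {it} (in {n} word pairs)")
```

With these three word pairs, the only correspondence that recurs is French ʃ : Italian k. A match that turns up in a single word proves very little; it is the repetition across many words that counts as evidence.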

However, even before historical linguists can begin to establish sound correspondences, they first need to identify cognates in various languages. This process of identification is complicated by the fact that words don't just change in pronunciation, they also change in meaning. For example, English dog and Swedish hund /hɵnd/ 'dog' sound nothing alike, even though they share the same meaning. On the other hand, English hound /haʊnd/ and Swedish hund share many similarities in pronunciation, with similar consonants both at the start and end of each word. However, Swedish hund refers to any kind of dog, while English hound refers to only a specific type of dog. Which word in English would we say is cognate with Swedish hund then? Given the similarity in pronunciation and the somewhat related meaning, hound is the more likely answer.

Now this may not look like a huge semantic leap that could cause much confusion, but a combination of both sound drift and semantic drift can make it difficult to locate cognates. Take, for instance, the Swedish word for 'animal', pronounced /jʉːr/, almost like English you're. Based on this spoken form, can you think of a word in English that might be cognate with it?

Unless you know something about proto-Germanic linguistics, I'm guessing that you probably weren't able to work out that the Swedish word for 'animal', written as djur, is actually cognate with English deer. (Yes, the spelling might have helped, but imagine you're working with languages that have no written records.) The word deer in English does not refer to animals in general, but to a specific kind of animal, somewhat analogous to English hound. Speakers of German may have seen the connection, since German Tier means 'animal (in general)' and still sounds similar to English deer. However, the point here is that as languages diverge more over time, the task of identifying cognates between them gets increasingly difficult.

Certain types of sound and semantic change are quite common, and follow well-established patterns. For example, in a number of languages, the word for 'five' is historically derived from the word for 'hand': compare Malay lima 'five' with Hawaiian lima 'hand' (see here for more words for 'hand' in Austronesian languages). However, the rules governing such changes are not necessarily predictive, and at best can only give a probability that a word developed from a particular source. This is when historical linguists can get rather creative in deciding whether two words are cognates or not - disagreements over what words should be used as cognates can lead to rather different reconstructions of what is supposed to be the same hypothetical proto-language.

Swooning over Swadesh lists

To help identify cognates, many linguists start by comparing items from Swadesh lists in various languages. The list was first developed by Morris Swadesh in the 1940s and 50s and contains words that are viewed as belonging to the 'core vocabulary' of all languages, as opposed to culturally-specific vocabulary. Depending on the version of the list, there may be 100 or sometimes up to more than 200 items on the list. The items include nouns referring to body parts like 'heart' and 'tooth', personal pronouns like 'I' and 'we', kinship terms like 'father' and 'mother', some verbs of motion, the numerals 1-5, etc. It was originally assumed that such 'core vocabulary' was more stable over time and underwent replacement by other words in the language at a slow but constant rate, analogous to the process of radioactive decay. Furthermore, there was the implicit belief that words for such 'basic' concepts were not likely to be borrowed from other languages.

Based on such assumptions, Swadesh applied a method called glottochronology to these word lists, which then allowed him to propose dates for when various languages / language families split from each other. Today, this method has been largely discredited, mainly for its flawed assumption that word replacement happens at a steady rate across languages and across all words in a language - although there do remain proponents of this type of research. Furthermore, 'core vocabulary' is not always resistant to replacement by borrowed words. One notable example of this is the adoption of the Chinese numeral system in the genetically unrelated Japanese, Thai and Vietnamese languages.
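To make the radioactive decay analogy concrete, here is a minimal sketch of the textbook glottochronology calculation, t = ln(c) / (2 ln(r)), where c is the fraction of list items that two languages still share as cognates and r is the assumed fraction retained per thousand years. The function name is made up for this example, and the default of 0.86 is the figure commonly quoted for the 100-item list; treat it as an assumption, since the constant rate is exactly what critics dispute.

```python
import math


def divergence_time_millennia(shared_fraction, retention_rate=0.86):
    """Estimate time since two languages split, in thousands of years.

    Implements the textbook formula t = ln(c) / (2 * ln(r)).
    retention_rate=0.86 is a commonly quoted value for the 100-item
    Swadesh list; it is an assumption, not an established constant.
    """
    if not 0 < shared_fraction <= 1:
        raise ValueError("shared_fraction must be in (0, 1]")
    if not 0 < retention_rate < 1:
        raise ValueError("retention_rate must be in (0, 1)")
    # The factor of 2 is there because both languages are assumed to
    # lose vocabulary independently after the split.
    return math.log(shared_fraction) / (2 * math.log(retention_rate))


# Example: two languages that share half of the list as cognates.
print(f"{divergence_time_millennia(0.5):.1f} thousand years")
```

Under these assumptions, two languages sharing half of the list come out at roughly 2.3 thousand years apart. Change the retention rate even slightly and the date shifts by centuries, which is part of why the method is treated with such suspicion.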

Despite all these limitations, many field linguists and historical linguists see the Swadesh list as a useful starting point, myself included. But any decent fieldworker or historical linguist would also know that you need to move beyond a Swadesh list consisting of some 200 items (at the maximum) if you want to get any real insight into a language and its past. One needs to go beyond studying the etymology of only 'core vocabulary' and look at other areas like morphology (e.g. prefixes and suffixes), syntax, as well as sociolinguistic variation. Some linguists would also argue for the need to look at vocabulary associated with agriculture and material culture, words that the Swadesh list deliberately omits. In a sense, Swadesh lists are the 'standardised testing' of historical linguistics, designed to make quick and 'consistent' comparisons by omitting large amounts of information and disregarding any subtle nuances in the data. A study that uses data drawn solely from Swadesh lists is inevitably going to be woefully inadequate, just like education policies based entirely on the results of standardised testing.

Words frozen in time?

Coming back to Pagel et al.'s work, which I now have the overwhelming desire to call the 'Ice Age language study', I hope you can start to see some of the problems with their methodology. Now I'm certainly not saying that their methodology is as basic as my friend's casual linguistic comparison of what are essentially false cognates (pairs of words with similar pronunciation and meaning but very different sources) in her language and Yucatec Maya.

Nevertheless there are issues with their study, as listed here:

(1) They only use Swadesh list data.
(2) There are a number of inaccuracies in the data used to reconstruct certain proto-words, as noted by Thomason.
(3) They apply the Comparative Method to reconstructed proto-words, which are themselves hypothetical and disputable, to reconstruct even older proto-words. (Note: this is acceptable, but only if your first reconstructions are solid.)
(4) There are some questionable judgements about which words to treat as cognates, although this is always going to be a subject of debate in any historical linguistic research. Some linguists simply err on the side of caution, while others are more liberal in their judgements.

It should be obvious by now that this is not an exact science - you can apply all the statistics you want, but if the initial data is based on somewhat subjective judgements, the results of the statistical analysis are not going to be very convincing. To their credit though, they try to show that the rate of word replacement can be correlated with frequency of use, and provide a more empirically-based study than what Swadesh did, even if this study is based on just 200 items on the Swadesh list.

Personally, I find questions about the origins of language families fascinating because they are intimately linked to human migration in prehistoric times, and going back deep enough, to our origins as a species. Judging by the amount of media coverage, this also seems to be an issue that media outlets believe people are interested in reading about. All that I've said doesn't mean that I don't believe that a super 'Eurasiatic' / 'Ice Age' language could have ever existed - I'm certainly in no position to say if one did or did not. I just don't think the evidence provided is compelling enough to suggest that one did. And given the time depth we are talking about, it's doubtful that we'll be able to recognise true cognates using the Comparative Method.

I don't think linguistics by itself will be able to give any satisfying conclusions about our origins, or about prehistoric human migration. But this doesn't mean that we should abandon the collection of linguistic data altogether. Comparative work like this calls for a lot more subtle attention to detail than lists of 200 words. Linguists such as Roger Blench and George van Driem have also increasingly started to collaborate with anthropologists, archaeologists and geneticists to try and corroborate findings from each field in order to provide a better picture of our prehistoric movements. More sophisticated statistical, genetic and geography-based computer models are also being developed, and some are being applied to linguistic data. With any luck, some of these will bring promising results in the future.

Tuesday, November 1, 2011

Tea vs Chai, the Tekka Centre and my last name (II)

In yesterday's post I talked about the correspondence between Hokkien 't' and Mandarin 'zh' (a retroflex sound produced with the tongue slightly further back than the sound represented by 'ch' in 'chunk' and without the puff of air). Both sounds are descended from an earlier 'tr' cluster in Early Middle Chinese, as reconstructed by historical linguists.

What does this have to do with the word for 'tea'?

People who know Hindi may laugh surreptitiously when they hear someone order a 'chai tea': चाय chay means 'tea' in Hindi, so the order is basically for a 'tea tea'. In English though, 'chai tea' is perfectly acceptable because the word 'chai' has been borrowed to designate what one would call मसाला चाय masaalaa chai 'spiced tea' in India.

The Hindi word for tea, चाय chay, is much closer to the Mandarin cha (the 'ch' sound here is pronounced like the retroflex 'zh'; the only difference is that it is accompanied by a puff of air). Other Indo-Aryan languages like Nepali have चिया chiyā. Within Indo-European, we also have Russian чай chay. The Japanese also use cha. In contrast, English has tea, French thé and Malay teh. Hebrew too uses תה te (I was taught that תה נענע te nana is '(spear)mint tea' in Hebrew). These languages all have a word for 'tea' that's closer to the Hokkien / Minnan word te (tone not given).

The reason for this difference is that languages like English borrowed (whether directly or indirectly) the word from one of the Minnan dialects / languages, while languages like Russian and Hindi borrowed the word from other Chinese languages like Mandarin or Cantonese. The Wikipedia article explains this in greater detail and gives more examples from other languages.

Etymologically though, Mandarin cha and Hokkien te share the same origin. Pulleyblank (1991) gives the reconstructed forms draɨ / drɛ (Early Middle Chinese) and trɦa: (Late Middle Chinese). Again, we see the correspondence between the Mandarin retroflex sounds (written in pinyin as 'zh' and 'ch') and Minnan 't'.

So voilà, it took me two posts to do it, but there you have it - the common thread linking my last name, the name of the Tekka Centre and the name of one of the most consumed beverages on the planet.

[This post was inspired by 3 separate conversations I've had in the last month about each of these topics. Tomorrow I'm off to the great tea-growing state of Assam in NE India. The word in Assamese চাহ (transliterated as chah) is clearly related to the non-Minnan form of the word, but is now pronounced 'sah' in Assamese. Something for me to get used to saying again!]

Monday, October 31, 2011

Tea vs Chai, the Tekka Centre and my last name (I)

This post is about three things: (1) the name of a very popular beverage that the vast majority of readers would be familiar with; (2) the name of a building complex near Little India that most Singaporeans would be familiar with; and (3) my family name, which only my friends would be familiar with (but which is actually a pretty common Chinese name around the world).

And yes, there's a linguistic point to all of this.

Let's start with what's least familiar: my last name, which happens to be Teo (I pronounce it as [tʰjo]). It is a Hokkien / Minnan name that has its origins in southern China. While it may not look familiar to most people outside SE Asia, it's actually etymologically related to one of the most common Chinese surnames around the world. The Chinese character used to write it is 张 (simplified) or 張 (traditional). The standard Mandarin equivalent is transliterated as Zhang in pinyin and pronounced as /tʂaŋ/ (tone not given) - it's like saying 'chunk', but (a) you don't have a final 'k' sound and (b) when you pronounce the 'ch' sound, your tongue curls back a bit (this is what is called a 'retroflex' sound) and you shouldn't have a strong puff of air. The Cantonese equivalents, I believe, are transliterated as Cheung, Cheong or Chong, depending on the transliteration system.

Most of you will probably have started to recognise these names and probably even know people with one of these names. But you've also probably noticed that while the Mandarin and Cantonese forms look quite similar, the Minnan name Teo doesn't look (or sound) anything like the others. So how is it related?

Before I get to that point, let's look at the name of a famous building complex located in Little India, Singapore: the Tekka Centre. (I was just there a week ago with a friend from Australia.)

The Wikipedia article gives the original (Hokkien) name of the market as Tek Kia Kha, meaning 'foot of the small bamboos' which was eventually shortened to Tekka. For those who can read Chinese, you'll notice on the right the Chinese characters 竹 'bamboo' and 脚 'leg / foot'. The standard Mandarin reading of 竹 is zhu in pinyin and pronounced /tʂu/, while in Hokkien 竹 is transliterated as tek and pronounced something like /tɛk/.

Now I remember going on a school trip to Little India in the 1990s and being utterly confused because the centre had been renamed the 'Zhujiao Centre' to match the Mandarin reading of 竹脚. In reality, almost everyone still referred to it as the 'Tekka Market'. The building has since been renamed the 'Tekka Centre' to avoid confusion (which Wikipedia tells me happened in 2000).

The point is that, at the time, I could see no resemblance between Mandarin zhu and Hokkien tek. Since then, I've learnt a lot more about historical sound changes, and noticed other examples of Mandarin 'zh' (a retroflex sound) corresponding to Hokkien 't', like with my last name. Simply put, they are both said to have descended from a sequence of 't' and 'r' early in the history of Chinese. In the Minnan languages / dialects, including Hokkien, the 'r' sound was lost, while in other varieties, including standard Mandarin, the combination of 'tr' became a retroflex sound, as represented in pinyin by the letters zh. Pulleyblank (1991) reconstructs the pronunciation of 竹 as truwk in Early Middle Chinese and triwk in Late Middle Chinese. Guillaume Jacques here also gives 'tr' as an initial in Early Middle Chinese, with the pronunciation of 张 reconstructed as 'trjang'. We still see the 'tr' combination in the Vietnamese surname Trương / Truong. (Vietnamese is not Sinitic, but it was heavily influenced by Chinese for centuries.)

Of course, this only explains how the first sounds in Teo and tek in Hokkien correspond to Zhang and zhu in Mandarin. To explain the rest would require more than a humble blog post.

So what does this have to do with all the tea in China (and all the chai in India)? Check out tomorrow's post.