Saturday, May 25, 2013

Fun with tone sandhi

The past few months, I've been learning a language here in Singapore that's been noted for its crazy mind-bending use of tone sandhi. I thought I'd write a little about it in this post, since it's a phenomenon that some linguists may not be familiar with (given the tendency for many to run away at the first 'hearing' of anything tonal). At the end of this post, I'm also going to throw in a little puzzle set that I created, just to give people a chance to see the sorts of data some linguists work with. I'm hoping it'll appeal to all the puzzle solvers out there.


Tone sandhi in Mandarin Chinese
Experienced learners of Mandarin will already be familiar with the phenomenon, exemplified by the initially confusing and dreaded rule that specifies that Tone 3 becomes Tone 2 before another Tone 3. This prevents you from saying two Tone 3s, one after the other. For example, the word for 'you' in Mandarin is 你 nǐ (with Tone 3) when said on its own and the word for '(to be) good' is 好 hǎo (also Tone 3). However, when you put them together to get the ubiquitous Mandarin greeting 你好, written as nǐ hǎo in Pinyin, you find that 你 is now pronounced with Tone 2. (This makes it homophonous with 泥 'mud', but most speakers can work out from context that you're not talking about the quality of earth.)

Importantly, the rule applies whenever two Tone 3s occur next to each other in the same phrase, regardless of the actual meaning of the words. Using another example, 很 hěn, an intensifier with the meaning of 'very', remains as Tone 3 in phrases like 很多 hěn duō 'a lot' and 很快 hěn kuài 'very fast', since 多 duō has Tone 1 and 快 kuài has Tone 4. But if you want to say 很好 hěn hǎo 'very good', you would have to pronounce 很 as hén, with Tone 2.
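For anyone who likes to see a rule written out mechanically, here is a minimal sketch of the Tone 3 rule in Python. The function name and the representation of tones as the numbers 1-4 are my own assumptions for illustration. It only looks at each pair of neighbouring syllables within a single phrase, so it should not be trusted for longer runs of Tone 3s, where the actual pattern depends on how the phrase is grouped.

```python
def apply_third_tone_sandhi(tones):
    """Return surface tones for one phrase, given underlying tones (1-4).

    Simplifying assumption: a Tone 3 becomes Tone 2 whenever the next
    syllable is underlyingly Tone 3. Phrase grouping is ignored.
    """
    surface = list(tones)
    for i in range(len(tones) - 1):
        # Check the underlying tones, then change only the first of the pair.
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface


print(apply_third_tone_sandhi([3, 3]))  # 你好 and 很好: first syllable changes
print(apply_third_tone_sandhi([3, 1]))  # 很多: no change
print(apply_third_tone_sandhi([3, 4]))  # 很快: no change
```

Note that the function never looks at what the words mean, only at the tones next to each other, which is exactly the point made above.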

Ask a native speaker of Mandarin why on God's less-than-green earth they would say 你好 or 很好 this way, and they'll probably just say that 'it sounds nicer'. There's also actually no physiological, or aesthetic, reason preventing you from producing two Tone 3s in a row. The thing is, tone sandhi rules are language-specific: some tone languages do allow sequences of similarly low (and creaky) tones to occur next to each other, while others may disallow sequences of two falling tones, which Mandarin does allow.

Of course, if you're only interested in learning a tone language that does have tone sandhi, it doesn't really help to ask why it happens, or for instance, why Tone 3 becomes Tone 2 and not Tone 4. You just need to accept that it does happen and that it happens the way it does. And then you need to learn how to apply the tone sandhi rules in actual speech so you don't sound completely moronic.


Tone sandhi vs Tone change
On the other hand, if you're in the business of describing tonal languages, tone sandhi is something that pops up again and again. It can sometimes be a little tricky to talk about, since there's still some disagreement as to what the term 'tone sandhi', sometimes called 变调 biàndiào in Mandarin, should include. At least, it is generally accepted that 'tone sandhi' differs from 'tone change', or 变音 biànyīn, which describes similar kinds of tone alternations that are restricted to specific words, largely due to historical reasons. For example, 好 when pronounced hào with Tone 4, means 'to be fond of' (example taken from Chen 2000: 31) - here you can see the connection with 好 hǎo '(to be) good', which indicates a likeable quality. However, this correspondence between Tone 3 and Tone 4 is specific to 好, and changing Tone 3 on another word to Tone 4 is not likely to yield a similar change in meaning.

In contrast, tone sandhi rules, which can also be the products of historical changes in a language, are more 'general', in the sense that they almost always apply regardless of the meaning of words as long as the necessary sound environment condition is present. However, there are instances when tone sandhi rules are not strictly observed - even native Mandarin speakers may sometimes fail to observe the rule described above when confronted with new compound words consisting of Tone 3 + Tone 3.


A tone sandhi puzzle
In the process of learning this tonal language in Singapore, which I'm calling 'Language X' for the moment, I came up with a little puzzle involving tone sandhi. It's similar to the problem sets we give out to undergraduate linguistics students, except I've simplified it a little so you don't need a lot of linguistic knowledge to solve it. I've used the letters A-G to indicate the tones, as well as some symbols known as Chao tone letters which give a visual representation of the tones. The 'stopped' tones refer to tones on words that end in the consonants k and h.

You can view a draft of the puzzle below. Now this may not be the easiest puzzle to cut your linguistics teeth on, but I hope it gives you a taste of the sorts of data linguists work with, and the kind of analytic skills required to describe languages.

(Right click the image below and select 'Open Image in New Tab'.
Or click here for an image you can magnify.) 


The solution will come in mid-June!

[I may have to post less frequently than I already do this coming month because I'm busy revising my Masters thesis to get it published.]


Reference
Chen, Matthew Y. 2000. Tone sandhi: Patterns across Chinese dialects. Cambridge: Cambridge University Press.

Tuesday, May 14, 2013

Issues with Ice Age linguistics

Last week I had a few friends ask me about a recently published study titled "Ultraconserved words point to deep language ancestry across Eurasia" by Mark Pagel, Quentin D. Atkinson, Andreea S. Calude and Andrew Meade. It's been making headlines all over the globe in articles with titles like "English May Have Retained Words From an Ice Age Language" (Wired.com), "Ice Age language may share words with modern tongues" (News.com.au and various sites) and "15000-year-old 'fossil' words reveal ancestral Ice Age language" (LA Times).

You can download their report here. Also, the data for the study comes from the Languages of the World Etymological Database, which can be accessed at this site.

As always, Language Log has a great post by Sally Thomason that highlights many of the issues about the study here, including issues with both the data and methodology. Similarly, another post at GeoCurrents by Asya Pereltsvaig rubbishes the study.

Now, before you go and cry 'Academics marking territory!', there are very good reasons to take the study by Pagel et al. with a sea-ful of salt. But let me start with a short personal anecdote and a brief introduction to the world of historical linguistics. Also, if you're a believer in Nostratic, you should probably just ignore this post altogether.


Nagaland and the Yucatec Peninsula?

A few years ago, a friend of mine from Nagaland in North-East India saw Mel Gibson's Apocalypto and was astounded that her language and Mayan (technically, Yucatec Maya) shared a number of words in common. She thought the two languages might be related and asked me about it. I told her this was highly unlikely given (a) the geographic distance between the two and (b) the lack of any recent contact between the people of Nagaland and the Mayans. Of course, I could tell she was still sceptical of my response even some time after.

Now my dismissal of her theory wasn't just because I found the geographic distance and lack of recent contact problematic (or the fact that she was basing her observations on translations given in the subtitles). It was the fact that given the geographic distance and the lack of recent contact, the words she cited were just too similar in both pronunciation and meaning. Such similarity between cognates, that is, words in related languages that are descended from the same etymological source (and not through borrowing), is actually highly unlikely. Such words rarely keep both their original form and meaning as time goes by, and the languages they belong to drift apart. As an example, let's look at the Italian word for 'dog': cane (pronounced /ka.ne/, like 'car-nay' with a [k] sound at the start). The French equivalent is chien (pronounced /ʃjɛ̃/ with a sound usually written in English as sh). Despite both words deriving from Latin canis, the modern equivalents in Italian and French sound quite different.


Historical Linguistics 101




(Image by Koryakov Yuri, taken from Wikimedia Commons)

To address this problem of sound change, most historical linguists apply what is known as the Comparative Method. The idea is to look for sound correspondences across a number of words in two languages, and not just individual words in each language that sound identical and mean the same thing. Applying this method reveals that the /ʃ/ 'sh' sound in French (written as ch) regularly corresponds to a 'k' sound in Italian (written as c): compare French chanter with Italian cantare 'to sing', French bouche with Italian bocca 'mouth'. It is these regular sound correspondences that form the basis for genetic groupings of languages, not similarities in the actual forms of the words themselves. Historical linguists will then use these sound correspondences to attempt to reconstruct a 'proto-language' from the forms in the modern languages. Such proto-languages are always theoretical - even 'proto-Romance', a proto-language reconstructed based on modern Romance languages like Spanish, Sardinian and Romanian, is not identical to Vulgar Latin, which had many varieties spoken across the Roman Empire.
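To make the idea of 'counting correspondences rather than spotting lookalikes' a bit more concrete, here is a toy sketch in Python using only the French and Italian words mentioned in this post. The initial sounds are entered by hand and the function name is made up for illustration. Real comparative work aligns every sound in hundreds of candidate cognates and looks at the environment each sound occurs in, so treat this as a cartoon of the first step only.

```python
from collections import Counter

# (French word, Italian word, French initial sound, Italian initial sound)
# The initial sounds are supplied by hand for this illustration.
pairs = [
    ("chien", "cane", "ʃ", "k"),
    ("chanter", "cantare", "ʃ", "k"),
    ("bouche", "bocca", "b", "b"),
]


def count_initial_correspondences(word_pairs):
    """Tally how often each pairing of initial sounds recurs."""
    return Counter((fr, it) for _, _, fr, it in word_pairs)


counts = count_initial_correspondences(pairs)
for (fr, it), n in counts.most_common():
    print(f"French {fr} ~ Italian {it}: {n} word pair(s)")
```

A pairing that keeps recurring across many words, like French ʃ with Italian k, is the kind of evidence that counts. A single pair of words that happen to sound alike is not.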

However, even before historical linguists can begin to establish sound correspondences, they first need to identify cognates in various languages. This process of identification is complicated by the fact that words don't just change in pronunciation, they also change in meaning. For example, English dog and Swedish hund /hɵnd/ 'dog' sound nothing alike, even though they share the same meaning. On the other hand, English hound /haʊnd/ and Swedish hund share many similarities in pronunciation, with similar consonants both at the start and end of each word. However, Swedish hund refers to any kind of dog, while English hound refers to only a specific breed of dog. Which word in English would we say is cognate with Swedish hund then? Given the similarity in pronunciation and the somewhat related meaning, hound is the more likely answer.

Now this may not look like a huge semantic leap that could cause much confusion, but a combination of both sound drift and semantic drift can make it difficult to locate cognates. Take for instance, the Swedish word for 'animal', pronounced /jʉːr/, almost like English you're. Based on this spoken form, can you think of a word in English that might be cognate with this?

Unless you know something about proto-Germanic linguistics, I'm guessing that you probably weren't able to work out that the Swedish word for 'animal', written as djur, is actually cognate with English deer. (Yes, the spelling might have helped, but imagine you're working with languages that have no written records.) The word deer in English does not refer to animals in general, but to a specific kind of animal, somewhat analogous to English hound. Speakers of German may have seen the connection, since German Tier means 'animal (in general)' and still sounds similar to English deer. However, the point here is that as languages diverge more over time, the task of identifying cognates between them gets increasingly difficult.

Certain types of sound and semantic change are quite common, and follow well-established patterns. For example, in a number of languages, the word for 'five' is historically derived from the word for 'hand': compare Malay lima 'five' with Hawaiian lima 'hand' (see here for more words for 'hand' in Austronesian languages). However, the rules governing such changes are not necessarily predictive, and at best can only give a probability that a word developed from a particular source. This is when historical linguists can get rather creative in deciding whether two words are cognates or not - disagreements over what words should be used as cognates can lead to rather different reconstructions of what is supposed to be the same hypothetical proto-language.


Swooning over Swadesh lists

To help identify cognates, many linguists start by comparing items from Swadesh lists in various languages. The list was first developed by Morris Swadesh in the 1940s and 50s and contains words that are viewed as belonging to the 'core vocabulary' of all languages, as opposed to culturally-specific vocabulary. Depending on the version of the list, there may be 100 or sometimes up to more than 200 items on the list. The items include nouns referring to body parts like 'heart' and 'tooth', personal pronouns like 'I' and 'we', kinship terms like 'father' and 'mother', some verbs of motion, the numerals 1-5, etc. It was originally assumed that such 'core vocabulary' was more stable over time and underwent replacement by other words in the language at a slow but constant rate, analogous to the process of radioactive decay. Furthermore, there was the implicit belief that words for such 'basic' concepts were not likely to be borrowed from other languages.

Based on such assumptions, Swadesh applied a method called glottochronology to these word lists, which then allowed him to propose dates for when various languages / language families split from each other. Today, this method has been largely discredited, mainly for its flawed assumption that word replacement happens at a steady rate across languages and across all words in a language - although there do remain proponents of this type of research. Furthermore, 'core vocabulary' is not always resistant to replacement by borrowed words. One notable example of this is the adoption of the Chinese numeral system in the genetically unrelated Japanese, Thai and Vietnamese languages.
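For the curious, the arithmetic behind glottochronology is just the radioactive decay analogy turned into a formula. The sketch below is only meant to show what the (largely discredited) method assumes, not to endorse it. The function name is my own, and the default retention rate of 0.86 per thousand years is an assumption based on the figure usually quoted for the 100-item list.

```python
import math


def divergence_time(shared_fraction, retention_rate=0.86):
    """Estimated millennia since two languages split, under the
    constant-rate assumption of classical glottochronology.

    shared_fraction: proportion of list items judged cognate (0 < c <= 1).
    retention_rate: assumed proportion of items a language keeps per
        millennium (0.86 is an assumed default, see above).
    """
    if not 0 < shared_fraction <= 1:
        raise ValueError("shared_fraction must be in (0, 1]")
    if not 0 < retention_rate < 1:
        raise ValueError("retention_rate must be in (0, 1)")
    # Both languages lose words independently, hence the factor of 2.
    return math.log(shared_fraction) / (2 * math.log(retention_rate))


# Two languages sharing about 74% of the list come out as
# roughly one millennium apart under these assumptions.
print(round(divergence_time(0.74), 2))
```

Everything hinges on that single fixed retention rate, which is precisely the assumption that doesn't hold up across languages or across words.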

Despite all these limitations, many field linguists and historical linguists see the Swadesh list as a useful starting point, myself included. But any decent fieldworker or historical linguist would also know that you need to move beyond a Swadesh list consisting of some 200 items (at the maximum) if you want to get any real insight into a language and its past. One needs to go beyond studying the etymology of only 'core vocabulary' and look at other areas like morphology (e.g. prefixes and suffixes), syntax, as well as sociolinguistic variation. Some linguists would also argue for the need to look at vocabulary associated with agriculture and material culture, words that the Swadesh list deliberately omits. In a sense, Swadesh lists are the 'standardised testing' of historical linguistics, designed to make quick and 'consistent' comparisons by omitting large amounts of information and disregarding any subtle nuances in the data. A study that uses data drawn solely from Swadesh lists is inevitably going to be woefully inadequate, just like education policies based entirely on the results of standardised testing.


Words frozen in time?

Coming back to Pagel et al's work, which I now have the overwhelming desire to call the 'Ice Age language study', I hope you can start to see some of the problems with their methodology. Now I'm certainly not saying that their methodology is as basic as my friend's casual linguistic comparison of what are essentially false cognates (pairs of words with similar pronunciation and meaning but very different sources) in her language and Yucatec Maya.

Nevertheless there are issues with their study, as listed here:

(1) They only use Swadesh list data.
(2) There are a number of inaccuracies in the data used to reconstruct certain proto-words, as noted by Thomason.
(3) They apply the Comparative Method to reconstructed proto-words, which are themselves hypothetical and disputable, to reconstruct even older proto-words. (Note: this is acceptable, but only if your first reconstructions are solid.)
(4) There are some questionable judgements about which words to treat as cognates, although this is always going to be a subject of debate in any historical linguistic research. Some linguists simply err on the side of caution, while others are more liberal in their judgements.

It should be obvious by now that this is not an exact science - you can apply all the statistics you want, but if the initial data is based on somewhat subjective judgements, the results of the statistical analysis are not going to be very convincing. To their credit though, they try to show that the rate of word replacement can be correlated with frequency of use, and provide a more empirically-based study than what Swadesh did, even if this study is based on just 200 items on the Swadesh list.

Personally, I find questions about the origins of language families fascinating because they are intimately linked to human migration in prehistoric times, and going back deep enough, to our origins as a species. Judging by the amount of media coverage, this also seems to be an issue that media outlets believe people are interested in reading about. All that I've said doesn't mean that I don't believe that a super 'Eurasiatic' / 'Ice Age' language could have ever existed - I'm certainly in no position to say if one did or did not. I just don't think the evidence provided is compelling enough to suggest that one did. And given the time depth we are talking about, it's doubtful that we'll be able to recognise true cognates using the Comparative Method.

I don't think linguistics by itself will be able to give any satisfying conclusions about our origins, or about prehistoric human migration. But this doesn't mean that we should abandon the collection of linguistic data altogether. Comparative work like this calls for a lot more subtle attention to detail than lists of 200 words. Linguists such as Roger Blench and George van Driem have also increasingly started to collaborate with anthropologists, archaeologists and geneticists to try and corroborate findings from each field in order to provide a better picture of our prehistoric movements. More sophisticated statistical, genetic and geography-based computer models are also being developed and some are being applied to linguistic data. With any luck, some of these will bring promising results in the future.

Saturday, May 4, 2013

What a 'hotel' can mean in India

According to the Online Etymology Dictionary, the English word hotel was first recorded in the 1640s and denoted a 'public official residence'. The modern sense of the word as 'an inn of the better sort' (i.e. 'a place offering lodging, food and other services to travellers') was first recorded in 1765. The word comes from the French hôtel, which itself is derived from the Medieval Latin hospitale via Old French hostel.

In French, hôtel was used to refer mainly to public official buildings that frequently received visitors, but this has been largely replaced by the meaning of 'place offering lodging and food to travellers', as used in contemporary English. However, you can still see traces of this old usage in terms like hôtel de ville 'town hall', hôtel des impôts 'tax office' and hôtel de police 'police headquarters'.

In India, the term hotel has taken on a slightly different meaning (and pronunciation, with stress on the first syllable, not the second). Visitors to India are likely to find that big modern buildings offering lodging are called 'hotels', but they might be slightly shocked to see signs for hotels that do not provide lodging at all.

Take for instance this hotel located right next to the Dimapur Railway Station. As you can see, the hotel only offers 'fooding', a very common term in Indian English meaning 'the provision of food' - this can include the catering at an event or simply selling food at a restaurant.

Next to Dimapur Railway Station

I'm not entirely certain how the term 'hotel' has come to be used to refer to (what I would call) a 'restaurant', where only food and no lodging is provided. I doubt that this use derives from the original French meaning of a public building that frequently receives visitors. Incidentally, there are also hotels in India that advertise 'only lodging' with no 'fooding'.

My guess is that the term did originally designate a place that was frequented by travellers and provided both food and lodging - I imagine that travellers were the most likely people to frequent places offering food since most people would have taken their meals at home or packed their own food. Over time, some establishments may have stopped providing one service or the other for whatever reason (e.g. greater profits from selling food), but the label 'hotel' remained. Consequently, the term 'hotel' no longer denoted a place of lodging, but simply a place frequented by travellers. Someone else starting a restaurant near a train station or along a highway may then choose to call their business a 'hotel', even though they have no intention of providing lodging, as long as their expected clientele are likely to be travellers stopping in for a meal.

Whatever the history of the word may be, don't be shocked if you rock up to a hotel in India and can't get a room - some of them simply don't have any for guests!

Wednesday, May 1, 2013

Taiwan indigenous languages on television

One of the things I was impressed with during my short stay in Taiwan was the Taiwan Indigenous Television (TITV) channel, which features programming for and by indigenous peoples of Taiwan, including news programmes, educational shows and variety shows.

Here's a screenshot of a programme that is in (what I assume to be) the Seediq language (sometimes still classified with Atayal).



And here's a screenshot of a news programme in what I assumed was the Amis language ('Pangcah' is the endonym for the group). Although I couldn't understand what they were saying, I did see that the story they were running was about Julia Gillard's March 21 apology to victims of forced adoptions in Australia. Her apology echoed Kevin Rudd's 2008 apology to the Stolen Generations, the victims of a policy which I believe has some resonance among the indigenous people of Taiwan, given their own experience of institutionalised racism.



Speaking of the Amis language, most people around the world would have actually heard bits of an Amis song without even realising it. Remember Enigma's 'Return to Innocence', which was used in ads promoting the 1996 Summer Olympics in Atlanta?


The 'chant' you hear right at the start, then throughout the song, was actually sampled from a recording of a traditional Amis song titled 'Elders' Drinking Song' (or 'Jubilant Drinking Song' or 'Weeding and Paddyfield Song No. 1'), as performed by Difang Duana and Igay Duana (also known as Kuo Ying-nan and Kuo Hsiu-chu) while they were in Paris on a cultural tour in 1988.

The Maison des Cultures du Monde in France recorded the husband and wife duo, along with 30 other visiting artistes, and created a compilation titled 'Polyphonies vocales des aborigènes de Taïwan'. However, they failed to properly credit the Duanas and their compatriots. Michael Cretu, the producer of Enigma, was later sued for not giving proper credit to the original singers. He stated that he had assumed the recordings belonged to the public domain and thus were not subject to intellectual property rights. The case was eventually settled out of court. [Click here for more details.]


Sadly, Difang and Igay Duana both passed away in 2002, but I found a video of them on YouTube, taken in 2001, singing a bit of the song in their garage. You can still purchase their 1998 album Circle of Life on iTunes, which features the famous song and was produced by Rock Records.


This issue of 'indigenous intellectual property rights' is a very tricky but important one for people involved in the documentation of language, indigenous art forms and scientific knowledge. For one thing, the 'owner' of such intellectual property is rarely a single individual or two but rather an entire 'community' - itself a problematic notion that does not always correspond to a stable cohesive unity. [Click here for Gawne and Kelly's presentation at this year's International Conference on Language Documentation and Conservation.] At the same time however, there is seen to be a need to safeguard such intellectual property (be they songs or botanical knowledge) from what could be called 'exploitation' by outsiders who offer no compensation to any members of that community (even if it is difficult to determine what one considers exactly to be 'exploitation').

Perhaps having a television channel that advocates for indigenous rights and which produces and airs programmes related to traditional culture is one way of documenting and showcasing traditional art forms to a wide audience without the threat of 'exploitation' - in fact, many of the educational programmes on TITV are aimed specifically at imparting traditional knowledge to children belonging to the relevant indigenous community.

Of course this is not a solution for all indigenous peoples, even within Taiwan, especially for smaller groups with insufficient resources and viewers. Furthermore, as we see people move away from more 'traditional' forms of media like radio and television to online media, the nature of the debate surrounding indigenous intellectual property rights will undoubtedly continue to change.