love, play & inquiry (trochee) wrote:

order word constant information does have?

I’ve been reading — in my “copious free time”, as they say:

Montemurro MA, Zanette DH, 2011 “Universal Entropy of Word Ordering Across Linguistic Families.” PLoS ONE 6(5): e19875. doi:10.1371/journal.pone.0019875

This article takes on some of the really nifty ideas of information theory, and asks one deceptively simple question:

how much information does word order contribute to human language?

It turns out that this question is not entirely straightforward to answer: languages vary tremendously in the information carried per word. Agglutinative languages like Turkish or Inuktitut, for example, are capable of encoding a tremendous number of separate units of meaning into a single word; inflecting languages (Latin, Hindi, Greek) encode information about how words relate to each other in a sentence right on the words themselves; relatively isolating languages (English, and to a greater degree, Mandarin Chinese) tend to have fewer variants for a given word, and count on additional words and word order to encode those relationships.

So one might expect, for example, that word order bears more information in a language like English or Mandarin, but less information in a highly inflected or agglutinative language like Latin or Turkish. But the quick summary of this study is that word order bears a roughly constant information rate, even across these very large language differences, with their very large differences in the information content of individual words. Another way of saying this:

regardless of a language’s idea of how much information makes up a word, the surprisal of the next word is roughly constant.
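To make "how much information does word order carry" a little more concrete, here is a minimal sketch of one way to poke at it: compare a text with shuffled copies of itself, which keep the same words at the same frequencies but throw the order away. I'm using zlib's compressed size as a crude stand-in for entropy; that choice, and the function names, are my own assumptions for illustration and are not taken from the paper.

```python
import random
import zlib


def compressed_bits(words):
    """Crude entropy proxy (an assumption): bits in the zlib-compressed word sequence."""
    data = " ".join(words).encode("utf-8")
    return 8 * len(zlib.compress(data, 9))


def word_order_bits_per_word(text, trials=5, seed=0):
    """Rough estimate of the bits per word contributed by word order.

    Shuffling keeps every word and its frequency but destroys the order,
    so the extra compressed size of the shuffled text approximates what
    the ordering was contributing.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    rng = random.Random(seed)
    original = compressed_bits(words)
    shuffled_total = 0
    for _ in range(trials):
        shuffled = words[:]
        rng.shuffle(shuffled)
        shuffled_total += compressed_bits(shuffled)
    # Average over several shuffles to smooth out luck in any one shuffle.
    return (shuffled_total / trials - original) / len(words)
```

On a short sample the number will be noisy, and a general-purpose compressor is a blunt instrument, so treat the output as a toy illustration of the idea and not as a measurement comparable to the published figures.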

Linguists often wrestle with how to define a word boundary, especially in agglutinative languages, of course, but even in languages with long-standing orthographic traditions: how many words is “I’m gonna get ’em”? How about “moi, je l’aime”? A phonologist, a syntactician, and a professional copy-editor (not to mention a teacher with an essay word-count assignment) will probably disagree. Matters are substantially worse if you’re a field linguist trying to record a dying agglutinative language, though: there, your word-boundary decisions are particularly vulnerable to reductionist approaches.

“Maybe every little unit is its own word”, says the (eager-to-claim-territory) syntactician over your shoulder, while the imperialist morphologist may want to clump together every frequent short expression. The former would argue that English has words /ɪŋ/, /ɨd/ and /ɨz/ “with some trivial cross-word phonological effects, probably performance-related”; the latter would claim a quotative verb morpheme complex: /ʃiwʌzlɑjk/ (with alternate form /hiwʌzlɑjk/, “inflecting for quoted-entity’s gender”). They each have a case, but they both feel, somehow, silly, like they’re overreaching. (I hope you, dear reader, share this intuition. If you don’t, hassle me in the comments.)

The Montemurro & Zanette result might make some morphologists nervous, or give the field workers some relief; it certainly gives me some. This result actually suggests an empirical answer to a question that is often answered as a matter of “elegance” or “taste” among linguists: how to choose a system for defining word boundaries. A system of word construction in language L that yields roughly the same word-order information as in other languages M, N, P, and Q is to be preferred over one that yields, with respect to the next word, too little information (the mad morphologist) or too much information (the mad syntactician).
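The selection idea can be sketched in code, under heavy assumptions: score each candidate segmentation of the same text by a rough word-order estimate, then prefer the one closest to a reference value drawn from other languages. Everything here is hypothetical: the zlib-based estimate is a crude proxy of my own choosing, the two "mad" segmenters are toy caricatures, and `target_bits` is whatever reference figure you trust.

```python
import random
import re
import zlib


def order_bits_per_word(words, trials=5, seed=0):
    """Crude proxy: extra compressed bits per word once the order is shuffled away."""
    if not words:
        raise ValueError("need at least one word")
    rng = random.Random(seed)

    def bits(seq):
        return 8 * len(zlib.compress(" ".join(seq).encode("utf-8"), 9))

    shuffled_bits = 0
    for _ in range(trials):
        shuffled = list(words)
        rng.shuffle(shuffled)
        shuffled_bits += bits(shuffled)
    return (shuffled_bits / trials - bits(words)) / len(words)


def split_suffixes(text):
    """Toy 'mad syntactician': treat word-final -ing, -ed and -s as separate words."""
    return re.sub(r"(?<=\w)(ing|ed|s)\b", r" \1", text).split()


def clump_pairs(text):
    """Toy 'mad morphologist': glue each pair of adjacent words into one word."""
    words = text.split()
    return ["".join(words[i:i + 2]) for i in range(0, len(words), 2)]


def closest_segmentation(text, segmenters, target_bits):
    """Return (name, scores): the segmenter whose estimate is nearest target_bits."""
    scores = {name: order_bits_per_word(seg(text)) for name, seg in segmenters.items()}
    best = min(scores, key=lambda name: abs(scores[name] - target_bits))
    return best, scores
```

A call might look like `closest_segmentation(text, {"whitespace": str.split, "split": split_suffixes, "clump": clump_pairs}, target_bits=...)`, with the target left for you to fill in. Whether a proxy this blunt would actually separate sensible segmentations from silly ones is an open question; the sketch only shows the shape of the comparison.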

It’s easy to take as dogma the idea that all human languages are equally capable of communication — I generally do — but it’s surprising and delightful to find that this dogma holds even in this subcomponent of language.

An afterthought: there are a lot of questions still to be explored in this paper. They’ve decided that Chinese “words”, for example, should be taken to be single characters, which is not necessarily a good decision. One must wonder whether their constant-information conclusion begs the question, smuggled in by some hidden assumption lurking in the way word boundaries were determined in the texts they have.

Mirrored from Trochaisms.
