order word constant information does have?

I’ve been reading — in my “copious free time”, as they say:

This article takes on some of the really nifty ideas of information theory, and asks one deceptively simple question:

how much information does word order contribute to human language?

It turns out that this question is not entirely straightforward to answer: languages vary tremendously in the information carried per word.  Agglutinative languages like Turkish or Inuktitut, for example, can pack a tremendous number of separate pieces of information into a single word; inflecting languages (Latin, Hindi, Greek) encode information about how words relate to each other in a sentence right on the words themselves; relatively isolating languages (English and, to a greater degree, Mandarin Chinese) tend to have fewer variants for a given word, and count on additional words and word order to encode those relationships.

So one might expect, for example, that word order bears more information in a language like English or Mandarin, and less in a highly inflected or agglutinative language like Latin or Turkish. But a quick summary of this study suggests that word order bears a roughly constant amount of information per word, even across these very large language differences, with their very large differences in the information content of individual words. Another way of saying this:

regardless of a language’s idea of how much information makes up a word, the amount by which word order reduces the surprisal of the next word is roughly constant.
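
To make "information contributed by word order" concrete, here is a minimal sketch in Python. It is not the estimator used in the study: it assumes a crude adjacent-pair (bigram) model and plug-in probabilities, and the function names are my own. The idea is only to compare how predictable the next word is in the original text against a shuffled copy of the same words, where order has been destroyed but word frequencies are untouched.

```python
import math
import random
from collections import Counter

def unigram_entropy(tokens):
    """Entropy in bits per token when order is ignored (plug-in estimate)."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(tokens):
    """H(next | previous) in bits, estimated from adjacent-pair counts."""
    if len(tokens) < 2:
        return 0.0
    pairs = Counter(zip(tokens, tokens[1:]))
    prev = Counter(tokens[:-1])
    n = len(tokens) - 1
    return -sum(c / n * math.log2(c / prev[a]) for (a, _), c in pairs.items())

def word_order_information(tokens, seed=0):
    """Assumed proxy: how much less surprising the next word is in the
    original order than in a shuffled copy with the same word counts."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return conditional_entropy(shuffled) - conditional_entropy(tokens)
```

On small samples these plug-in estimates are badly biased, so the numbers mean little in absolute terms; running the same estimator on the shuffled copy only partly compensates. Treat it as an illustration of the quantity being measured, not a replication.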

Linguists often wrestle with how to define a word boundary, especially in agglutinative languages, of course, but even in languages with long-standing orthographic traditions: how many words is “I’m gonna get ’em”?  How about “moi, je l’aime”?  A phonologist, a syntactician, and a professional copy-editor (not to mention a teacher with an essay word-count assignment) will probably disagree.  Matters are substantially worse if you’re a field linguist trying to record a dying agglutinative language, where the analysis is particularly vulnerable to reductionist approaches.

“Maybe every little unit is its own word”, says the (eager-to-claim-territory) syntactician over your shoulder, while the imperialist morphologist may want to clump together every frequent short expression. The former would argue that English has words /ɪŋ/, /ɨd/ and /ɨz/ “with some trivial cross-word phonological effects, probably performance-related”; the latter would claim a quotative verb morpheme complex: /ʃiwʌzlɑjk/ (with alternate form /hiwʌzlɑjk/, “inflecting for quoted-entity’s gender”).  They each have a case, but they both feel, somehow, silly, as if they’re overreaching. (I hope you, dear reader, share this intuition. If you don’t, hassle me in the comments.)

The Montemurro & Zanette result might make some morphologists nervous, or give the field workers some relief; it certainly gives me some. This result actually suggests an empirical answer to a question that linguists often settle as a matter of “elegance” or “taste”: how to define word boundaries.  A system of word construction for language L that yields roughly the same word-order information as in other languages M, N, P, and Q is to be preferred over one that yields, with respect to the next word, too little information (the mad morphologist) or too much information (the mad syntactician).
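
As a rough sketch of that selection criterion, the Python below scores several candidate segmentations of the same text and picks the one whose word-order information lands closest to a target value. Everything here is an assumption for illustration: the bigram-based score is a stand-in for the study's measure, the three segmenters are toy caricatures, and `target_bits` is a hypothetical parameter standing in for "what other languages show".

```python
import math
import random
from collections import Counter

def conditional_entropy(tokens):
    """H(next | previous) in bits, estimated from adjacent-pair counts."""
    if len(tokens) < 2:
        return 0.0
    pairs = Counter(zip(tokens, tokens[1:]))
    prev = Counter(tokens[:-1])
    n = len(tokens) - 1
    return -sum(c / n * math.log2(c / prev[a]) for (a, _), c in pairs.items())

def order_information(tokens, seed=0):
    """Assumed proxy: predictability gained from the original order,
    relative to a shuffled copy with the same token counts."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return conditional_entropy(shuffled) - conditional_entropy(tokens)

def split_everything(text):
    """Caricature of the splitter: every non-space character is a 'word'."""
    return [ch for ch in text if not ch.isspace()]

def split_on_spaces(text):
    """Conventional orthographic words."""
    return text.split()

def clump_pairs(text):
    """Caricature of the clumper: glue each pair of adjacent words together."""
    words = text.split()
    return [" ".join(words[i:i + 2]) for i in range(0, len(words), 2)]

def closest_segmentation(text, segmenters, target_bits):
    """Return the segmenter name whose score is nearest target_bits,
    along with all the scores."""
    scores = {name: order_information(seg(text)) for name, seg in segmenters.items()}
    best = min(scores, key=lambda name: abs(scores[name] - target_bits))
    return best, scores
```

A call might look like `closest_segmentation(text, {"chars": split_everything, "words": split_on_spaces, "pairs": clump_pairs}, target_bits=...)`, with the target taken from whatever comparable estimate you trust for other languages. The comparison is only meaningful if every language and segmentation is scored with the same estimator on similar amounts of text.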

It’s easy to take as dogma the idea that all human languages are equally capable of communication — I generally do — but it’s surprising and delightful to find that this dogma holds even in this subcomponent of language.

An afterthought: there are a lot of questions still to be explored in this paper. They’ve decided that Chinese “words”, for example, should be taken to be single characters, which is not necessarily a good decision. One must wonder if the constant-information conclusion they’ve come to is a question somehow begged by some hidden assumption lurking in the mechanisms of determining word boundaries in the texts they have.

