June 5th, 2011


Twitterlog for the week of 2011-06-05

  • Living room reading corner. I love the light here. http://picplz.com/x3gp #
  • Yes, this IS a baseball signed by Chomsky. A graduation gift from my aunt. http://picplz.com/xjBb #
  • Seattle weather comes to visit San Francisco (@ San Francisco Caltrain Station) http://picplz.com/gF4R #
  • .@KevinBMcGowan has a brilliant PI and does great PR for their joint research: http://t.co/zsZYSmB #
  • http://t.co/QNJVsVK holy crap people: another lunatic attacking bikes with a car. Attn @sfbc, can we catch this dude like yesterday please? #
  • Oops. Hacked the security guy's presentation at work. Eggs on faces all around. #
  • standards nerds (@soypunk @hober @jessnevins ?) may enjoy this deliberately bass ackwards 'splainer of 'cal 9 1752' http://bit.ly/ka1Fjr #
  • Ugh, it's horrible watching suicide updates in real time from @caltrain #
  • Yes, it's true — @imtboo & I are moving back to the Emerald City. Many details to work out, but short version: SFO→SEA, 2011/09 #
  • There is a straight line from Peanuts → Calvin & Hobbes → "Cul de Sac". Here: accurate forays into steganopragmatics: http://bit.ly/kg7ft2 #
  • Euro folks &c: know an Irish linguist (ally) with a spare couch? @maryam_bakht is couchsurfing Dublin and/or Limerick this weekend. #
  • There really is a "Fairy & Human Relations Congress" http://bit.ly/l01GUc : hippies with all the naming pizzazz of the UNHRC
    (h/t @eldang) #
  • No, drunk Giants fan on Caltrain. This is BIKE CAR; thus not your personal piss spot. Had it been labeled "butthead car", maybe better luck. #
  • At home with cloudy weather on a Friday evening playing guitar and drinking a beer. Not exactly living large but it'll do. #
  • Oy. Left laptop outside front door after taking it off my bike when I got home. Luckily wife found it before trash pickers did. #iamanidiot #

Mirrored from Trochaisms.


order word constant information does have?

I’ve been reading — in my “copious free time”, as they say:

Montemurro MA, Zanette DH, 2011 “Universal Entropy of Word Ordering Across Linguistic Families.” PLoS ONE 6(5): e19875. doi:10.1371/journal.pone.0019875

This article takes on some of the really nifty ideas of information theory, and asks one deceptively simple question:

how much information does word order contribute to human language?

It turns out that this question is not entirely straightforward to answer: languages vary tremendously in the information carried per word. Agglutinative languages like Turkish or Inuit, for example, are capable of encoding a tremendous number of separate information particles into a single word; inflecting languages (Latin, Hindi, Greek) encode information about how words relate to each other in a sentence right on the words themselves; relatively isolating languages (English, and to a greater degree, Mandarin Chinese) tend to have fewer variants for a given word, and count on additional words and word order to encode those same relationships.

So one might expect, for example, that word order bears more information in a language like English or Mandarin, but less information in a highly-inflected or -agglutinative language like Latin or Turkish. But a quick summary of this study suggests that word order bears a constant information rate, even across these very large language differences, with their very large differences in the information content of individual words. Another way of saying this:

regardless of a language’s idea of how much information makes up a word, the amount by which word order reduces the surprisal of the next word is roughly constant.
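For the code-minded, here is a toy sketch of what "information carried by word order" can mean. It is not the estimator from the paper: it assumes a crude bigram model, and it assumes that comparing a text against a shuffled copy of itself is a fair way to isolate the contribution of order. The function names and the sample sentence are invented for illustration, and bigram estimates on small samples are known to be biased, so read the numbers as qualitative only.

```python
import math
import random
from collections import Counter

def unigram_entropy(words):
    """Entropy in bits per word when order is ignored entirely."""
    counts = Counter(words)
    total = len(words)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def bigram_conditional_entropy(words):
    """Average surprisal (bits) of the next word given the previous word."""
    pair_counts = Counter(zip(words, words[1:]))
    prev_counts = Counter(words[:-1])
    total = len(words) - 1
    h = 0.0
    for (prev, _next), c in pair_counts.items():
        h -= c / total * math.log2(c / prev_counts[prev])
    return h

def order_information(words):
    """Toy proxy: bits per word gained by knowing the previous word."""
    return unigram_entropy(words) - bigram_conditional_entropy(words)

# Invented sample text, repeated so the bigram counts are not too sparse.
text = ("the dog chased the cat and the cat chased the mouse "
        "and the mouse chased the dog").split() * 50

# Same words, same frequencies, order destroyed (fixed seed for repeatability).
shuffled = text[:]
random.Random(0).shuffle(shuffled)

print("original:", order_information(text))
print("shuffled:", order_information(shuffled))
```

Shuffling leaves the word frequencies untouched, so whatever gap remains between the two printed numbers is, in this toy model, the part of the information that lives in the ordering.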

Linguists often wrestle with how to define a word boundary, especially in agglutinative languages, of course, but even in languages with long-standing orthographic traditions: how many words is “I’m gonna get ’em”? How about “moi, je l’aime”? A phonologist, a syntactician, and a professional copy-editor (not to mention a teacher with an essay word-count assignment) will probably disagree. Matters are substantially worse, and particularly vulnerable to reductionist approaches, if you’re a field linguist trying to record a dying agglutinative language.

“Maybe every little unit is its own word”, says the (eager-to-claim-territory) syntactician over your shoulder, while the imperialist morphologist may want to clump together every frequent short expression. The former would argue that English has words /ɪŋ/, /ɨd/ and /ɨz/ “with some trivial cross-word phonological effects, probably performance-related”; the latter would claim a quotative verb morpheme complex: /ʃiwʌzlɑjk/ (with alternate form /hiwʌzlɑjk/, “inflecting for quoted-entity’s gender”). They each have a case, but they both feel, somehow, silly, like they’re overreaching. (I hope you, dear reader, share this intuition. If you don’t, hassle me in the comments.)

The Montemurro & Zanette result might make some morphologists nervous, or give the field workers some relief; it certainly gives me some. This result actually suggests an empirical answer to a question that is often answered as a matter of “elegance” or “taste” among linguists: how to define word boundaries. A system of word construction in language L that yields roughly the same word-order information as other languages M, N, P, and Q is to be preferred over one that yields, with respect to the next word, too little information (the mad morphologist) or too much information (the mad syntactician).
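That preference can be written down as a selection rule. The sketch below is only an illustration of the idea, not a method from the paper: the three segmentations of the same invented utterances, the bigram-based `order_information` proxy, and especially `REFERENCE_BITS` are all hypothetical stand-ins. In practice the reference value would have to come from measurements on other languages, made with a far better entropy estimator than this one.

```python
import math
from collections import Counter

def order_information(tokens):
    """Toy proxy (bits): unigram entropy minus next-token entropy given the previous token."""
    n = len(tokens)
    unigrams = Counter(tokens)
    h_unordered = -sum(c / n * math.log2(c / n) for c in unigrams.values())
    pairs = Counter(zip(tokens, tokens[1:]))
    prev = Counter(tokens[:-1])
    h_ordered = -sum(c / (n - 1) * math.log2(c / prev[p])
                     for (p, _next), c in pairs.items())
    return h_unordered - h_ordered

def closest_segmentation(candidates, reference_bits):
    """Name of the candidate whose order information is nearest the reference."""
    return min(candidates,
               key=lambda name: abs(order_information(candidates[name]) - reference_bits))

# Three invented ways of cutting the same utterances into "words".
candidates = {
    # ordinary spelling
    "orthographic": ("she was like wow and he was like no "
                     "and she walked home and he walked home").split() * 20,
    # the mad syntactician: every affix is its own word
    "split": ("she was like wow and he was like no "
              "and she walk ed home and he walk ed home").split() * 20,
    # the mad morphologist: frequent expressions clumped together
    "clumped": ("shewaslike wow and hewaslike no "
                "and she walked home and he walked home").split() * 20,
}

REFERENCE_BITS = 1.5  # hypothetical cross-language reference, made up for this sketch

best = closest_segmentation(candidates, REFERENCE_BITS)
for name, tokens in candidates.items():
    print(name, order_information(tokens))
print("closest to reference:", best)
```

Which candidate wins here says nothing about English; the sample is far too small and the reference value is invented. The point is only that "elegance" gets replaced by a number you can compare.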

It’s easy to take as dogma the idea that all human languages are equally capable of communication — I generally do — but it’s surprising and delightful to find that this dogma holds even in this subcomponent of language.

An afterthought: there are a lot of questions still to be explored in this paper. They’ve decided that Chinese “words”, for example, should be taken to be single characters, which is not necessarily a good decision. One must wonder if the constant-information conclusion they’ve come to is a question somehow begged by some hidden assumption lurking in the mechanisms of determining word boundaries in the texts they have.

Mirrored from Trochaisms.