love, play & inquiry (trochee) wrote,

historical data-management

I've spent the last two weeks (!) trying to figure out how to relate no fewer than five different kinds of truth.

Before anybody thinks I've gone mystic, I should clarify: in speech recognition research, and other machine-learning contexts, truth refers to the right answer. We have hours and hours of conversations, transcribed by listeners at the Linguistic Data Consortium.

Unfortunately, there has been more than one pass at coming up with the right words -- the right truth. And derivative data like treebanks are based on one version, and not always the latest, best one. So to do the kind of work I'm doing -- relating treebank annotation to prosody annotation -- I have to relate the latest, best truth words (for which we have prosody annotations) to the substantially older truth words that the treebanks were based on.

This word-alignment was supposed to be about a day's work in coding. But it's turned into two weeks of tedious examination of the various versions of truth words, trying to discover the differences and reproduce the various changes and script-based normalizations that got us from the old bad truth to a new and better truth.
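For the curious, here is roughly what the "day's work" version of the alignment looks like. This is a minimal sketch in Python, not the actual scripts (those are Perl, and far uglier). It assumes each version of a transcript is just a list of word tokens, and the `normalize` and `align_words` names and the normalization rule itself are made up for illustration. The real normalizations are exactly the part that took two weeks to dig up.

```python
from difflib import SequenceMatcher


def normalize(word):
    """Hypothetical normalization: lowercase and strip surrounding punctuation.

    The real corpora need many more rules than this; this one is an assumption.
    """
    return word.lower().strip('.,?!"')


def align_words(old_words, new_words):
    """Align two versions of a transcript, word by word.

    Returns a list of (tag, old_chunk, new_chunk) tuples, where tag is one of
    'equal', 'replace', 'delete', or 'insert' (the difflib opcode names).
    Matching is done on normalized forms, but the original tokens are returned.
    """
    old_norm = [normalize(w) for w in old_words]
    new_norm = [normalize(w) for w in new_words]
    # autojunk=False keeps very frequent words from being ignored as "junk".
    matcher = SequenceMatcher(a=old_norm, b=new_norm, autojunk=False)
    return [
        (tag, old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


if __name__ == "__main__":
    old = ["Uh", "I", "do", "n't", "know"]  # older, treebank-style tokens
    new = ["uh", "i", "don't", "know"]      # newer transcription
    for tag, old_chunk, new_chunk in align_words(old, new):
        print(tag, old_chunk, new_chunk)
```

The 'replace' chunks are where the interesting history lives: each one is a spot where the two versions of the truth disagree, and where some undocumented change has to be tracked down.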

It feels, in an ironic way, like I am doing historical linguistics. Each version of the truth words is a different attested language, and I'm trying to work out how they all relate to each other by looking for mechanisms of change (digging around in the misleading, wrong, lost, or never-written documentation) and grouping together those corpora that seem similar. I'm effectively using the Historical Method, except I'm doing it the way that the historical linguists never could until recently -- with Perl and emacs in hand, hammer-and-a-nail.

It's actually been an interesting project (and it's almost done, which is what I've been saying about it for about 13 days of the last two weeks). The frustrating thing is that of all the cleverness in data-munging I've done, and all the careful code- and data-archaeology that I've done to get here, none of it is publishable. I'm just hoping that the other researchers I'm doing this for are grateful enough to put me in as a secondary author.

reading list:
The latest issue of The Nation, headlined The Coronation of George W. Bush: the GOP Convention Issue