Language Computeer
Fists of irony
trochee
twit haiku hacking

For those of you not following along on the Twitter thing, you may not know that I've built a bot:
Ohaihaiku[1] is a twitterbot: a 'robot' (automated script) that identifies and retweets English-language haikus. Some of them are pretty good:
Spring frock in warmth yesterday /
to tuque and snow today /
Oh Calgary

Here's how it works.
From time to time (at the moment, whenever I remember; I haven't automated this part completely yet) it looks at the twitterers it's following, and looks over all their tweets. Each tweet is cleaned up slightly, and then checked to see whether it's 17 syllables. If it is, the tweet is reformatted as nearly as possible into 5/7/5, and retweeted.
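
A minimal sketch of that check, in Python rather than the bot's own code (which isn't shown here and appears to be Perl). Everything in it is an assumption made for illustration: the function names, the cleanup rules (dropping URLs and @mentions), and especially `count_syllables`, which is a crude vowel-group estimate standing in for a real pronunciation dictionary. Line breaks are placed at the word boundaries whose running syllable totals come closest to 5 and 12, which is one way to read "as nearly as possible".

```python
import re

def count_syllables(word):
    """Crude vowel-group estimate; a stand-in for a real pronunciation dictionary."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Discount a final silent 'e' ("tuque", "silence") but not "-le" ("little").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def clean(tweet):
    """Hypothetical cleanup: drop URLs and @mentions, keep alphabetic words."""
    text = re.sub(r"https?://\S+|@\w+", " ", tweet)
    return re.findall(r"[A-Za-z']+", text)

def find_haiku(tweet, syllables=count_syllables):
    """Return three lines if the tweet totals 17 syllables, else None."""
    words = clean(tweet)
    if len(words) < 3:
        return None
    # running[i] is the syllable total through word i.
    running, total = [], 0
    for word in words:
        total += syllables(word)
        running.append(total)
    if total != 17:
        return None

    def nearest_break(target, start, stop):
        # Index of the word after which a line break lands closest to target.
        return min(range(start, stop), key=lambda i: abs(running[i] - target))

    first = nearest_break(5, 0, len(words) - 2)
    second = nearest_break(12, first + 1, len(words) - 1)
    return [" ".join(words[:first + 1]),
            " ".join(words[first + 1:second + 1]),
            " ".join(words[second + 1:])]
```

Because the breaks only ever fall between words, a 17-syllable tweet can still come out as something like 6/6/5; the sketch accepts that rather than rejecting the tweet.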

Somewhat less often, ohaihaiku checks the twitter public timeline and pulls in a couple hundred tweets and puts them through the same process. If a haiku is found, it retweets it and "follows" that user.

Lastly, if another user 'follows' ohaihaiku, it will reciprocate[2] -- and if a user 'unfollows' it, it will unfollow in return. I try to make it polite like that.
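
The follow-back etiquette comes down to a few set operations. The sketch below is Python, not the bot's code, and assumes the bot keeps a snapshot of its followers from the previous run; `plan_follow_changes` and its parameters are hypothetical names, and no Twitter API calls are shown. Comparing against the snapshot means only users who actually unfollowed get dropped, so accounts followed because of a haiku stay followed.

```python
def plan_follow_changes(previous_followers, current_followers, following, blocked=()):
    """Work out whom to follow and unfollow on this run.

    previous_followers: followers seen on the last run (assumed to be stored)
    current_followers:  followers seen now
    following:          accounts the bot currently follows
    blocked:            accounts never to follow back (e.g. spammers)
    """
    previous_followers = set(previous_followers)
    current_followers = set(current_followers)
    following = set(following)

    # Reciprocate: follow any current follower we don't follow yet.
    to_follow = current_followers - following - set(blocked)

    # Unfollow only those who used to follow the bot and have since left.
    left = previous_followers - current_followers
    to_unfollow = left & following

    return to_follow, to_unfollow
```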

challenges:

  • There are a few challenges here still. It's not fully-automated -- I am not quite ready to put it on a server and leave it alone overnight, even though it runs all its checks with a one-word command.
  • I'd like to build out its English dictionary. It's currently based off marnanel's excellent Lingua::EN::Phoneme, which itself is built on the CMU dictionary, and when it cannot find a word's pronunciation there it falls back to some more robust hyphenation code for syllable counts. But there are lots of words (e.g. twittering) that these approaches are less than ideal for.
  • I'd like to look into some of the easier number and acronym normalizations: "lol" should be one syllable, but surely "OMG" should be three?
  • I'd like to put in some language-id. This is challenging for two related reasons: (1) tweets are very short, and (2) the usual approach -- character trigrams[3] -- will be very sparse on such short texts. Still, it would help with the occasional accidental processing of Polish or (more problematic) German. I could probably set a threshold for the percentage of words that are not in the dictionary, but I'd rather do this the Right Way with character trigrams.
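
For the language-id item, here is a toy sketch of character-trigram matching in Python. It is not the bot's code. The two one-sentence training samples, the function names, and the choice of cosine similarity between trigram count vectors are all assumptions made for illustration; a usable version would need far larger samples per language. It also makes the sparsity problem concrete: a short tweet produces only a few dozen trigrams to compare.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Lowercase, collapse whitespace, pad with spaces, and slide a 3-char window."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def profile(text):
    """Trigram counts for a piece of text."""
    return Counter(char_trigrams(text))

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Toy training text (an assumption; real profiles need much more data).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and then it runs through the snow in the spring",
    "de": "der schnelle braune fuchs springt über den faulen hund "
          "und dann läuft er durch den schnee im frühling",
}
PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}

def guess_language(text):
    """Return (best_language, similarity) for the closest trigram profile."""
    tweet_profile = profile(text)
    scores = {lang: cosine(tweet_profile, p) for lang, p in PROFILES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A tweet whose best similarity is very low could be skipped as "probably not English", though picking that cutoff would take some experimenting.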

Check it out! Enjoy, and let me know if you see anything odd. If anybody wants to peek at the code, I can point you to some of my git repositories; several of the modules I built to support this will become CPAN distributions of their own in time.

1. Get it? It's like "O Hai - Haiku!"... well, I thought it was clever at the time, but the fact is, ohaihaiku does crap with mis-spellings.
2. I reserve the right to un-follow and/or block spammers or people who post too many non-words.
3. Character trigrams should probably be called tri-glyphs, but what can you do?
Comments
From: evan Date: April 23rd, 2009 01:01 am (UTC)
I love the name.
From: trochee Date: April 23rd, 2009 03:11 am (UTC)
thanks! I thought it was clever too.
From: isolt Date: April 23rd, 2009 02:38 pm (UTC)
rockingly awesome. *follows*

(sadly, it probably won't recognize the Japanese-language tanka I tweeted the other day!)
From: trochee Date: April 23rd, 2009 05:07 pm (UTC)
god no. though I think identifying Japanese haiku might actually be harder, given the multiple-pronunciation problem.
From: q_pheevr Date: April 23rd, 2009 02:44 pm (UTC)

That's really nifty. The example you quote in this post is really very haiku-like in spirit as well as in form, but I also get a kick out of the fact that ohaihaiku is picking up metrically perfect[1] haiku from 10 Downing Street:

Alcohol duties
Will go up by 2 per cent
From midnight tonight
"lol" should be one syllable, but surely "OMG" should be three?

I tend to pronounce OMG in my head as one syllable,[2] but then I have a rather high tolerance (or possibly even penchant) for consonant clusters that disobey English phonotactics. I'm curious, though, about the extended version ZOMG, in which (afaik) the Z doesn't stand for anything -- would that be four syllables for you? (I have no trouble thinking of OMG as /o ɛm ʤi/ even if I kind of prefer /ɒmɡ/, but ZOMG for me pretty much has to be /zɒmɡ/, and not /zɛd o ɛm ʤi/.)

Character trigrams should probably be called tri-glyphs but what can you do?

How about "trigraphs"?


1. Assuming the syllable-counting English version of haiku meter, of course, since a mora-counting 5/7/5 would be ridiculously constraining in English.

2. And I don't pronounce it outside my head at all.

From: trochee Date: April 23rd, 2009 05:14 pm (UTC)
hah, all very clever comments (and yes the Downing Street memo, if you will, is possibly my favorite),

You know, regarding footnote 1: I considered constraining it to 5/7/5 strictly, but that cuts out about three-quarters of the haikus it's detecting. Also unmentioned is the formal though extremely high-level constraint that a haiku is supposed to contain a (possibly elliptical) reference to the seasons; I've just left that out entirely.

funny enough, I kind of treat "omg" as two syllables, because I want it to be /om gə/ or /om gɐ/, and even more so for ZOMG (i cannot imagine /zɛd o ɛm ʤi/ either). But I think that these are mostly textual, so syllable counts get hard. How many syllables are in "^__^" anyway?

I do like 'trigraphs'.
From: q_pheevr Date: April 23rd, 2009 07:10 pm (UTC)
How many syllables are in "^__^" anyway?

Ten.

I considered constraining it to 5/7/5 strictly, but that cuts out about three-quarters of the haikus it's detecting.

I definitely think you were right to be more lenient about the line breaks—it would be a shame to miss out on gems like the Calgary haiku (which fulfils the seasonal requirement) just for the sake of following strictly a rule that is, after all, an adaptation. (Cole and Miyashita (2006) claim that traditional Japanese verse is actually based on eight-mora lines—there's one silent mora at the end of the seven-mora lines, and three at the end of the five-mora lines—and note that there's some room for variation: you can substitute one extra pronounced mora for a silent mora.)

From: trochee Date: April 23rd, 2009 07:26 pm (UTC)
any argument for three silent morae seems a little contrived, unless they're actually recording pronunciations. Is that the argument?

and your hyper-joke is awesome.
From: q_pheevr Date: April 23rd, 2009 09:58 pm (UTC)

Yep. They recorded native speakers reading tanka, and measured the durations of lines and pauses:

                                             5-mora line   7-mora line
mean duration of spoken portion:             801 ms        1060 ms
mean total duration including final pause:   1216 ms       1195 ms

So the five- and seven-mora lines are padded with silence to achieve essentially equal duration.

From: trochee Date: April 23rd, 2009 10:02 pm (UTC)
neat! hey look everybody! this is what it means when most of us say 'empirical'.
From: soliss Date: April 23rd, 2009 06:08 pm (UTC)
Yes, way to mention the mora thing. :)

I tend to pronounce zomg as "zo my god" in my head. Three syllables...