love, play & inquiry (trochee) wrote,

Perl and Penn Treebank musings

Hmm... I'm looking very closely at creating a CPAN module for dealing with the Penn Treebank markup format. (This is a way of marking up the grammatical structure of sentences.)

( (INTJ (UH Hello) (-DFL- E_S) ))
( (S
    (NP-SBJ (DT this) )
    (VP (VBZ is)
      (NP-PRD (NNS Lois) ))
    (. .) (-DFL- E_S) ))

It's actually an elegant format, and now that Perl 5.8 ships with Text::Balanced, parsing it takes only a short snippet of code:

use Text::Balanced 'extract_bracketed'; # thanks Damian!
sub get_tags {
    # pass it a complete constituent, it returns the tag plus a list
    # of its subconstituents. If subconstituents themselves have
    # structure, then they will be arrayrefs
    local $_ = shift;
    my ($tag, $children) = /^ \( ( [\S]* ) \s (.*\S) \s* \) $/sx;

    my @children;
    # test length, not truth, so a lone "0" token is not skipped
    while (defined $children and length $children) {
        my $child = extract_bracketed($children, '()');
        if (defined $child) {
            # child is itself a constituent
            $child = [ get_tags($child) ];
        } else {
            # this is a word; we're done
            ($child, $children) = ($children, '');
            warn "trouble -- two tokens in preterminal" if @children;
        }
        push @children, $child;
    }
    return ($tag, @children);
}
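
The loop above leans on how extract_bracketed behaves in scalar context: it hands back the first balanced chunk and cuts it out of the input string, and it returns undef (leaving the string alone) when the text doesn't start with a bracket. Here's a tiny sketch of just that behavior. The strings in $text and $word are made up for illustration, not taken from a real treebank file.

```perl
use strict;
use warnings;
use Text::Balanced 'extract_bracketed';

# $text is a made-up string standing in for the children of one constituent
my $text = '(DT this) (NN dog)';

# scalar context: the first balanced chunk is returned and cut out of $text
my $first = extract_bracketed($text, '()');
print "first: $first\n";        # first: (DT this)

# leading whitespace is skipped by default, so the next call gets the next chunk
my $second = extract_bracketed($text, '()');
print "second: $second\n";      # second: (NN dog)
print "left over: '$text'\n";   # left over: ''

# a bare word has no opening bracket, so extraction fails with undef
# and $word is left untouched
my $word = 'Lois';
my $none = extract_bracketed($word, '()');
print "word left alone: $word\n" unless defined $none;
```

That undef-on-failure case is what lets get_tags tell a nested constituent from a plain word without any lookahead of its own.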

