The first body chapter of my dissertation is on re-ranking speech-recognition hypotheses using grammatical-structure (parse) information. I'm doing a number of things that are new (which I won't go into here), but I'm evaluating the reranking's success by measuring its word-error rate (WER).
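For reference, WER is the usual edit-distance metric: (substitutions + deletions + insertions) divided by the length of the reference. A minimal sketch of how it's computed (names hypothetical, not my actual evaluation code):

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over words:
    (substitutions + deletions + insertions) / len(reference)."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all of r[:i]
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all of h[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(r)][len(h)] / len(r)
```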
I am exploring which features (classes of information) are useful in doing this reranking -- in particular, which features from parse structure are useful (in speech recognition, it's pretty well established that the recognizer's own scores are worth paying attention to). So I am considering two scenarios:
- [in addition to speech-recognizer scores], add only the "parse-quality" scalar
- [as above, but also include] a very long (dimension 20k or so) vector of non-local features, like "count of NPs"
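Concretely, the two scenarios differ only in whether the big sparse vector is appended. A sketch of the two feature extractors, assuming each hypothesis carries its recognizer scores, the parse-quality scalar, and a sparse dict of parse counts (all names hypothetical):

```python
def features_parselm(hyp):
    """Scenario 1: recognizer scores plus the single parse-quality scalar."""
    return {"am_score": hyp["am_score"],
            "lm_score": hyp["lm_score"],
            "parse_quality": hyp["parse_quality"]}

def features_fullfeats(hyp):
    """Scenario 2: as above, plus the ~20k-dimensional sparse vector of
    non-local parse features (e.g. "count of NPs")."""
    feats = features_parselm(hyp)
    feats.update(hyp["parse_feats"])  # sparse dict: feature name -> count
    return feats
```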
Results (WER):

- baseline: 0.236361
- parselm: 0.230343
- fullfeats: 0.230343
- oracle: 0.161255 [best possible rerank]

So here's the mystery: why is fullfeats getting exactly the same WER as parselm? With 20k additional features in the vector, I'd expect that it might even get worse ("the curse of dimensionality"), but I wouldn't expect these results to be exactly the same.
My advisor has suggested that there may be a bug in my code, so that is today's Big Question: work backwards through the pipeline to figure out whether these models are "accidentally" producing exactly the same results (which would mean I may have to re-evaluate what learner I'm using), or whether something more severe has gone wrong (which would actually be more of a relief, because I want the improvement to be larger than 0.6 points of WER, and I'm also looking into why it wasn't).
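One way to start working backwards (assumed pipeline shapes, names hypothetical, not my actual code): if two linear rerankers tie exactly on WER, first check whether they pick the identical hypothesis on every utterance; if they do, the next thing to inspect is whether the learned weights on the extra 20k features are all effectively zero, which would explain the tie and point back at the learner or the feature files.

```python
def pick_best(nbest, weights):
    """Index of the hypothesis with the highest linear score under `weights`.
    Each hypothesis is a dict with a sparse feature map under "feats"."""
    def score(hyp):
        return sum(weights.get(f, 0.0) * v for f, v in hyp["feats"].items())
    return max(range(len(nbest)), key=lambda i: score(nbest[i]))

def compare_models(corpus, w_parselm, w_fullfeats):
    """Utterance indices where the two models choose different hypotheses.
    An empty list means identical selections, hence identical WER."""
    return [u for u, nbest in enumerate(corpus)
            if pick_best(nbest, w_parselm) != pick_best(nbest, w_fullfeats)]
```

If `compare_models` comes back empty, diffing the two weight vectors over the 20k extra feature names is the natural next step; if it's non-empty, the tie in WER is a coincidence of the scoring and the bug hunt moves downstream to the evaluation.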