March 7th, 2004


too good to be true?

I'm working with a previously-mentioned SVM classifier kit for text classification for a class project.

This morning, I was getting 74% accuracy on my test set. But with the correction of one (fairly serious) bug, I jumped to 90% accuracy -- and now I'm getting the following results:

vocab size  accuracy
1000        0.891228
1500        0.887719
2000        0.900000
2500        0.900000
3000        0.903509
4000        0.905263
5000        0.912281
5500        0.917544  <= !!!
6000        0.912281
7500        0.907018
10000       0.896491
This makes me nervous, because it's possibly too good to be true. The classifier in that highlighted case is using only 5500 words to classify a conversation transcript into one of 67 topics -- entirely from the statistical distributions of the words.

That's right. No n-grams, no syntactic knowledge, no boosting from WordNet (although I want to try all those things). And it still gets almost 92% accuracy?
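For the curious: the setup above is plain bag-of-words into a linear SVM. This isn't my actual kit, but a minimal sketch of the idea using scikit-learn, with toy stand-in data (the real project used conversation transcripts labeled with one of 67 topics; the texts, labels, and `build_classifier` helper here are all hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy stand-in data -- not the real transcripts or topic labels.
train_texts = [
    "stocks bonds market trading",
    "market prices stocks rally",
    "goalie scored hockey puck",
    "hockey team won the game",
]
train_labels = ["finance", "finance", "sports", "sports"]

def build_classifier(vocab_size):
    # Keep only the `vocab_size` most frequent words, then fit a
    # linear SVM on raw word counts -- no n-grams, no syntax.
    return make_pipeline(
        CountVectorizer(max_features=vocab_size),
        LinearSVC(),
    )

clf = build_classifier(5500)
clf.fit(train_texts, train_labels)
print(clf.predict(["the market rally continued"]))
```

Sweeping `vocab_size` over the values in the table and scoring on a held-out dev set would reproduce the kind of accuracy-vs-vocabulary curve shown above.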

I certainly hope it's for real, but I feel like I'd better double-check my numbers... We'll be tested on how our classifier performs on the unlabeled test set, not the dev set.
