I'm working with a previously-mentioned SVM classifier kit to do text classification for a class project.
This morning, I was getting 74% accuracy on my test set. But with the correction of one (fairly serious) bug, I jumped to 90% accuracy -- and now I'm getting the following results:
    Words   Accuracy
     1000   0.891228
     1500   0.887719
     2000   0.900000
     2500   0.900000
     3000   0.903509
     4000   0.905263
     5000   0.912281
     5500   0.917544  <= !!!
     6000   0.912281
     7500   0.907018
    10000   0.896491

This makes me nervous, because it's possibly too good to be true. The classifier in that highlighted case is using only 5500 words to classify a conversation transcript into one of 67 topics -- entirely with the statistical distributions of the words.
That's right. No n-grams, no syntactic knowledge, no boosting from WordNet (although I want to try all those things). And it still gets almost 92% accuracy?
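To make "just the words" concrete, here is a minimal sketch of a plain bag-of-words representation. It assumes the vocabulary is simply the N most frequent words in the training transcripts and that tokens are split on whitespace; the kit may select and weight features differently, and `build_vocabulary` and `to_count_vector` are hypothetical names for illustration only.

```python
from collections import Counter


def build_vocabulary(documents, size):
    """Return the `size` most frequent lowercase tokens across all documents.

    Assumption: frequency-based selection with whitespace tokenization.
    This is an illustration, not the classifier kit's actual feature selection.
    """
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return [word for word, _ in counts.most_common(size)]


def to_count_vector(document, vocabulary):
    """Map one document to a list of per-word counts, in vocabulary order.

    Words outside the vocabulary are ignored, and word order is discarded,
    so no n-gram or syntactic information survives.
    """
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


# Toy usage: a real run would use thousands of words, not two.
training_docs = ["the cat sat", "the dog sat down", "the end"]
vocabulary = build_vocabulary(training_docs, 2)
features = to_count_vector("the cat and the dog", vocabulary)
```

Vectors like `features` are what would be handed to the SVM; the vocabulary size is the number being varied in the results above.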
I certainly hope it's for real, but I feel like I'd better double-check my numbers... We'll be tested on how our classifier performs on the unlabeled test set, not the dev set.
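One way to double-check is to ask how much of the difference between rows could be noise. The sketch below computes a Wilson score interval (roughly 95% for `z = 1.96`) for an accuracy measured on a finite dev set. The dev set size of 570 is an assumption: it is consistent with the reported accuracies (for example, 523/570 rounds to 0.917544), but it is not confirmed, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    correct: number of correctly classified items
    total:   number of items evaluated (must be positive)
    z:       normal quantile; 1.96 gives an approximate 95% interval
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: 570 dev transcripts, inferred from the reported figures.
DEV_SIZE = 570
best_low, best_high = wilson_interval(523, DEV_SIZE)  # the 5500-word row
next_low, next_high = wilson_interval(520, DEV_SIZE)  # the 5000- and 6000-word rows
print(f"5500 words: {best_low:.3f} to {best_high:.3f}")
print(f"5000 words: {next_low:.3f} to {next_high:.3f}")
```

If the intervals overlap heavily, the peak at 5500 words is probably not meaningfully better than its neighbours, and the setting chosen on the dev set may not be the best one on the unlabeled test set.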