October 30th, 2009


Advice on breaking into NLP

Someone on a reading list I'm on asked (I paraphrase): "I'm a linguist without much CS or stats background, but this NLP thing seems really neat and I want to be a REAL computational linguist -- what should I do?"
I wrote the following, which I repost here:
For someone in your position (or for that matter, if you're a stats person or a programmer who thinks that NLP might be cool) I think there are really three things that you'll need to become an "NLP wizard" -- two of them are things you'll need to be a [programming] wizard of any flavor (I use zero-based indexing below to put the CS peeps at ease):
  • [0] Learn the specialty: Make sure you actually understand what the scientific/engineering NLP papers are talking about. A really fantastic place to begin is to splurge sixty or seventy bucks on a copy of Manning and Schütze's "Foundations of Statistical Natural Language Processing". This textbook has two or three introductory chapters for basic linguistics concepts and two or three more for basic math/statistical-learning concepts. In my experience working with both linguists and statistics people, these are all really great introductory chapters; you as a linguist will probably skim the linguistics introduction (you know what a phrase is, what a part of speech is, etc.) but the statistics intro will be most useful (what's an expectation, what's a KL divergence, what do we mean by "distribution", some basic set and probability theory, etc.). Stats people who are interested in NL work should have the converse experience, with the linguistics chapters being useful catch-ups and the stats chapters being review.
  • [1] Learn the languages: Learn at least two programming languages that are used in NLP. Python should probably be one of them, because NLTK is a very accessible (if not always the fastest) collection of libraries for doing the sort of natural language processing research described in Manning and Schütze. I recommend the "Learning XXX" series from O'Reilly; they seem to have found a pretty good formula for working your way through a new programming language. [I personally am most comfortable in Perl, but that's inertial, because I learned Perl before Python even existed.] Learning a second programming language -- like learning a second natural language -- gives you perspective on the first and stretches your conception of what a [programming] language can do, and it may also make you a more valuable hire.
  • [2] Learn to program with others: Programming is not just making the computer do what you want; it's actually a social activity. Your colleague, your manager, your user support team, or your QA team -- among other candidates -- will want to share, modify, read, or improve the code you're writing, and you'll want to do the same with theirs. Steve McConnell's "Code Complete" is a great resource for learning the basics of how to write code for the sake of collaborating with others, whether in a Free Software model or in a for-profit company.
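To give a taste of the kind of thing the statistics chapters in [0] cover, here is a minimal Python sketch of two of the concepts named there: the expectation of a function under a discrete distribution, and the KL divergence (sometimes loosely called KL distance) between two distributions. This is my own toy illustration, not code from the textbook or from NLTK; the tag distributions, the `word_length` table, and the function names are all made up for the example.

```python
import math

def expectation(dist, value=lambda x: x):
    """Expected value of value(x) under a discrete distribution {outcome: prob}."""
    return sum(prob * value(x) for x, prob in dist.items())

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits.

    Assumes q[x] > 0 wherever p[x] > 0; outcomes with p[x] == 0 contribute nothing.
    """
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Hypothetical toy distributions over part-of-speech tags.
p = {"noun": 0.5, "verb": 0.25, "adj": 0.25}
q = {"noun": 1 / 3, "verb": 1 / 3, "adj": 1 / 3}

# Hypothetical numeric feature attached to each tag, just to have something to average.
word_length = {"noun": 4, "verb": 4, "adj": 3}

print(expectation(p, lambda tag: word_length[tag]))  # 3.75
print(kl_divergence(p, q))  # D(p || q)
print(kl_divergence(q, p))  # D(q || p), a different number
```

The last two lines print different values: KL divergence is not symmetric, which is one reason calling it a "distance" is a little loose.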
If I had to guess, I would say that you should take on #0 and #1 first, and once you feel like you have made some progress, start working on #2. But don't wait too long; you don't want to develop too many bad collaboration habits.
Good luck!