December 13th, 2003


Using language models to solve cryptograms

I woke up Friday morning with the thought (it was actually in a dream!) that one could use character n-gram models to determine the relative quality of solutions to cryptograms, without reference to a dictionary or other word lists. Ideally, one could train a system on a sufficient body of text for the language in question, and then be able to rank proposed keys. If the search space turns out to have easily-reached global maxima, then one might be able to solve them without considering all possible cases.

A little further reading on Wikipedia found me a neat page on frequency analysis, but it seems like nobody has done exactly this. I have no illusions that I have found any new ideas of any kind in crypto-analysis, but it seems like it would be a good exercise in language modeling:

(1) build a character-based n-gram for English or any language E
(2) generate all possible cipher relations for a given cryptogram
(3) evaluate them with respect to their perplexity on the n-gram.

Presumably the highest-ranking analyses would be the ones closest to E text. This would be neat because although it would need E text to train on, it would not require a dictionary or a word-frequency table. Now, this is not particularly useful either -- storage is cheap -- but it seems like a neat idea to crack a cryptogram based *only* on frequency analysis.

Since step (2) is factorial or worse in complexity (possibly exponential), it seems like the real problem here is search -- no way it's a good idea to generate all possible cipher relations (26! is a very big number, 4e26). So one would like to try to find the right solution by starting at a random point and walking our cipher-key towards better solutions, using this character metric on "nearby" changes.

What I wonder about is whether there's a useful guarantee of local increases leading to the global maximum or whether it would get stuck at local maxima. If it would get stuck, then the next question is how likely is it that there are only a few local maxima? If there are 10000 local maxima, then that's not much fun, is it. If there are only 3 or 4, then the odds are pretty good that by restarting at random points, you'll encounter most of them.

This is the sort of thing I hope we cover in my Statistical Pattern Recognition class next quarter.