programming in the
twenty-first century

It's not about technology for its own sake. It's about being able to implement your ideas.

A Spellchecker Used to Be a Major Feat of Software Engineering

Here's the situation: it's 1984, and you're assigned to write the spellchecker for a new MS-DOS word processor. Some users, but not many, will have 640K of memory in their PCs. You need to support systems with as little as 256K. That's a quarter megabyte to contain the word processor, the document being edited, and the memory needed by the operating system. Oh, and the spellchecker.

For reference, on my MacBook, the standard dictionary in /usr/share/dict/words is 2,486,813 bytes and contains 234,936 words.

An enticing first option is a data format that's more compressed than raw text. The UNIX dictionary contains stop and stopped and stopping, so there's a lot of repetition. A clever trie implementation might do the trick...but we'll need a big decrease to go from 2+ megabytes to a hundred K or so.
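
To make that prefix sharing concrete, here's a minimal trie sketch in modern Python. The class and helper names are invented for illustration, and a real 1984 implementation would have bit-packed the nodes rather than spending a whole dictionary object on each one:

    class TrieNode:
        def __init__(self):
            self.children = {}    # character -> TrieNode
            self.is_word = False  # True if a word ends at this node

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(root, word):
        node = root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    root = TrieNode()
    for w in ("stop", "stopped", "stopping"):
        insert(root, w)                # "stop" is stored once, as a shared prefix
    print(contains(root, "stopping"))  # True
    print(contains(root, "stoppe"))    # False: a prefix, not a word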

In fact, even if we could represent each word in the spellchecker dictionary as a single byte, we'd need almost all of the 256K just for that: 234,936 words at one byte apiece is roughly 230K. And of course the single-byte representation isn't going to work. So not only does keeping the whole dictionary in RAM look hopeless, but so does keeping the actual dictionary on disk with only an index in RAM.

Now it gets messy. We could try taking a subset of the dictionary, one containing the most common words, and heavily compressing that so it fits in memory. Then we come up with a slower, disk-based mechanism for looking up the rest of the words. Or maybe we jump directly to a completely disk-based solution using a custom database of sorts (remembering, too, that we can't assume the user has a hard disk, so the dictionary still needs to be crunched onto a 360K floppy disk).
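
Sketched as present-day Python, the hybrid approach amounts to a fast in-memory check with a disk read as the fallback. Everything here is hypothetical: the file name, the first-letter index granularity, and the helpers are inventions for illustration, not a period-accurate design:

    def build_index(path):
        """Map each first letter to its (offset, length) span in the file.
        Assumes the on-disk word list is sorted, so each letter's words
        are contiguous."""
        index, offset = {}, 0
        with open(path, "rb") as f:
            for line in f:
                start, length = index.get(line[:1], (offset, 0))
                index[line[:1]] = (start, length + len(line))
                offset += len(line)
        return index

    def check(word, common_words, index, path):
        if word in common_words:              # fast path: RAM only
            return True
        span = index.get(word[:1].encode())   # slow path: one block read
        if span is None:
            return False
        with open(path, "rb") as f:
            f.seek(span[0])
            return word in f.read(span[1]).decode().split()

On 1984 hardware the index would live in a few hundred bytes of RAM and the block read would hit a floppy drive, so the fallback path is slow but, for ordinary text, rarely taken.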

On top of this, we need to handle some other features, such as the user adding new words to the dictionary.

Writing a spellchecker in the mid-1980s was a hard problem. Programmers came up with some impressive data compression methods in response to the spellchecker challenge, and likewise some very clever data structures for quickly finding words in a compressed dictionary. This was a problem that could take months of focused effort to solve. (And, for the record, reducing the size of the dictionary from 200,000+ to 50,000 or even 20,000 words was a reasonable option, but even that doesn't leave the door open for a naive approach.)

Fast forward to today. A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python, depending on how terse you care to be. Looking up a word in this hash table dictionary is a trivial expression, one built into the language. And that's it. Sure, you could come up with some ways to decrease the load time or reduce the memory footprint, but that's icing and likely won't be needed. The basic implementation is so mindlessly trivial that it could be an exercise for the reader in an early chapter of any Python tutorial.
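
Here's one such version in Python, assuming the same /usr/share/dict/words file measured above:

    # The entire modern spellchecker: load the word list into a set
    # (Python's built-in hash table), then test membership.
    with open("/usr/share/dict/words") as f:
        words = set(line.strip().lower() for line in f)

    print("stopping" in words)   # True
    print("stoppe" in words)     # False: not in the word list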

That's progress.

June 8, 2008
