Spelling Suggestion on the JVM #1: Norvig's Approach to Spelling Correction
A Word on Spelling Suggestion
Spelling Suggestion is all about helping document writers spell words correctly.This service is pervasive among editors, word processors and, nowadays, even IDE’s.
We’ve chosen spelling suggestion as a tutorial subject because of how familiar an application domain it is. This application domain also lends itself to showcase modern language constructs such as functional programming, extension methods or type inference, to name a few.
Simple, readable code examples are first presented in Java 9 so as to ensure they’re readily understood; this reduces the cognitive effort of perusing implementations written in not yet familiar languages.
A typo is the mistaken transcription of a valid, _in-dictionary _word. Most typos originate in one or more of the following cases:
deletes: A letter is omitted from the word: dilbert becomes dlbert
inserts: A letter is inserted that doesn't belong in the word: dogbert becomes dogvbert
transposes Two contiguous letters are swapped: alice becomes alcie
replaces: A letter is changed to another letter: wally becomes walky
Note a delete could be amended with an insert. Correspondingly, a transpose could be fixed with an appropriate replace. We have two groups of symmetric edits.
Hence, upon detecting a typo, we can hope to reconstitute the original word by applying an “opposing” edit. This insightful realization is the core of Norvig’s algorithm.
Obviously, we don’t know what letter was omitted or what letters were transposed. Therefore. we need to traverse the entire problem space by brute-force generating all possible opposing edits.
Thus, for example, we can try to amend dlbert with dalbert, dbbert, dcbert, …, dzbert and check to see if one or more of the resulting strings actually occurs in the dictionary (which, happily in this case, happens to appear as dilbert).
The general strategy is, then, to exhaustively apply four edits to the typo in the hope of finding at least one dictionary word that was distorted by the typo.
Statistically, most typos arise from one edit only (which Norvig dubs
edit1). Two-edit typos
edit2), while less frequent, are sufficiently common so as to warrant their own two-pass
attempt at fixing the typo.
edit2 is used only when
edit1 fails to find at least one matching
edit2’s input is the result of a failed
These two scenarios should suffice for the vast majority of typo cases. Of course, it’s still possible for a typo to slip through our crib anyway but we’re not aiming for perfection ;-)
Another frequent type of typo is the joining of two contiguous words by omitting the intervening
space: two-word poor asok mutates to poorasok.
Norvig’s solution to this conundrum is to apply
edit1 and, possibly,
edit2 to all possible
typo word splits. Word split pairs progress from
("", "poorasok") to
("p", "oorasok") to
("poo", "rasok"") all the way down to
Note: for the sake of simplicity we’re not addressing the case where a space is inserted inside a single word, which effectively turns it into two separate words. By operating on entire sentences rather than individual words we can detect the occurrence of two consecutive typos and try to compensate by joining them. If the result is not a dictionary word then we apply vanilla single-word correction to both typos.
Next in our plate: Implementing Norvig’s Algo in Java