Spelling Suggestion on the JVM #2: Implementing Norvig's Algo in Java

This tutorial project implements a basic spelling suggestion service. The motivation is didactic: becoming familiar with “better Java” languages by way of a simple but instructive example. Ease of implementation outranks optimal performance so readers can focus on the JVM languages as such. Examples and new concepts are introduced in Java 9 first so they’re immediately understandable to the experienced Java programmer. Crucially, the Java implementation is accompanied by equivalent (but idiomatic) implementations in today’s most relevant alternative JVM languages: Kotlin, Scala and Xtend. Code is available at Spellbound’s GitHub repo.

This second post illustrates how to implement Norvig’s spelling corrector in Java 9 following a functional style.

The next post Implementing Norvig’s Algo in Kotlin presents an idiomatic Kotlin implementation of Norvig’s spelling corrector.

The previous post (Norvig’s Approach to Spelling Correction) discusses Norvig’s approach to generating typo corrections.

If you haven’t already, taking a cursory look at Norvig’s Python script may be useful in digesting this Java implementation.

The Dictionary

Everything having to do with spelling correction revolves around a curated list of valid words we refer to as the dictionary.

In our dictionary implementation we accompany every word with an integer rank indicating how frequently the word is used in relation to all other words. The lower the rank, the more frequently used the word is. Thus, for instance, the has rank 1 while triose has rank 106295.

The dictionary declaration is:

public class SpellingCorrector {
  private final Map<String, Integer> dictionary;
  // stuff...
}
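
The post doesn’t show how the dictionary gets populated. As a rough sketch (an assumption on our part, not code from the Spellbound repo), the rank map could be built from a word list already ordered from most to least frequent:

// Sketch (assumption): build word -> rank from a list ordered by descending frequency
static Map<String, Integer> buildDictionary(List<String> wordsByFrequency) {
  Map<String, Integer> ranks = new HashMap<>();
  for (int i = 0; i < wordsByFrequency.size(); i++) {
    ranks.put(wordsByFrequency.get(i), i + 1); // rank 1 = most frequent word
  }
  return ranks;
}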

Transforming the typo into a set of word splits

In order to hold word splits we define a nested data class WordSplit as follows:

public class SpellingCorrector {

  // stuff...
  
  static class WordSplit {
    final String left;
    final String right;

    WordSplit(String left, String right) {
      this.left = left;
      this.right = right;
    }
  }
}

Note: no getter/setter or other boilerplate needed. This is an immutable data class visible to the SpellingCorrector and its tests only.

Armed with this definition we can now formulate the rather trivial implementation of splits:

static List<WordSplit> splits(String word) {
  return IntStream
    .rangeClosed(0, word.length()).boxed()
    .map(i -> {
      String left = word.substring(0, i);
      String right = word.substring(i);
      return new WordSplit(left, right);
    })
    .collect(toList());
}

The splits() method takes a string (the typo) and returns all possible word split pairs. Thus, for our beloved but misspelt hero dlbert we get the equivalent of:

List.of(
  new WordSplit("", "dlbert"),
  new WordSplit("dl", "bert"),
  new WordSplit("dlb", "ert"),
  new WordSplit("dlbe", "rt"),
  new WordSplit("dlber", "t"),
  new WordSplit("dlbert", "")
);

By the way, all four edit implementations take a List<WordSplit> argument rather than a simple List<String>. All of them return a Stream<String>, too.

IntStream.rangeClosed(int startInclusive, int endInclusive) generates an IntStream. In our example this means all integers between 0 and 6. Because we want to map each index to a WordSplit object (not a primitive int) we apply boxed() to the range IntStream so it becomes a Stream<Integer>.

Subsequently, each successive integer value is used as a substring index for splitting the word into a pair of left and right fragments. The resulting Stream<WordSplit> is finally collected into a List<WordSplit>.
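
As a quick standalone illustration of the range/boxed mechanics (not part of the corrector itself):

// Prints 0 through 6, one per line, after boxing the primitive ints
IntStream.rangeClosed(0, 6)  // IntStream of 0, 1, 2, 3, 4, 5, 6
  .boxed()                   // Stream<Integer>
  .forEach(System.out::println);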

Edits #1: deletes

Generating all possible deletes for a word is also quite simple:

static Stream<String> deletes(List<WordSplit> splits) {
  return splits.stream()
    .filter(split -> !split.right.isEmpty())
    .map(split -> split.left + split.right.substring(1));
}

Here, we stream over the input List<WordSplit> selecting only the splits whose right fragment is not empty (which is equivalent to all splits but the very last one). Next, we map each word split to a string formed by the left part followed by the “tail” of the right one.

Given the typo waly deletes() yields the equivalent of:

List.of("aly", "wly", "way", "wal");

Edits #2: inserts

Inserts are the opposite of deletes, and their implementation is, again, quite simple:

static Stream<String> inserts(List<WordSplit> splits) {
  return splits.stream()
    .flatMap(split ->
      Arrays.stream(LETTERS).map(letter -> split.left + letter + split.right));
}

In this method our friend LETTERS makes its first appearance. Again, for the sake of simplicity we deal only with lowercase ASCII characters. There is no support for diacritics, though adding it would be simple enough. LETTERS is statically defined as:

static final String[] LETTERS = "abcdefghijklmnopqrstuvwxyz".split("");

Splitting on an empty regex results in each of the string’s characters becoming a one-character string in its own right.
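
A quick illustration of that behaviour:

// "abc".split("") yields ["a", "b", "c"]
String[] chars = "abc".split("");
System.out.println(Arrays.toString(chars)); // prints [a, b, c]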

For our canine friend dgbert inserts() would yield the equivalent of:

List.of(
  "adgbert", "bdgbert", "cdgbert", "ddgbert", "edgbert", "fdgbert",
  "gdgbert", "hdgbert", "idgbert", "jdgbert", "kdgbert", "ldgbert",
  "mdgbert", "ndgbert", "odgbert", "pdgbert", "qdgbert", "rdgbert",
  // lots of inserts elided...
  "dgbertm", "dgbertn", "dgberto", "dgbertp", "dgbertq", "dgbertr",
  "dgberts", "dgbertt", "dgbertu", "dgbertv", "dgbertw", "dgbertx",
  "dgberty", "dgbertz");

Edits #3: transposes

Implementing transposition requires the right fragment to be more than one character long, which makes sense as transposing involves not one but two letters:

static Stream<String> transposes(List<WordSplit> splits) {
  return splits.stream()
    .filter(split -> split.right.length() > 1)
    .map(split ->
      split.left +
        split.right.substring(1, 2) +
        split.right.substring(0, 1) +
        split.right.substring(2));
}

For our heroine alice transposes() yields the equivalent of:

List.of("laice", "ailce", "alcie", "aliec");

Edits #4: replaces

The replace edit is very prolific, requiring the substitution of every letter in the typo by each letter in the alphabet. It also requires the right fragment not to be empty (as something needs to be replaced). Because we’re nesting the processing of each letter inside the processing of every typo position we need to apply flatMap rather than map:

static Stream<String> replaces(List<WordSplit> splits) {
  return splits.stream()
    .filter(split -> !split.right.isEmpty())
    .flatMap(split ->
      Arrays.stream(LETTERS).map(letter ->
        split.left + letter + split.right.substring(1)));
}

For our little friend asok replaces() yields the equivalent of:

List.of(
  "asok", "bsok", "csok", "dsok",
  "esok", "fsok", "gsok", "hsok",
  // quite a few replaces elided...
  "asos", "asot", "asou", "asov",
  "asow", "asox", "asoy", "asoz");

edits1: Applying All Edits Hoping for Valid Words

To cover all our bases we need to run and then concatenate all four edits. For this we define a static list of edits as follows:

private static final List<Function<List<WordSplit>, Stream<String>>> edits =
  List.of(
    SpellingCorrector::deletes,
    SpellingCorrector::transposes,
    SpellingCorrector::replaces,
    SpellingCorrector::inserts
  );

This is an instance of a common functional pattern where code is treated as data. Here, we have a list of functions mapping from List<WordSplit> to Stream<String> (which is, precisely, the type of transformation our four edits implement). In this context Java method references shine as they, effectively, provide us with a data-centric view of executable code.
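
Each entry in edits can be applied like any other function. For example (an illustrative sketch, not code from the post):

// Apply the first edit in the list (deletes) to the splits of a typo
Stream<String> firstEditResults = edits.get(0).apply(splits("dlbert"));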

We make use of this list of functions to implement edits1 as follows:

private List<String> edits1(String typo) {
  // Generate all splits for the typo; each of the four edit operations is
  // applied at every split position
  List<WordSplit> wordSplits = splits(typo);

  // Generate and apply all 4 edits (in parallel) to each split. Packing removes
  // duplicates, ensures result presence in dictionary and orders by rank
  return pack(edits.parallelStream()
      .flatMap(edit -> edit.apply(wordSplits)));
}

Because the four edits are independent of one another we can parallelize their execution. This is handled transparently by Java by means of the Collection.parallelStream() method.

Also, because each edit returns a nested Stream<String> we need to use flatMap instead of map.

What is this pack() method surrounding everything?

As it happens, edits1 is not yet done when all edits have been concatenated.

Once edits1 generates all possible edits we need to screen them for duplicates as well as for actual presence in the dictionary. We also need to order the resulting valid words so the most frequently used ones are shown first. This is all achieved as follows:

private List<String> pack(Stream<String> editResults) {
  return editResults
    // Remove duplicates
    .distinct()
    // Select only words present in dictionary
    .filter(dictionary::containsKey)
    // Sort by word rank so more frequent words show first
    .sorted(Comparator.comparing(dictionary::get))
    .collect(toList());
}

As is customary, results are collected into a List<String> suitable for displaying suggestions to the user.

edits2: When edits1 Falters

Despite generating lots of possible suggestions, edits1 can fail to yield a dictionary word. In this case we assume our typo stems not from one but from two transcription mistakes.

edits2 is quite simply defined in terms of edits1 as follows:

private List<String> edits2(String typo) {
  // Repeatedly apply all 4 edits twice, and in parallel, to each split.
  // Packing removes duplicates, ensures result presence in dictionary and
  // orders by rank
  return
    pack(edits1(typo).parallelStream()
      .flatMap(w -> edits1(w).stream()));
}

Here we see our friend pack() in action once again to ensure word uniqueness, validity and order.

Again, we use flatMap because we generate candidate suggestions from the nested Stream of edits1 results.

Putting it All Together

We’re almost there now. Our workhorse method getCorrections() looks as follows:

public Optional<List<String>> getCorrections(String word) {
  // Ensure word format matches that of the dictionary: lowercase alphabetics
  String normalizedWord = normalize(word);
  // If word occurs in dictionary return no suggestions
  if (dictionary.containsKey(normalizedWord)) {
    return Optional.empty();
  }
  // The correction suggestions to be returned
  List<String> corrections;
  // Suggestions for one-edit typos: most typos contain just one error
  List<String> corrections1 = edits1(normalizedWord);
  // If edit1 yields no dictionary word, let's try with 2 edits.
  // Some typos stem from 2 errors; few come from more than 2
  if (corrections1.isEmpty()) {
    // Apply two-level dictionary word reconstitution
    List<String> corrections2 = edits2(normalizedWord);
    // No results even for 2 edits: return empty list
    if (corrections2.isEmpty()) {
      corrections = emptyList();
    } else {
      // edit2 did produce results, yay!
      corrections = corrections2;
    }
  } else {
    // edit1 did produce results, yay!
    corrections = corrections1;
  }
  // Return (possibly empty) list of suggested corrections
  return Optional.of(corrections);
}
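
Here is a hypothetical usage sketch (the constructor shown is an assumption; the repo’s actual wiring may differ):

// Assumption: SpellingCorrector is constructed from the word-to-rank map
SpellingCorrector corrector = new SpellingCorrector(dictionary);
// "speling" is a typo: a non-empty Optional with suggestions such as "spelling"
Optional<List<String>> suggestions = corrector.getCorrections("speling");
// "spelling" is a valid dictionary word: Optional.empty()
Optional<List<String>> valid = corrector.getCorrections("spelling");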

Note we return Optional<List<String>> rather than a plain List<String>. Why?

We need to express the absence of suggestions not because the typo doesn’t resemble anything known but, on the contrary, because the method was handed a valid dictionary word for which the notion of a suggestion simply doesn’t apply. In this scenario we return Optional.empty().

It could be argued that returning an empty list would suffice to indicate such inapplicability.

In reality, however, it is perfectly possible for a word absent from the dictionary (i.e., a typo) not to resemble any known word. In this case we still need to make it explicit that we’re dealing with a typo. Thus, an empty list is not semantically equivalent to Optional.empty().

Are we there yet?

Yes, we are ;-)

The complete Java implementation has 300+ lines including comments and empty lines. Its neutron star-dense incarnation comes down to 120 lines.

Compare that to Norvig’s feat of packing the whole thing in 44 lines of Python code!

Java’s verbosity is precisely one of the most important reasons alternative JVM languages have emerged. In the next three posts we’ll see how the Kotlin, Scala and Xtend versions improve upon Java in terms of economy of expression and clarity of intent.


Next on our plate: Implementing Norvig’s Algo in Kotlin
