
Add spellchecking

bminixhofer opened this issue 3 years ago

This PR implements spellchecking using the algorithm described in Error-tolerant Finite-state Recognition with Applications to Morphological Analysis and Spelling Correction. I tried multiple other methods:

  • SymSpell, which would probably be faster but needs a significant amount of memory at runtime and requires a precomputation step, thus either increasing the binary size considerably or taking a long time to load.
  • SimHash-based spelling correction, which in my experiments led to low-quality corrections and is encumbered by a patent.

I ended up with the error-tolerant FST retrieval algorithm because it plays well with the portability goals of nlprule by needing very little memory while still being quite fast.
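To give a rough feel for the matching criterion (this is only an illustration; the actual algorithm walks the FST and prunes prefixes whose minimal possible distance already exceeds the bound, instead of scanning the whole dictionary), a bounded edit-distance retrieval over a word list can be sketched like this. All names here are illustrative, not nlprule's API:

```rust
// Sketch: return all dictionary words within `max_dist` edits of the query.
// The error-tolerant FST algorithm achieves the same result without a linear
// scan, by traversing the dictionary automaton and cutting off dead branches.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // classic dynamic program, keeping only the previous row
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn candidates<'a>(dict: &'a [&'a str], query: &str, max_dist: usize) -> Vec<&'a str> {
    dict.iter()
        .copied()
        .filter(|w| levenshtein(w, query) <= max_dist)
        .collect()
}

fn main() {
    let dict = ["spell", "spelling", "spilling", "selling"];
    // "speling" is within one edit of "spelling" only
    println!("{:?}", candidates(&dict, "speling", 1)); // ["spelling"]
}
```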

There are a couple of issues with the current implementation:

  • Edit distance is computed directly on byte sequences, so it is not Unicode-aware. This will only make a difference in a few edge cases for the currently supported languages, but it would be good to have Unicode support.
  • [x] Edit distance only takes substitutions, insertions and deletions into account (i.e. Levenshtein distance), not transpositions (~~Damerau-Levenshtein distance~~ optimal string alignment distance).
  • Keyboard proximity of the expected keyboard for a language variant is not taken into account.
  • [ ] Multi-word spelling correction is not implemented.
  • [ ] Documentation and tests need improvement.
  • [x] CI resources need to be updated.
  • [x] Tokens with action=ignore_spelling in the Disambiguation rules need to be ignored.
  • [x] Fix false positives for hyphenated words.

I still want to address the issues marked with a box in this PR. The others are out of scope.
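For reference, the transposition extension mentioned in the checklist (optimal string alignment distance) adds one case to the Levenshtein recurrence so that swapping two adjacent characters costs one edit instead of two. This is illustrative code, not nlprule's implementation:

```rust
// Optimal string alignment (OSA) distance: Levenshtein plus adjacent
// transpositions. Unlike full Damerau-Levenshtein, each substring may be
// edited at most once, which keeps the recurrence simple.
fn osa_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let (n, m) = (a.len(), b.len());
    let mut d = vec![vec![0usize; m + 1]; n + 1];
    for i in 0..=n { d[i][0] = i; }
    for j in 0..=m { d[0][j] = j; }
    for i in 1..=n {
        for j in 1..=m {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            d[i][j] = (d[i - 1][j] + 1)       // deletion
                .min(d[i][j - 1] + 1)         // insertion
                .min(d[i - 1][j - 1] + cost); // substitution
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                d[i][j] = d[i][j].min(d[i - 2][j - 2] + 1); // transposition
            }
        }
    }
    d[n][m]
}

fn main() {
    // "teh" -> "the" is one transposition under OSA,
    // but two edits under plain Levenshtein distance.
    println!("{}", osa_distance("teh", "the")); // 1
}
```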

It took quite some exploration of different solutions to come up with something I'm satisfied with: the spellchecking-relevant code is now isolated in one module (which makes #50 easier) and is not too heavy on binary size (adds ~2.1MB, 900KB xzipped, for English).

bminixhofer avatar Mar 02 '21 15:03 bminixhofer

There is a SymSpell package in Rust: https://github.com/reneklacan/symspell. Can we benchmark it and check whether the memory requirements are satisfied? What are the hard bounds on memory you're looking to support for this library? And maybe we could integrate that package optionally, if its speed improvement justifies it?

sai-prasanna avatar Mar 12 '21 10:03 sai-prasanna

Hi, I have to admit that I did not do a "proper" benchmark of SymSpell vs. the FST approach. One of the problems with SymSpell is that the precomputed data either needs to be stored in the binary (which blows up its size) or recomputed every time the Rules are loaded (which takes multiple seconds on a good PC). Runtime memory is of course another issue.

What are the hard bounds on memory you're looking to support for this library?

There are no hard bounds on memory. Up until now I have decided trade-offs between speed and memory on a case-by-case basis, with a tendency to prefer speed. (Also, notably, there are actually not that many places where this is relevant; often the best solution is both speed- and memory-efficient.)

One issue with this library currently is that there are two very different use cases:

  1. Using it in an application. Portability, memory efficiency and small binaries are important (i.e. the typical Rust use case).
  2. Using it as a one-time pre-/postprocessing step, as a server, etc. (i.e. the typical Python use case).

So far I've tried to find a middle ground between these two where it was relevant. For spellchecking, the FST-based approach is better for (1), while the SymSpell approach might be better for (2).

And maybe we could integrate that package optionally, if its speed improvement justifies it?

I think this is the best solution. There is an open issue on modularizing the crate (#50) in a way in which integrating a second spellchecker would come quite naturally.

So, in short:

  • This PR and the initial implementation will keep the FST approach
  • I am very open to benchmarks between the FST approach and SymSpell
  • If it turns out that SymSpell is faster, it can be integrated as a second spellchecker once nlprule has better modularization.

bminixhofer avatar Mar 12 '21 12:03 bminixhofer

The API for spellchecking is fully implemented now; this is what it will look like:

Rust

use nlprule::{Rules, Tokenizer};

let tokenizer = Tokenizer::new("path/to/tokenizer.bin").unwrap();
// `Rules` now take an `Arc<Tokenizer>` as second argument, like in the Python API 
// and do not require passing the tokenizer to the other methods anymore
let mut rules = Rules::new("path/to/rules.bin", tokenizer.into()).unwrap();

// by default, spellchecking is disabled
// the spellchecker can be accessed by `.spell()` and `.spell_mut()`
// and contains `SpellOptions` which can be accessed analogously
println!("{:#?}", rules.spell().options()); // prints the default options

// Enables spellchecking. `Variant` is a string newtype denoting valid variants
// the `.variant` method on the spellchecker gets the variant for a language code 
// or returns an error if none exists
rules.spell_mut().options_mut().variant = Some(rules.spell().variant("en_GB").unwrap());

// now `rules.correct` and `rules.suggest` also find spelling mistakes!

// there are some other options like a whitelist for words (the whitelist is actually a `HashSet<String>`)
rules
    .spell_mut()
    .options_mut()
    .whitelist
    .insert("nlprule".into());

// and that's it! there are also some new utility methods like `.variants()` on the spellchecker 
// which are not too important

Python

import pytest
from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer)

# rules.spell is the spellchecker object
# does not do anything useful besides providing the ability to set and get options at the moment
print(rules.spell)

# the used variant is `None` by default, i.e. spellchecking is disabled
assert rules.spell.options.variant is None

# setting the variant to one which does not exist raises an error with a helpful message
with pytest.raises(ValueError):
    rules.spell.options.variant = "en_INVALID"

# setting the variant to a valid variant enables spellchecking
rules.spell.options.variant = "en_GB"

# any iterable works here, but it has to be assigned with `=`; the returned object is not mutable
rules.spell.options.whitelist = ["nlprule"]

assert rules.correct("nlprule has spelllchcking!") == "nlprule has spellchecking!"

I am quite happy with how this turned out. I'd appreciate some feedback regarding the API from people who were previously interested in this (@theelgirl, @drahnr) and of course from anyone else who is reading this. I might have missed something.

As usual, implementing this took more effort than anticipated (LT uses quite a bit more than just one wordlist for spellchecking), but I'm excited to add this functionality! There are some smaller things still left to do to get this over the finish line (see the checklist above). I hope to merge this PR & release next weekend.

bminixhofer avatar Mar 14 '21 17:03 bminixhofer

Note that the review is a first glance and by no means complete, I'll have another go at the impl details tomorrow :crossed_fingers:

drahnr avatar Mar 18 '21 10:03 drahnr

Hi, thanks for taking a look!

I'll have another go at the impl details tomorrow

I was asking about the public API (thanks for your feedback on that!), the PR isn't fully ready for review.

bminixhofer avatar Mar 18 '21 11:03 bminixhofer

Hi, thanks for taking a look!

I'll have another go at the impl details tomorrow

I was asking about the public API (thanks for your feedback on that!), the PR isn't fully ready for review.

Ah, overlooked that. Let me know once you want a more introspective review!

drahnr avatar Mar 18 '21 12:03 drahnr

I will hold off on merging this for some time to explore better modularization first, see https://github.com/bminixhofer/nlprule/issues/50#issuecomment-808761738.

bminixhofer avatar Mar 27 '21 16:03 bminixhofer

let tokenizer = Tokenizer::new("path/to/tokenizer.bin").unwrap();
let mut rules = Rules::new("path/to/rules.bin", tokenizer.into()).unwrap();

Will there be a way to load from in-memory data rather than a path?

I find it both inconvenient and irritating that crates like ttspico don't let me bundle the dependencies into the binary with include_bytes! in situations where the primary goal is ease of deployment and maintainability (e.g. dropping a single binary into ~/bin/ or ~/.cargo/bin, possibly built against a -musl target for extra protection against dependency breakage).

...especially when Cargo's solutions for making sure that non-embedded dependencies wind up where the thing looks for them across all different ways of building and running artifacts aren't the greatest.
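For what it's worth, if the constructors gain (or already have) `from_reader`-style variants over `std::io::Read` — I haven't verified that against the current nlprule API — the single-binary embedding pattern would look roughly like this. `from_reader` and the asset paths are assumptions, not confirmed API:

```rust
use std::io::Cursor;

// Hypothetical sketch: `Tokenizer::from_reader` / `Rules::from_reader` and
// the asset paths are assumptions. The point is only that any `Read`-based
// constructor composes with `include_bytes!` for single-binary deployment.
static TOKENIZER_BIN: &[u8] = include_bytes!("assets/en_tokenizer.bin");
static RULES_BIN: &[u8] = include_bytes!("assets/en_rules.bin");

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = nlprule::Tokenizer::from_reader(Cursor::new(TOKENIZER_BIN))?;
    let rules = nlprule::Rules::from_reader(Cursor::new(RULES_BIN), tokenizer.into())?;
    println!("{}", rules.correct("She was not been here since Monday."));
    Ok(())
}
```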

ssokolow avatar Jul 21 '22 15:07 ssokolow