hunspell-spellchecker Massively speed up the `correct` function.

Massively speed up the `correct` function.

Open FeepingCreature opened this issue 6 years ago • 6 comments

Massively speed up the correct function.

correct is slow because it calls edit1 twice recursive. Especially for long words, this can lead to low millions of words to generate and test.

~~Instead: build a trie of letter histograms for every word in the dictionary. Then, add a function findSimilarWords to find words whose histogram is similar to the one provided plus some changes. Use this to find candidate words for an edit distance of 2 (histogram distance of 4; two replaces). Then do a bidirectional comparison to find the words that actually match.~~

Instead: build a recursive ~~tree~~trie of all the words in the dictionary. Then, add a function findSimilarWords that walks the tree using a provided word as a guide, making changes during iteration, and aborting when the current set of changes leads to a dead end.

Tree initialization is lazily done once correct is called. It takes ~~about two seconds~~ ~~less than a second~~ maybe 400ms. After that, correct is effectively instant (~10ms on en_US).

~~Further ideas for improvements: a proper trie data structure could be used for performance.~~ Done!

Jul 18 '19 11:07 FeepingCreature

rename trie to tree. it's a tree.

Jul 18 '19 13:07 FeepingCreature

On reflection, the histogram part is kind of unnecessary to the algorithm. I'll remove that probably.

Jul 18 '19 15:07 FeepingCreature

Removed. It's even a lot faster to build now! (Maybe three or four times faster.)

Jul 18 '19 16:07 FeepingCreature

It's a trie now.

Jul 18 '19 18:07 FeepingCreature

It's another 25% faster! I think I'll stop here, the code is getting a bit ugly.

Jul 18 '19 20:07 FeepingCreature

great, that would fix #9

Jul 19 '19 05:07 Ulrikop

hunspell-spellchecker hunspell-spellchecker copied to clipboard

Massively speed up the `correct` function.

hunspell-spellchecker
hunspell-spellchecker copied to clipboard