hunspell-spellchecker
hunspell-spellchecker copied to clipboard
Massively speed up the `correct` function.
Massively speed up the correct
function.
correct
is slow because it calls edit1
twice recursive. Especially for long words, this can lead to low millions of words to generate and test.
~~Instead: build a trie of letter histograms for every word in the dictionary. Then, add a function findSimilarWords
to find words whose histogram is similar to the one provided plus some changes. Use this to find candidate words for an edit distance of 2 (histogram distance of 4; two replaces). Then do a bidirectional comparison to find the words that actually match.~~
Instead: build a recursive ~~tree~~trie of all the words in the dictionary. Then, add a function findSimilarWords
that walks the tree using a provided word as a guide, making changes during iteration, and aborting when the current set of changes leads to a dead end.
Tree initialization is lazily done once correct
is called. It takes ~~about two seconds~~ ~~less than a second~~ maybe 400ms. After that, correct
is effectively instant (~10ms on en_US).
~~Further ideas for improvements: a proper trie data structure could be used for performance.~~ Done!
rename trie to tree. it's a tree.
On reflection, the histogram part is kind of unnecessary to the algorithm. I'll remove that probably.
Removed. It's even a lot faster to build now! (Maybe three or four times faster.)
It's a trie now.
It's another 25% faster! I think I'll stop here, the code is getting a bit ugly.
great, that would fix #9