textacy
textacy copied to clipboard
use faster implementations of edit distances
This replaces the usage of jellyfish in the edit distance implementation with rapidfuzz, which provides significantly faster implementations of these metrics. In addition it provides normalized versions of these metrics. The only special handling required here is that rapidfuzz considers two empty strings as similar, while the implementation in textacy considers them not similar.
Hi @maxbachmann, thanks for submitting a PR. I've heard of rapidfuzz but hadn't ever poked around in it! :)
I see that your changes make the textacy code slightly simpler, but to me that's not a compelling reason to switch from an otherwise satisfying dependency. (In fact, I switched from fuzzywuzzy to jellyfish years ago, partially for license reasons.) You claim that your implementations of these distance metrics are faster -- which could indeed be a good reason to switch! Would you be willing to benchmark times on texts of varying lengths, comparing the two implementations?
Sure I have a script for this laying around. It uses random ascii strings of varying lengths (the findings are the same for different unicode ranges).
Levenshtein:

JaroWinkler:

Hamming:

I expected everything except for the extreme performance difference in the hamming distance. This appears to be a performance bug in the new rust backend, since the old C implementation performed quite a bit better:

I reported this to the jellyfish maintainer as well.