
use faster implementations of edit distances

Open maxbachmann opened this issue 2 years ago • 2 comments

This replaces the usage of jellyfish in the edit distance implementation with rapidfuzz, which provides significantly faster implementations of these metrics. In addition, it provides normalized versions of these metrics. The only special handling required is that rapidfuzz treats two empty strings as similar, while the implementation in textacy treats them as not similar.
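To make the empty-string special case concrete, here is a minimal pure-Python sketch. It does not call rapidfuzz or textacy; `levenshtein_distance`, `normalized_similarity`, and `textacy_style_similarity` are hypothetical names, and returning 0.0 for two empty strings is an assumed reading of "not similar".

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Classic dynamic-programming Levenshtein distance with unit costs."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the inner row as short as possible
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(
                min(
                    previous[j] + 1,               # drop a character from s1
                    current[j - 1] + 1,            # add a character to s1
                    previous[j - 1] + (c1 != c2),  # substitute (free if equal)
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(s1: str, s2: str) -> float:
    """1 - distance / max_len. Two empty strings score 1.0, the convention
    the PR description attributes to rapidfuzz."""
    max_len = max(len(s1), len(s2))
    if max_len == 0:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / max_len


def textacy_style_similarity(s1: str, s2: str) -> float:
    """Hypothetical wrapper: same metric, but two empty strings are treated
    as not similar (assumed here to mean a score of 0.0)."""
    if not s1 and not s2:
        return 0.0
    return normalized_similarity(s1, s2)
```

With this sketch, `normalized_similarity("", "")` and `textacy_style_similarity("", "")` differ only in that one edge case; every other input gives the same score from both functions.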

maxbachmann avatar Apr 17 '23 18:04 maxbachmann

Hi @maxbachmann, thanks for submitting a PR. I've heard of rapidfuzz but hadn't ever poked around in it! :)

I see that your changes make the textacy code slightly simpler, but to me that's not a compelling reason to switch from an otherwise satisfying dependency. (In fact, I switched from fuzzywuzzy to jellyfish years ago, partially for license reasons.) You claim that your implementations of these distance metrics are faster -- which could indeed be a good reason to switch! Would you be willing to benchmark times on texts of varying lengths, comparing the two implementations?

bdewilde avatar Apr 19 '23 23:04 bdewilde

Sure, I have a script for this lying around. It uses random ASCII strings of varying lengths (the findings are the same for different Unicode ranges).
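The actual script is not included in this thread. As a rough sketch of that kind of benchmark, assuming a simple `timeit`-based harness: `random_ascii`, `hamming_reference`, and `benchmark` are hypothetical names, and the pure-Python metric is only a stand-in so the example runs without either library installed.

```python
import random
import string
import timeit


def random_ascii(length: int, rng: random.Random) -> str:
    """Random string of ASCII letters with the given length."""
    return "".join(rng.choices(string.ascii_letters, k=length))


def hamming_reference(s1: str, s2: str) -> int:
    """Stand-in metric: mismatched positions plus the length difference."""
    return sum(a != b for a, b in zip(s1, s2)) + abs(len(s1) - len(s2))


def benchmark(func, lengths, pairs_per_length=20, repeats=3, seed=0):
    """Return {length: best observed seconds per call} for func(a, b)."""
    rng = random.Random(seed)
    results = {}
    for n in lengths:
        pairs = [
            (random_ascii(n, rng), random_ascii(n, rng))
            for _ in range(pairs_per_length)
        ]

        def run():
            for a, b in pairs:
                func(a, b)

        # Take the best of several repeats to reduce timing noise.
        best = min(timeit.repeat(run, number=1, repeat=repeats))
        results[n] = best / pairs_per_length
    return results


if __name__ == "__main__":
    for n, secs in benchmark(hamming_reference, [8, 64, 512]).items():
        print(f"len={n:4d}  {secs * 1e6:8.2f} us/call")
```

To compare the two libraries, one would pass each library's distance function as `func` in place of `hamming_reference` and compare the resulting timings per length.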

Levenshtein: [benchmark plot: levenshtein]

JaroWinkler: [benchmark plot: jarowinkler]

Hamming: [benchmark plot: hamming]

I expected all of these results except for the extreme performance difference in the Hamming distance. This appears to be a performance bug in jellyfish's new Rust backend, since the old C implementation performed quite a bit better: [benchmark plot: hamming_old]

I reported this to the jellyfish maintainer as well.

maxbachmann avatar Apr 20 '23 00:04 maxbachmann