fuzzywuzzy
fuzzywuzzy copied to clipboard
process is slow
takes 30 seconds to process 1.1 million product names
the npm library fuzzysort is much faster currently
You should provide examples for people to reproduce your test. Some of the interesting points:
- what's the ratio you used with process
- did you use the faster version using python-Levenshtein
- whats the score of fuzzysort based on
Hi, you could obtain data for example here, split by space into list: https://www.damienelliott.com/wp-content/uploads/2020/07/Lorem-ipsum-dolor-sit-amet.txt
As for the other points,
- I used
token_set_ratio, and though it appears to be the slowest.. only by a factor of 1.5-2x, which meansratioorsimple_ratiois still taking up to tens of seconds. - Yes, I have installed
python-Levenshtein. I dispelled any such warning messages early on. - Not exactly sure, afaik she has implemented a custom in-place algorithm: https://github.com/farzher/fuzzysort/blob/master/fuzzysort.js
I performed a quick test using the following test code:
setup="""
from {} import process, fuzz
with open("Lorem-ipsum-dolor-sit-amet.txt") as fw:
text = fw.read()
words = text.split()
query = words[0]
words = words[1:]
"""
print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("rapidfuzz"), number=1
))
print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None, score_cutoff=80)", setup=setup.format("rapidfuzz"), number=1
))
print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("fuzzywuzzy"), number=1
))
This compares the runtime for FuzzyWuzzy and an improved version of these algorithms from RapidFuzz The runtime I got was:
RapidFuzz: 1.22120649999124 sec
FuzzyWuzzy: 8.835493099992163 sec
So it might be enough for your requirements to use RapidFuzz. FuzzySort appears to use a completely different algorithm, that is not based on the Levenshtein distance. So it might be an option to add a similar algorithm.