fuzzywuzzy icon indicating copy to clipboard operation
fuzzywuzzy copied to clipboard

process is slow

Open gamesguru opened this issue 6 years ago • 3 comments

takes 30 seconds to process 1.1 million product names

the npm library fuzzysort is much faster currently

gamesguru avatar Jun 08 '19 17:06 gamesguru

You should provide examples for people to reproduce your test. Some of the interesting points:

  • what's the ratio you used with process
  • did you use the faster version using python-Levenshtein
  • whats the score of fuzzysort based on

maxbachmann avatar Dec 16 '20 15:12 maxbachmann

Hi, you could obtain data for example here, split by space into list: https://www.damienelliott.com/wp-content/uploads/2020/07/Lorem-ipsum-dolor-sit-amet.txt

As for the other points,

  1. I used token_set_ratio, and though it appears to be the slowest.. only by a factor of 1.5-2x, which means ratio or simple_ratio is still taking up to tens of seconds.
  2. Yes, I have installed python-Levenshtein. I dispelled any such warning messages early on.
  3. Not exactly sure, afaik she has implemented a custom in-place algorithm: https://github.com/farzher/fuzzysort/blob/master/fuzzysort.js

gamesguru avatar Dec 16 '20 16:12 gamesguru

I performed a quick test using the following test code:

setup="""
from {} import process, fuzz

with open("Lorem-ipsum-dolor-sit-amet.txt") as fw:
  text = fw.read()

words = text.split()
query = words[0]
words = words[1:]
"""

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("rapidfuzz"), number=1
))

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None, score_cutoff=80)", setup=setup.format("rapidfuzz"), number=1
))

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("fuzzywuzzy"), number=1
))

This compares the runtime for FuzzyWuzzy and an improved version of these algorithms from RapidFuzz The runtime I got was:

RapidFuzz: 1.22120649999124 sec
FuzzyWuzzy: 8.835493099992163 sec

So it might be enough for your requirements to use RapidFuzz. FuzzySort appears to use a completely different algorithm, that is not based on the Levenshtein distance. So it might be an option to add a similar algorithm.

maxbachmann avatar Dec 16 '20 17:12 maxbachmann