fuzzywuzzy
utils.full_process executed when processor=None
Great and very helpful tool! Thank you!
One thing I noticed is that even when process.extractOne (and the other process functions) is called with processor=None, utils.full_process is still executed several times, probably because of:
https://github.com/seatgeek/fuzzywuzzy/blob/88951621d081095359f37fbf6f282f6e54336a14/fuzzywuzzy/process.py#L100
The following prints the same output twice:
from fuzzywuzzy import process
query = "123 .... "
choices = ["123", query]
print(process.extract(query, choices))
print(process.extract(query, choices, processor=None))
Output:
[('123', 100), ('123 .... ', 100)]
[('123', 100), ('123 .... ', 100)]
I would expect the exact 1:1 match to score better when no processor is used, so something like this:
[('123', 100), ('123 .... ', 100)]
[('123 .... ', 100), ('123', 90)]
In FuzzyWuzzy, the processor argument only lets you add extra preprocessing. It does not provide a way to disable the preprocessing done inside the scorer. So when calling
process.extract(query, choices, processor=None)
the strings are still preprocessed, since the default scorer fuzz.WRatio preprocesses its inputs by default. To disable this, you would have to use (with partial imported from functools and fuzz from fuzzywuzzy):
process.extract(query, choices, processor=None, scorer=partial(fuzz.WRatio, full_process=False))
I agree that this is very counter-intuitive, which is why I use the behavior you expected in RapidFuzz.
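To make the mechanism concrete, here is a minimal stand-in written with only the standard library. It is not FuzzyWuzzy's code: toy_full_process, toy_scorer and toy_extract are hypothetical names, and the scores come from difflib, so they will not match what fuzz.WRatio returns. It only illustrates why processor=None has no effect while the scorer still normalizes its inputs, and why passing full_process=False to the scorer changes the ranking.

```python
import re
from difflib import SequenceMatcher
from functools import partial


def toy_full_process(s):
    """Hypothetical normalizer: lowercase, turn non-alphanumerics into spaces, strip."""
    return re.sub(r"[^0-9a-z]+", " ", s.lower()).strip()


def toy_scorer(s1, s2, full_process=True):
    """Hypothetical scorer that, like the default scorer described above,
    preprocesses its inputs unless full_process=False."""
    if full_process:
        s1, s2 = toy_full_process(s1), toy_full_process(s2)
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def toy_extract(query, choices, processor=None, scorer=toy_scorer):
    """Score every choice against the query, best match first."""
    if processor is None:
        # No extra preprocessing here, but the scorer may still normalize.
        processor = lambda s: s
    scored = [(c, scorer(processor(query), processor(c))) for c in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


query = "123 .... "
choices = ["123", query]

# The scorer still normalizes both strings, so both choices tie.
print(toy_extract(query, choices, processor=None))

# With normalization disabled inside the scorer, the exact match ranks first.
raw_scorer = partial(toy_scorer, full_process=False)
print(toy_extract(query, choices, processor=None, scorer=raw_scorer))
```

In this sketch, the preprocessing lives in the scorer rather than in the extract function, which is why only the scorer's own flag can turn it off.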