
Faulty result of partial ratio (without python-Levenshtein)

Open funytan opened this issue 5 years ago • 3 comments

It is known that the partial_ratio calculation yields incorrect results for some combinations of strings when it uses the python-Levenshtein SequenceMatcher: https://github.com/seatgeek/fuzzywuzzy/issues/79#issue-58664443

However, after removing python-Levenshtein, fuzzywuzzy also returns an incorrect score for certain strings.

> fuzz.partial_ratio('home sweet home', ' home sweet home no.12, fsfaf, fsffs, fsdf fsdf, sf, sfs. jl. home sweet home df.df, sdfds, sdf. sdf, sdf sdf, sdf, sdfdf. df home sweet home no.12, fsdf, sdfd, sdf fdsf, sdf, sdf. jl. home sweet home, fdsf, sdf, sdf sdf, sdf, fdg. jl. home sweet home no.12, gfg, fg, fg fg, df, gfg. gg. df, df, df df, df df, df. df. home sweet home df.12, df, df, df df, df, df. df. home sweet home, df, df. df, df df, df, df. gg. df df. home sweet home, df, df. df, df df, df, df. df. home sweet home no.12, df, df, df df, df, df. df. home sweet home df.df, df, df, df df, df, df')

> 13

Interestingly enough, installing python-Levenshtein gives the correct score of 100.

This problem seems to happen when the comparison is made between a short and much longer string.

Has anyone faced this before?

funytan avatar Feb 17 '20 05:02 funytan

I noticed that if you delete the leading space in the longer string, the expected score of 100 is achieved. I couldn't figure out why. If your goal is to measure similarity against a long string, stripping leading and trailing spaces just might do the trick. PS: I am using the pure-Python SequenceMatcher, not python-Levenshtein.

fuzz.partial_ratio('home sweet home', 'home sweet home no.12, fsfaf, fsffs, fsdf fsdf, sf, sfs. jl. home sweet home df.df, sdfds, sdf. sdf, sdf sdf, sdf, sdfdf. df home sweet home no.12, fsdf, sdfd, sdf fdsf, sdf, sdf. jl. home sweet home, fdsf, sdf, sdf sdf, sdf, fdg. jl. home sweet home no.12, gfg, fg, fg fg, df, gfg. gg. df, df, df df, df df, df. df. home sweet home df.12, df, df, df df, df, df. df. home sweet home, df, df. df, df df, df, df. gg. df df. home sweet home, df, df. df, df df, df, df. df. home sweet home no.12, df, df, df df, df, df. df. home sweet home df.df, df, df, df df, df, df')

Out[32]: 100

aniketcomps avatar Mar 05 '20 08:03 aniketcomps

@aniketcomps thanks! It works fine when deleting the leading space, but when I also removed the first word and the space after it, it fails again! Haha. I'm using the pure-Python SequenceMatcher as well.

fuzz.partial_ratio('home sweet home', 'sweet home no.12, fsfaf, fsffs, fsdf fsdf, sf, sfs. jl. home sweet home df.df, sdfds, sdf. sdf, sdf sdf, sdf, sdfdf. df home sweet home no.12, fsdf, sdfd, sdf fdsf, sdf, sdf. jl. home sweet home, fdsf, sdf, sdf sdf, sdf, fdg. jl. home sweet home no.12, gfg, fg, fg fg, df, gfg. gg. df, df, df df, df df, df. df. home sweet home df.12, df, df, df df, df, df. df. home sweet home, df, df. df, df df, df, df. gg. df df. home sweet home, df, df. df, df df, df, df. df. home sweet home no.12, df, df, df df, df, df. df. home sweet home df.df, df, df, df df, df, df')

Out[773]: 13

funytan avatar Mar 12 '20 10:03 funytan

As I described here: https://github.com/seatgeek/fuzzywuzzy/issues/279, this is most likely caused by the automatic junk heuristic of difflib, which fuzzywuzzy does not deactivate.
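
For anyone who wants to check this, here is a minimal sketch of the effect using only the standard library. `partial_ratio_sketch` is a hypothetical helper that roughly imitates a window-based partial ratio; it is not fuzzywuzzy's actual code, and the test strings are shortened stand-ins for the ones above. The only real API it relies on is the `autojunk` argument of `difflib.SequenceMatcher`, which controls the heuristic that treats frequently repeated items as junk once the second sequence is at least 200 items long.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1, s2, autojunk=True):
    """Hypothetical, simplified partial ratio (not fuzzywuzzy's implementation)."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)

    # Matching blocks between the short and the long string decide
    # which windows of the long string get scored.
    matcher = SequenceMatcher(None, shorter, longer, autojunk=autojunk)

    best = 0.0
    for a_start, b_start, _size in matcher.get_matching_blocks():
        # Align a window of len(shorter) characters with the matching block.
        start = max(b_start - a_start, 0)
        window = longer[start:start + len(shorter)]
        score = SequenceMatcher(None, shorter, window, autojunk=autojunk).ratio()
        best = max(best, score)
    return int(round(100 * best))


short = "home sweet home"
# Leading space plus a highly repetitive body of more than 200 characters,
# so every character in the long string counts as "popular".
long_text = " " + "home sweet home, df. " * 20

with_heuristic = partial_ratio_sketch(short, long_text, autojunk=True)
without_heuristic = partial_ratio_sketch(short, long_text, autojunk=False)

print("autojunk=True :", with_heuristic)
print("autojunk=False:", without_heuristic)  # 100, the exact substring is found
```

With the heuristic enabled, every character of the long string is discarded as popular, so the only block left is the terminating dummy block and the window that gets scored sits at the very end of the long string. That would also be consistent with the leading-space observation above, since the long string then no longer begins with the first character of the short one.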

maxbachmann avatar Sep 01 '20 01:09 maxbachmann