fuzzywuzzy

implemented token_sim_ratio() function with cosine similarity

Open Exquisition opened this issue 4 years ago • 4 comments

Implemented solution to the following issue: https://github.com/seatgeek/fuzzywuzzy/issues/272

token_sim_ratio(s1, s2 ... ) avoids the problems that fuzz.token_sort_ratio(s1, s2...) introduces by lexicographically sorting the tokens of the 2nd string. The similarity is calculated using cosine similarity; other similarity measures (the built-in Levenshtein, Jaro-Winkler, etc.) could be integrated easily.
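For readers who want to see the idea in code, here is a rough sketch of a similarity-based token sort. It is not the code from the PR: the names `cosine`, `similarity_sort`, and `token_sim_ratio_sketch` are made up for illustration, the cosine similarity is assumed to be taken over character counts, and `difflib.SequenceMatcher` stands in for the library's own ratio.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the character-count vectors of two tokens."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_sort(tokens1, tokens2):
    """Reorder tokens2 so each token lines up with its best match in tokens1."""
    remaining = list(tokens2)
    ordered = []
    for t1 in tokens1:
        if not remaining:
            break
        # Greedy choice: the most similar token that has not been used yet.
        best = max(remaining, key=lambda t2: cosine(t1, t2))
        ordered.append(best)
        remaining.remove(best)
    # Tokens of the 2nd string with no counterpart keep their relative order.
    return ordered + remaining


def token_sim_ratio_sketch(s1: str, s2: str) -> int:
    """Hypothetical stand-in for token_sim_ratio (assumption, not the PR code)."""
    t1 = s1.lower().split()
    t2 = similarity_sort(t1, s2.lower().split())
    return round(100 * SequenceMatcher(None, " ".join(t1), " ".join(t2)).ratio())
```

With this sketch, a first-letter typo such as "berry apple" vs "aerry apple" keeps "aerry" aligned with "berry", whereas a lexicographic sort would place "aerry" before "apple" on one side only.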

Exquisition avatar Dec 17 '20 21:12 Exquisition

Love the idea! It addresses one of the main cases where you would get sub-optimal results from this. I was messing around with porting this PR into fuzzball.js.

Wondering... would it work if a version of token_set also used the similarity sort?

Like maybe using the similarity sort here could work?

sorted_2to1 = " ".join(sorted(diff2to1))
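To make the suggestion concrete, here is a sketch of what that could look like. It is an assumption-laden illustration, not library code: `similarity_sort`, `_sim`, and `token_set_sim_ratio_sketch` are hypothetical names, `difflib.SequenceMatcher` stands in both for the token similarity (the PR uses cosine similarity) and for the final ratio, and the preprocessing is just `lower().split()`.

```python
from difflib import SequenceMatcher


def _sim(a: str, b: str) -> float:
    # Stand-in token similarity (assumption); the PR uses cosine similarity.
    return SequenceMatcher(None, a, b).ratio()


def similarity_sort(reference, tokens):
    """Order `tokens` so each one follows its best match in `reference`."""
    remaining = list(tokens)
    ordered = []
    for ref in reference:
        if not remaining:
            break
        best = max(remaining, key=lambda tok: _sim(ref, tok))
        ordered.append(best)
        remaining.remove(best)
    return ordered + remaining


def token_set_sim_ratio_sketch(s1: str, s2: str) -> int:
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    sorted_sect = " ".join(sorted(tokens1 & tokens2))
    diff1to2 = sorted(tokens1 - tokens2)
    # The change under discussion: order diff2to1 by similarity to diff1to2
    # instead of sorting it lexicographically.
    diff2to1 = similarity_sort(diff1to2, sorted(tokens2 - tokens1))
    combined_1to2 = f"{sorted_sect} {' '.join(diff1to2)}".strip()
    combined_2to1 = f"{sorted_sect} {' '.join(diff2to1)}".strip()
    pairs = [
        (sorted_sect, combined_1to2),
        (sorted_sect, combined_2to1),
        (combined_1to2, combined_2to1),
    ]
    return max(round(100 * SequenceMatcher(None, a, b).ratio()) for a, b in pairs)
```

Only the third comparison (combined_1to2 vs combined_2to1) is affected by the new ordering, since the shared tokens are still sorted the usual way.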

Also, partial is handled in _token_sim, but currently it will always be False?

nol13 avatar Mar 07 '21 19:03 nol13

Also, I'm not sure about the maintenance status of this anyway, but the new functions can be added to process.py line 97, or they will miss some optimization. There are probably some other optimizations hidden in there too, if you can, say, avoid recalculating the counters every time.

nol13 avatar Mar 07 '21 19:03 nol13

Haven't tested, but it looks like the order of the arguments might matter too in some cases? Not sure if it would matter enough to try running it both ways.

nol13 avatar Mar 07 '21 22:03 nol13

Was getting good results in testing, so I added experimental support for this in fuzzball.js 1.4 and referenced this PR in the docs. I sorted the arguments by number of tokens or string length before doing the similarity sort; it seemed to make sense to give the shorter one more precedence when sorting, and at least it should be consistent. I have also added an option to use the similarity sort when calculating token_set_ratio.
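A minimal Python sketch of that argument ordering, for illustration only: `order_args` is a hypothetical helper, not a fuzzball.js or fuzzywuzzy function, and the final lexicographic tie-break is an added assumption so that equal-sized inputs also come out in a fixed order.

```python
def order_args(s1: str, s2: str):
    """Return (shorter, longer) so the result does not depend on call order.

    Sort key: token count first, then string length, then the string itself
    (the last tie-break is an assumption, not something stated in the thread).
    """
    def key(s: str):
        return (len(s.split()), len(s), s)

    first, second = sorted((s1, s2), key=key)
    return first, second


# The shorter string would then act as the reference for the similarity sort.
reference, other = order_args("new york mets vs atlanta", "mets york")
```

Calling `order_args(a, b)` and `order_args(b, a)` gives the same pair, which is what makes the score consistent regardless of argument order.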

nol13 avatar Apr 02 '21 01:04 nol13