fuzzywuzzy
Implemented token_sim_ratio() function with cosine similarity
Implemented solution to the following issue: https://github.com/seatgeek/fuzzywuzzy/issues/272
token_sim_ratio(s1, s2, ...) robustly handles the issues associated with the lexicographic sorting of the second string's tokens introduced by fuzz.token_sort_ratio(s1, s2, ...). The similarity is calculated using cosine similarity; other similarity measures (the built-in Levenshtein, Jaro-Winkler, etc.) could be integrated easily.
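For illustration, here is a minimal sketch of the general idea as I read the description, not the code in this PR: lexicographically sort the first string's tokens, reorder the second string's tokens by per-token cosine similarity against them, then score the rejoined strings with cosine similarity. The helper names (_cosine, _similarity_sort, token_sim_ratio_sketch) and the greedy alignment are my assumptions.

```python
import math
from collections import Counter


def _cosine(a, b):
    """Cosine similarity between two strings over character counts (hypothetical helper)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def _similarity_sort(tokens1, tokens2):
    """Greedily reorder tokens2 so each position lines up with its most similar token in tokens1."""
    remaining = list(tokens2)
    ordered = []
    for t1 in tokens1:
        if not remaining:
            break
        best = max(remaining, key=lambda t2: _cosine(t1, t2))
        ordered.append(best)
        remaining.remove(best)
    return ordered + remaining  # any leftover tokens keep their original order


def token_sim_ratio_sketch(s1, s2):
    """Rough stand-in for the PR's token_sim_ratio, under the assumptions above."""
    tokens1 = sorted(s1.split())                      # first string: plain lexicographic sort
    tokens2 = _similarity_sort(tokens1, s2.split())   # second string: similarity sort against it
    return int(round(100 * _cosine(" ".join(tokens1), " ".join(tokens2))))


# e.g. token_sim_ratio_sketch("new york mets", "the wonderful new york mets")
```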
Love the idea! It addresses one of the main cases where you would get sub-optimal results from token_sort_ratio. I was messing around with porting this PR into fuzzball.js.
Wondering... would it work if a version of token_set also used the similarity sort?
Like maybe using the similarity sort here could work?
```python
sorted_2to1 = " ".join(sorted(diff2to1))
```
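Roughly, the suggestion would look something like the following, reusing the hypothetical _similarity_sort helper from the sketch above; this paraphrases the token_set-style set-difference logic rather than quoting the library's actual code.

```python
def _token_set_sim_sort_sketch(s1, s2):
    """Token-set-style comparison where diff2to1 is similarity-sorted against diff1to2."""
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    sorted_sect = " ".join(sorted(tokens1 & tokens2))
    diff1to2 = sorted(tokens1 - tokens2)
    # existing:   sorted_2to1 = " ".join(sorted(diff2to1))
    # suggestion: order diff2to1 against diff1to2 with the similarity sort instead
    diff2to1 = _similarity_sort(diff1to2, sorted(tokens2 - tokens1))
    combined_1to2 = (sorted_sect + " " + " ".join(diff1to2)).strip()
    combined_2to1 = (sorted_sect + " " + " ".join(diff2to1)).strip()
    return combined_1to2, combined_2to1
```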
Also, partial is handled in _token_sim, but currently it will always be False?
Also, I'm not sure of the maintenance status of this project anyway, but you could add the new functions to process.py line 97, or they will miss some optimization. There are probably some other optimizations hidden in there too, if you can, say, avoid recalculating the counters every time.
Haven't tested, but it looks like the order of the arguments might matter in some cases too? Not sure if it would matter enough to be worth running it both ways.
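If the argument order does change the score, one simple (hypothetical) workaround would be to run both orderings and keep the higher value, using the token_sim_ratio_sketch stand-in from above:

```python
def symmetric_token_sim_ratio(s1, s2):
    # score both argument orders and return whichever is higher
    return max(token_sim_ratio_sketch(s1, s2),
               token_sim_ratio_sketch(s2, s1))
```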
I was getting good results in testing, so I added experimental support for this into fuzzball.js 1.4! Referenced this PR in the docs. I sorted the arguments by number of tokens or string length before doing the similarity sort; it seemed to make sense to give the shorter one more precedence when sorting, and at least the result should be consistent. I have also added an option to use the similarity sort when calculating token_set_ratio.
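A rough Python equivalent of that argument-ordering tweak (the real change is in fuzzball.js, so this is only a sketch of the idea):

```python
def _shorter_first(s1, s2):
    # Put the string with fewer tokens (shorter length on a tie) first, so it
    # drives the similarity sort and the result is the same for either order.
    key = lambda s: (len(s.split()), len(s))
    return tuple(sorted((s1, s2), key=key))


# s1, s2 = _shorter_first(s1, s2) before running the similarity sort
```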