fuzzywuzzy-rs icon indicating copy to clipboard operation
fuzzywuzzy-rs copied to clipboard

Implement optimal string alignment algorithm

Open logannc opened this issue 5 years ago • 2 comments

See https://github.com/maxbachmann/rapidfuzz/issues/13 for generous details from another fuzzywuzzy compatible project author.

Essentially, partial_ratio attempts to align strings optimally, then take the ratio of the aligned string subsets. The method of alignment used in fuzzywuzzy is.. pretty bad, actually.

Implement Smith-Waterman for sequence alignment to power partial_ratio. Legacy partial_ratio behaviour will be relegated to a compatibility function (which, if there are more, might be put into a compatibility module).

logannc avatar Oct 06 '20 03:10 logannc

Example of a random python Smith-Waterman alignment:

>>> sw.align("what about supergreatfantastic theatre", "superduperfantastic theater about what").dump()
Query:  1 superdupe-rfantastic theater 27
          ||||| ..| .||||||||||||||| |
Ref  : 12 super-greatfantastic theat-r 37

Score: 38
Matches: 22 (78.6%)

This seems clearly superior to the existing alignment algorithm.

logannc avatar Oct 11 '20 04:10 logannc

In particular when comparing the following:

>>> sw.align("what about superduperfantastic", "superduperfantastic about what").dump()
Query:  1 superduperfantastic 19
          |||||||||||||||||||
Ref  : 12 superduperfantastic 30

Score: 38
Matches: 19 (100.0%)
Mismatches: 0
CIGAR: 19M
>>> partial_ratio("what about superduperfantastic", "superduperfantastic about what")
63

logannc avatar Oct 11 '20 04:10 logannc