RapidFuzz icon indicating copy to clipboard operation
RapidFuzz copied to clipboard

partial_ratio breaks when short string is tool long

Open rfara opened this issue 2 years ago • 1 comments

Hi,

Thanks for writing rapidfuzz - it's been really helpful for me.

I noticed some unexpected behaviour in the partial_ratio function. In my case, when the length of the shorter string exceeds 225, then partial_ratio returns a much lower value. Here's my example (please excuse the very long strings):

from rapidfuzz import fuzz
long_string = 'TTAGCGCTACCGGTCGCCACCATGGTTTTCTAAGGGGAGGCCGTCATCAAAAGAGTTCATGTAGCACGAAGTCCACCTTTGAAGGATCGATGAATGGCCATGAATTCGAAATCGAGGGGAGGGCGAGAGAGGGCCGGCCTTACGAGGGCACACCCAAACTGCCAAACTGAAAGTGACCAAAGGCGGCCCGTTACCATTCTCCTGGGACATACTGTAAGTGCATGGCACCACGCTCTATTTCTTAAAAAAAGTGTAGGGTCTGGCGCCCTCGGGGGCGGCTTAGGAAAAGAGGCCTGACCAATTTTTGTCTCTTATAGGTCACCACAGTTCATGTACGGAAGCAGAGCGTTCACGAAGCACCCAGCTGACATCCCGGACTACTATGACAGAGCTTCCCGGAAGGACTCAAGTGGGAGCGGGTCATGAACTTCGAGGACGGTGGGGCAGTGACTGTGACACAGGACACCAGCCTGAAGATGGAACTCTTATCTACAAAGTAAAGCTAAGAGGAACCAACTTCCCGCCAGATGGGCCCGTTATGCAAAAGAAAACGATGGGGTGGGAAGCTTCTGCAGAGCGCCTTTACCCCGAGGATGGCGTCCTTAAGGGGGATATCAAAATGGCGCTACGCCTTAAGGATGGAGGCAGATATTTGGCAGACTTCAAAACAACATTACAAGGCGAAGAAGCCAGTCCAGATGCCTGGAGCTTGCAATGGTAAGCACCTCTGCCTGCCCCGCTAGTTGGGTGTGAGTGGCCCAGGCAGCCGCCTGCATTTAGCTCTAGCCGGGGTACGGGTGCCCCTTGATGCCTGAGGCCTCTCCTGTGGCTGAGGCGACTGGCCCAGAGTCTGGGTCTCCTCGAGGGTGGCCATCTGGCGTCACCTGTCATCTGCCACCTCTGACCCCTGCCTCTCTCCTCACAGTTGACCGGAAGCTCGACATAACGAGTCACAACGAGGACTACACAGTTGTCGAGCAGTACGAACGTTCCGAGGGTCGACACTCAACTGGCAGGATGGATGAGCTTTTACAAAGGGCGGGGGCGGAGGAAGCGGAGGAGGAGGAAGTGGTGGAGGAGGCTCGAAAGGTAAGTATCAGGGTTGCAGCGTTTCTCTGACCTCATATTCCAATGGATGTGTGAGAAGCATAGTGAGATCCGTTTACCCCTTTTGCTCAATTCTCACGTGGCTGTAGTCGTGTTTATAAGTCTGATCGTAATGGCAGCTTGGTCTGCGTGCCTTGAAATTGTGGCCCCCACATGCATAATAAACGATCCTCTAGCACTACTTTCTGTCGAGCCACCTCAGCGCCCGTACAGTAATGTCTACAGCGCGTCTAACCCGACAAATGCGTTTCTTTCTCTCCTAGAACGAAAGATTACGGATCACAGAAACGTCTCGGAAAGTCCAAATAGAAAGAACGAGAAAGAAGAAAGTGAAGGATCACAAGAGCAACTCGAAAGAAAGAGACATAAGAAGGAACTCAGAAAAGGATGACAAGTATAAAAACAAAGTGAAGAAAAGAGCGAAGAGCAGAGTAGAAGCAAGAGTAAAGAGAAGAAGAGCAAATCGAAGGAAAGGTAAGTGGCTTTCAAGAACATTGGTAAAACGTCATGTGTATTGCGGTTCCATGCTTACACAAATTCGTTCGCTTGTTTTCAGGGACTCGAAACACAACAGAAACGAAGAGAAGAGAATGAGAAGCAGAAGCAAAGGAAGAGACCATGAAAATGTCAAGGAAAAAGAAAAAACAGTCCGATAGCAAAGGCAAAGACCAGGAGCGGTCTCGGTCGAAGGAAAAATCTAAACAACTTGAATCAAAATCTAACGAGCATGGTAAGTTCGCGAGACACTAAGTTGATTCTTAGTGTTTAGACGTGAAACTCCCTTGGAAGGTTTAACGAATACTGTTAATATTTTCAGATCACTCAAAATCCAAAAGAACCGACGGGCACAATCCCGGAGCCGTGAATGTGATATAACCAAGGAAGCACAGTTGCAATTCGAGAACAAGAGAAAGAAGCAGAAGTAGAGAGATCGCTCGAGAAGAGTGAGAAGCAGAACACATGATAGAGACAGAAGCCGGTCGAAAGAATACCACCGCTACAGAGAACAAGGTAAGCATGACTACTTGAGTGTAAATACGTTGTGATAGAGATGAAAAACAAAACCGAACATTACTTTGGGTAATAATTAACTTTTTTTTAATAGAATATCGGGAGAAAGGAAGGTCGAGAAGCAGAGAAAGAAGGACGCCTCAGGAAGAAGCCGTTCGAAAGACAGAAGGAGAAGGAGAAGAGATTCGAAAGTTCAGAGCGTGAAGAGTCTCAATCGCGTAATAAAGACAAGTACGGGAACCAAGAAAGTAAAAGTTCCCACAGGAAGAACTCTGAAGAGCGAGAAAAGTAAAAAAGGGTTTCCTGTTTTTTGCCTATTTTGGGTAAAGGGGTTGATGGAGAAACAGGTGTGTGGACTGCTGAGGAGTGAGTTAGAATAAATGGTGGTATCACTTCTTCAATGCTACTACAATGGAACAACAGTCGTTACCTGTTTTAAGTTCGTGGCGTCTTATGCTCCGGACAGGGACAGATAGGCGGTTGACAGAGAGTTAAGATCTAGTACACTGGGTTTCCTAAATGTAAGAATTGGCCCGAATCCGGCCTAATATGCGAACTTTGTGCTACCAAGCGAGCGGGAAGCTAAGGGTGGGGAATTGCGGGTTTAATGGACCATCTCATGAGTCTAGCAGTTAATGTATCCTATCTTCCAAACAGGAATGTATTCGAAAGAGTAGAGACCATAATTCGTCTAACAACTCAAGGAAAAGAAGGCGGAGTAGAGCCGATTCCGAACCCTTTGCTAGGACTAGATAGCACGTGAACCTAGACTGTCTCTGAGACTGCGCCATTACGTCTCGATCAGTAACGATTGCATCGCGAGGCTGTGGATGTAAAACCTCTGCTGACCTTGACTGACTGAGATACAATGCCTTCAGCAATGCGTGGCAG'

short_string = 'GTAAGGGTTTCCTGTTTTTTGCCTATTTTGGGTAAAGGGGGGTTGATGGAGAAACAGGTGTGTGGACTGCTGAGGAGTGAGTTAGAATAAATGGTGGTATCACTTCTTCAATGCTACAATGGAACAACAGTCGTTACCTGTTTTAAGTTCGTGGCGTCTTATGCTCCGGACAGGGACAGATAGGCGGTTAGACAGAGAGTTAAGATCTAGTACACTGGGTTTCCTAAATGTAAAAATTGGCCCGAATCCGGCCTAATATGCGAACTTTGTGCTACCAAGCGAGCGGGAAGCTAAGGGTGGGGAGTGCGGGTTTAATGGACCATCTCGCAGGTCTAGCAGTTAATGTATCCTATCTTCCAAACAG'

print(fuzz.partial_ratio(long_string, short_string))

print(fuzz.partial_ratio(long_string, short_string[:225]))

print(fuzz.partial_ratio(long_string, short_string[:226]))

This returns:

64.01098901098901
97.77777777777777
65.04424778761062

The value of 97.77777 is what I'd expect, as the short string aligns to the long string with ~10 indels and is of length ~350. I can simply cap the length of my short strings and my analysis will work just fine. It might be useful to have some kind of warning when strings become long, though, if it's difficult to change the implementation to make partial_ratio work for long strings.

I'm using rapidfuzz 2.0.11 in a Jupyter notebook running python 3.9.12.

Thanks!

rfara avatar May 25 '22 13:05 rfara

I just looked into this. This is a known issue, I am not sure how to handle better yet. partial_ratio in rapidfuzz uses two different implementations depending on the string lengths:

  1. for text lenghts <= 64 of the shorter string, I calculate all similarities and only skip the ones that are guaranteed to lead to worse results.
  2. for text lenghts > 64 of the shorter string I use the same implementation as fuzzywuzzy, which searches for the longest common substrings and then uses those as starting points for the similarity calculations

Variant 1) is guaranteed to have the correct results and is the faster option for short strings. However for longer strings the Levenshtein calculation becomes very slow. Variant 2) is faster for these longer strings (which should be the reason it is used in fuzzywuzzy). However it can have incorrect results in edge cases, since the optimal alignment does not have to start at one of the longest common substrings.

I do provide a function partial_ratio_alignment, which shows these different alignments. Note that this is not linked to the length being > 255, but to the string content:

>>> fuzz.partial_ratio_alignment(long_string, short_string[:225])
ScoreAlignment(score=97.77777777777777, src_start=2410, src_end=2635, dest_start=0, dest_end=225)
>>> fuzz.partial_ratio_alignment(long_string, short_string[:226])
ScoreAlignment(score=65.04424778761062, src_start=1006, src_end=1232, dest_start=0, dest_end=226)
>>> fuzz.partial_ratio_alignment(long_string, short_string[2:226])
ScoreAlignment(score=65.17857142857143, src_start=1008, src_end=1232, dest_start=0, dest_end=224)

the same issue exists in fuzzywuzzy even though in this particular case it never finds the optimal alignment:

from fuzzywuzzy import fuzz

>>> fuzz.partial_ratio(long_string, short_string)
66
>>> fuzz.partial_ratio(long_string, short_string[:225])
64
>>> fuzz.partial_ratio(long_string, short_string[:226])
64

even though the optimal alignment in both cases would be:

>>> fuzz.partial_ratio(long_string[2410:2636], short_string[:226])
97.78761061946902
>>> fuzz.partial_ratio(long_string[2410:2635], short_string[:225])
97.77777777777777

A simple fix would be to always use Variant 1. However this would change the runtime from O(N*M) to O(N*N*M), which really hurts for long strings.

maxbachmann avatar May 25 '22 14:05 maxbachmann

I updated my implementation to find the correct results in all cases even for needles with a length over 64. In addition I improved the performance. E.g. the similarity for your example string is now calculated 4x faster.

maxbachmann avatar Sep 11 '22 16:09 maxbachmann