Weird Behavior Caused by abbreviation_scale

Open: foggydae opened this issue 5 years ago • 1 comment

Hi,

I don't know whether this is a bug or intentional, but here is an example of some weird behavior:

>>> affinegap.affineGapDistance("TED A", "TD A",
...       matchWeight = 0, spaceWeight = 2, gapWeight = 10,
...       abbreviation_scale = 0.125)
12.0 # Correct. Open a gap and insert a space for "E".

>>> affinegap.affineGapDistance("TESD A", "TD A",
...       matchWeight = 0, spaceWeight = 2, gapWeight = 10,
...       abbreviation_scale = 0.125)
14.0 # Correct. Continue the gap and insert a space for "S".

>>> affinegap.affineGapDistance("TESTD A", "TD A",
...       matchWeight = 0, spaceWeight = 2, gapWeight = 10,
...       abbreviation_scale = 0.125)
16.0 # Correct. Continue the gap and insert a space for "T".

>>> affinegap.affineGapDistance("TESTED A", "TD A",
...       matchWeight = 0, spaceWeight = 2, gapWeight = 10,
...       abbreviation_scale = 0.125)
16.25 # Weird.
      # The second "E" is at position 5, which is greater than the
      # length of the second string (4), so the cost of the additional
      # space for "E" is scaled by abbreviation_scale: 2 * 0.125 = 0.25.

I believe this is triggered, perhaps unintentionally, by https://github.com/dedupeio/affinegap/blob/853f3d3d02d9a9adc1ec92dd9448949f51748e87/affinegap/affinegap.pyx#L79.
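To make the behavior easier to poke at, here is a minimal pure-Python sketch of an affine-gap recurrence with a position-based discount. It is not the library's code: the function name `affine_gap_sketch`, the `mismatch_weight` default of 11, and the assumption that the gap-opening weight is discounted the same way as the space weight are all my own reading of the behavior described above, so treat it as an illustration only.

```python
def affine_gap_sketch(a, b, match_weight=0.0, mismatch_weight=11.0,
                      gap_weight=10.0, space_weight=2.0,
                      abbreviation_scale=0.125):
    """Hypothetical re-creation of the scoring described in this issue."""
    # Treat `a` as the longer string, `b` as the shorter one.
    if len(a) < len(b):
        a, b = b, a
    n, m = len(a), len(b)
    inf = float("inf")

    # Row 0: every prefix of `a` is a single, undiscounted insertion run.
    v_cur = [0.0] + [gap_weight + space_weight * j for j in range(1, n + 1)]
    d = [inf] * (n + 1)  # best cost ending in a deletion, per column

    for i in range(1, m + 1):
        v_prev = v_cur
        v_cur = [gap_weight + space_weight * i] + [0.0] * n
        ins = inf  # best cost ending in an insertion, current row
        for j in range(1, n + 1):
            # Assumed rule: columns beyond len(b) are charged at a discount.
            scale = 1.0 if j <= m else abbreviation_scale
            ins = min(ins, v_cur[j - 1] + gap_weight * scale) + space_weight * scale
            d[j] = min(d[j], v_prev[j] + gap_weight) + space_weight
            same = a[j - 1] == b[i - 1]
            sub = v_prev[j - 1] + (match_weight if same else mismatch_weight)
            v_cur[j] = min(ins, d[j], sub)
    return v_cur[n]


for s in ("TED A", "TESD A", "TESTD A", "TESTED A"):
    print(s, affine_gap_sketch(s, "TD A"))
```

With the weights from this issue, the sketch returns 12.0, 14.0, 16.0 and 16.25 for the four inputs. The discount depends only on the column index `j`, not on whether the gap actually sits at the end of the longer string, which is why a gap in the middle of "TESTED A" gets the cheaper rate.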

It would be great if you could take a look. Thanks.

foggydae, May 29 '19 15:05

I'm not super familiar with the algorithm, sorry. Can you explain a little more? What is the expected match pattern? 1:

TESTED A
TxxxxD A

or 2:

TESTED A
yyyTzD A

Also, can you explain how you think it's calculating that 16.25?

NickCrews, Sep 19 '22 17:09
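For comparing the two candidate patterns, the arithmetic can be sketched as below. The helper `gap_run_cost` is hypothetical, and it assumes the rule described in the issue: any position in the longer string beyond the length of the shorter string is charged at `abbreviation_scale`, for both the gap-opening weight and the space weight. Matches cost 0 here because `matchWeight = 0`.

```python
def gap_run_cost(start, end, shorter_len,
                 gap_weight=10.0, space_weight=2.0, scale=0.125):
    """Cost of one contiguous gap over positions start..end (1-based)
    of the longer string, under the assumed position-based discount."""
    cost = 0.0
    for j in range(start, end + 1):
        factor = 1.0 if j <= shorter_len else scale
        if j == start:
            cost += gap_weight * factor   # opening the gap
        cost += space_weight * factor     # each gapped character
    return cost


shorter_len = len("TD A")  # 4

# Pattern 1: TxxxxD A, one gap over positions 2..5 ("ESTE").
pattern_1 = gap_run_cost(2, 5, shorter_len)

# Pattern 2: yyyTzD A, gaps over positions 1..3 ("TES") and 5 ("E").
pattern_2 = gap_run_cost(1, 3, shorter_len) + gap_run_cost(5, 5, shorter_len)

print(pattern_1, pattern_2)
```

Under these assumptions pattern 1 costs 10 + 2 + 2 + 2 + 0.25 = 16.25 and pattern 2 costs 16 + 1.5 = 17.5, so pattern 1 would be the one selected, and the 0.25 comes from the second "E" falling at position 5.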