python-Levenshtein icon indicating copy to clipboard operation
python-Levenshtein copied to clipboard

Seems to give wrong answer for matching_blocks()

Open ojomio opened this issue 10 years ago • 5 comments

If I use this in python list(SequenceMatcher(None, 'asfdsffdfd abcd', 'abcr').get_matching_blocks()) I get this result [(0, 0, 1), (12, 1, 2), (15, 4, 0)] which is clearly wrong, because the longest match is (11, 0, 3), not (12, 1, 2)

ojomio avatar Feb 20 '15 07:02 ojomio

this means that the a was taken from the beginning, and bc in the middle; as far as I understand the documentation of the Levenshtein there is no guarantee or claim anywhere that a block needed to be the longest continuous; it is based on the editops calculations, that consider inserting'sfdsffdfd a' identical in cost to adding 'asfdsffdfd ' to the beginning; the editops then says that one of these is chosen arbitrarily.

ztane avatar Feb 20 '15 13:02 ztane

Problem could come from editops_from_cost_matrix: The if statement from line 5691 https://github.com/ztane/python-Levenshtein/blob/3a7412f38f1991c20d7a2765f30c2bb9cb1e63e0/Levenshtein/_levenshtein.c#L5691-L5699 should be moved to be the first if statement (moved to line 5675). Hope this helps

BobLd avatar Apr 14 '20 19:04 BobLd

Can confirm that @BobLd's suggestion fixes the problem.

burdoto avatar Apr 15 '20 09:04 burdoto

@BobLd this does not really fix the issue. It just shifts it, so it works for this explicit case. As an example

SequenceMatcher(None, "no", "bnonco").get_matching_blocks()

currently works, but breaks when moving the if statement to the top

maxbachmann avatar Nov 15 '20 14:11 maxbachmann

As a note: This is not a bug, since python-Levenshtein does not provide any guarantee, that matching_blocks includes the longest common substring. It bases the matching blocks on one of possibly multiple optimal alignments. Finding this alignment would require a search through all optimal alignments, which is simply not done in the library. Note that this is stated in the documentation of opcodes:

The result is a list of 5-tuples with the same meaning as in SequenceMatcher's get_opcodes() output. But since the algorithms differ, the actual sequences from Levenshtein and SequenceMatcher may differ too

maxbachmann avatar Jan 18 '22 11:01 maxbachmann