Diffing
Diffing copied to clipboard
Backtracking algorithm does not produce longest common substring diff when collections contain duplicated elements
Unit test:
func testLCSubstr() {
// `after` alignment represents current behaviour. Proper LCSubstr result should remove elements at offsets 0...9
let before = ["c", "a", "l", "l", " ", "t", "h", "i", "s", " ", "s", "o", " ", "w", "e", " ", "c", "a", "n", " ", "o", "v", "e", "r", "r", "i", "d", "e"]
let after = [ "s", "o", " ", "w", "e", " ", "c", "a", "n", " ", "o", "v", "e", "r", "r", "i", "d", "e"]
let diff = after.difference(from: before)
let expected = OrderedCollectionDifference<String>([
.remove(offset: 0, element: "c", associatedWith:nil),
.remove(offset: 1, element: "a", associatedWith:nil),
.remove(offset: 2, element: "l", associatedWith:nil),
.remove(offset: 3, element: "l", associatedWith:nil),
.remove(offset: 4, element: " ", associatedWith:nil),
.remove(offset: 5, element: "t", associatedWith:nil),
.remove(offset: 6, element: "h", associatedWith:nil),
.remove(offset: 7, element: "i", associatedWith:nil),
.remove(offset: 8, element: "s", associatedWith:nil),
.remove(offset: 9, element: " ", associatedWith:nil)
])
XCTAssertEqual(diff, expected)
}
The reporter was expecting diffing to produce a longest common substring solution, which I feel is reasonable when diffing strings.
Unit test produces search state:
[(x:0,y:0)],
[(x:0,y:1), (x:1,y:0)],
[(x:0,y:2), (x:1,y:1), (x:2,y:0)],
[(x:0,y:3), (x:1,y:2), (x:2,y:1), (x:3,y:0)],
[(x:0,y:4), (x:1,y:3), (x:2,y:2), (x:3,y:1), (x:4,y:0)],
[(x:0,y:5), (x:1,y:4), (x:2,y:3), (x:3,y:2), (x:4,y:1), (x:5,y:0)],
[(x:2,y:8), (x:1,y:5), (x:2,y:4), (x:3,y:3), (x:5,y:3), (x:5,y:1), (x:6,y:0)],
[(x:2,y:9), (x:3,y:8), (x:2,y:5), (x:3,y:4), (x:5,y:4), (x:6,y:3), (x:6,y:1), (x:7,y:0)],
[(x:2,y:10), (x:3,y:9), (x:4,y:8), (x:3,y:5), (x:5,y:5), (x:6,y:4), (x:7,y:3), (x:7,y:1), (x:9,y:1)],
[(x:2,y:11), (x:3,y:10), (x:5,y:10), (x:5,y:8), (x:5,y:6), (x:6,y:5), (x:7,y:4), (x:8,y:3), (x:10,y:3), (x:10,y:1)],
[(x:2,y:12), (x:3,y:11), (x:5,y:11), (x:6,y:10), (x:6,y:8), (x:6,y:6), (x:7,y:5), (x:8,y:4), (x:10,y:4), (x:11,y:3), (x:nil,y:nil)]
…which doesn't encode the longest substring solution at all. That would look more like:
[(x:0,y:0)],
[(x:#,y:#), (x:1,y:0)],
[(x:#,y:#), (x:#,y:#), (x:2,y:0)],
[(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:3,y:0)],
[(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:4,y:0)],
[(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:5,y:0)],
[(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:6,y:0)],
[(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:7,y:0)],
[(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:8,y:0)],
[(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:9,y:0)],
[(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:nil,y:nil)]
The matching element is throwing Myers' off. Reduced test case:
func testLCSubstr() {
// `after` alignment represents current diff behaviour
// result should remove element at offset 0
// for a longest common substring of length 2
let before = ["s", "s", "o"]
let after = ["s", "o"]
let diff = after.difference(from: before)
let expected = OrderedCollectionDifference<String>([
.remove(offset: 0, element: "s", associatedWith:nil)
])
XCTAssertEqual(diff, expected)
}
This problem appears to be a direct result of the greedy nature of the algorithm—Myers' is broken by collections containing duplicated elements. Unfortunately, identifying duplicate elements is very expensive when using a user-defined equality function (or when elements are not at least Comparable or Hashable).