Backtracking algorithm does not produce longest common substring diff when collections contain duplicated elements

Open numist opened this issue 7 years ago • 1 comments

Unit test:

    func testLCSubstr() {
        // `after` alignment represents current behaviour. Proper LCSubstr result should remove elements at offsets 0...9
        let before = ["c", "a", "l", "l", " ", "t", "h", "i", "s", " ", "s", "o", " ", "w", "e", " ", "c", "a", "n", " ", "o", "v", "e", "r", "r", "i", "d", "e"]
        let after =  [                                        "s",           "o", " ", "w", "e", " ", "c", "a", "n", " ", "o", "v", "e", "r", "r", "i", "d", "e"]
        let diff = after.difference(from: before)
        let expected = OrderedCollectionDifference<String>([
            .remove(offset: 0, element: "c", associatedWith:nil),
            .remove(offset: 1, element: "a", associatedWith:nil),
            .remove(offset: 2, element: "l", associatedWith:nil),
            .remove(offset: 3, element: "l", associatedWith:nil),
            .remove(offset: 4, element: " ", associatedWith:nil),
            .remove(offset: 5, element: "t", associatedWith:nil),
            .remove(offset: 6, element: "h", associatedWith:nil),
            .remove(offset: 7, element: "i", associatedWith:nil),
            .remove(offset: 8, element: "s", associatedWith:nil),
            .remove(offset: 9, element: " ", associatedWith:nil)
        ])
        XCTAssertEqual(diff, expected)
    }

The reporter was expecting diffing to produce a longest common substring solution, which I feel is reasonable when diffing strings.

Dec 18 '18 21:12 numist

Unit test produces search state:

                                                     [(x:0,y:0)],
                                                [(x:0,y:1), (x:1,y:0)],
                                           [(x:0,y:2), (x:1,y:1),  (x:2,y:0)],
                                      [(x:0,y:3), (x:1,y:2),  (x:2,y:1), (x:3,y:0)],
                                 [(x:0,y:4), (x:1,y:3),  (x:2,y:2), (x:3,y:1), (x:4,y:0)],
                            [(x:0,y:5), (x:1,y:4),  (x:2,y:3), (x:3,y:2), (x:4,y:1), (x:5,y:0)],
                       [(x:2,y:8), (x:1,y:5),  (x:2,y:4), (x:3,y:3), (x:5,y:3), (x:5,y:1), (x:6,y:0)],
                  [(x:2,y:9), (x:3,y:8),  (x:2,y:5), (x:3,y:4), (x:5,y:4), (x:6,y:3), (x:6,y:1), (x:7,y:0)],
            [(x:2,y:10), (x:3,y:9),  (x:4,y:8), (x:3,y:5), (x:5,y:5), (x:6,y:4), (x:7,y:3), (x:7,y:1),  (x:9,y:1)],
      [(x:2,y:11), (x:3,y:10), (x:5,y:10), (x:5,y:8), (x:5,y:6), (x:6,y:5), (x:7,y:4), (x:8,y:3), (x:10,y:3), (x:10,y:1)],
[(x:2,y:12), (x:3,y:11), (x:5,y:11), (x:6,y:10), (x:6,y:8), (x:6,y:6), (x:7,y:5), (x:8,y:4), (x:10,y:4), (x:11,y:3), (x:nil,y:nil)]

…which doesn't encode the longest substring solution at all. That would look more like:

                                                     [(x:0,y:0)],
                                                [(x:#,y:#), (x:1,y:0)],
                                           [(x:#,y:#), (x:#,y:#), (x:2,y:0)],
                                      [(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:3,y:0)],
                                 [(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:4,y:0)],
                            [(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:5,y:0)],
                       [(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:6,y:0)],
                  [(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:7,y:0)],
            [(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:8,y:0)],
      [(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:9,y:0)],
[(x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:#,y:#), (x:nil,y:nil)]

The matching element is throwing Myers' off. Reduced test case:

func testLCSubstr() {
    // `after` alignment represents current diff behaviour
    // result should remove element at offset 0
    // for a longest common substring of length 2
    let before = ["s", "s", "o"]
    let after =  ["s",      "o"]
    let diff = after.difference(from: before)
    let expected = OrderedCollectionDifference<String>([
        .remove(offset: 0, element: "s", associatedWith:nil)
    ])
    XCTAssertEqual(diff, expected)
}

This problem appears to be a direct result of the greedy nature of the algorithm—Myers' is broken by collections containing duplicated elements. Unfortunately, identifying duplicate elements is very expensive when using a user-defined equality function (or when elements are not at least Comparable or Hashable).

Dec 20 '18 23:12 numist