diff-match-patch icon indicating copy to clipboard operation
diff-match-patch copied to clipboard

Javascript line diff breaks beyond 65K lines

Open tkruse opened this issue 5 years ago • 6 comments

I try using The google diff-match-path library from nodejs for line diffs: https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs. I get wrong patches when in sum the lines of both inputs goes beyond 65,536 (2^16) lines.

Is that a bug (in my code or diff-match-patch), or am I hitting a known limitation of javascript/nodejs? Anything I can do to use d-m-p with larger files?

This script reproduces the problem

var diff_match_patch = require("diff-match-patch")

// function copied from google wiki 
// https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs
function diff_lineMode(text1, text2) {
  var dmp = new diff_match_patch();
  var a = dmp.diff_linesToChars_(text1, text2);
  var lineText1 = a.chars1;
  var lineText2 = a.chars2;
  var lineArray = a.lineArray;
  var diffs = dmp.diff_main(lineText1, lineText2, false);
  dmp.diff_charsToLines_(diffs, lineArray);
  return diffs;
}

// reproduce problem by diffing string with many lines to "abcd"
for (let size = 65534; size < 65538; size += 1) {
  let text1 = "";
  for (let i = 0; i < size; i++) {
    text1 += i + "\n";
  }

  var patches = diff_lineMode(text1, "abcb")
  console.log("######## Size: " + size + ": patches " + patches.length)
  for (let i = 0; i < patches.length; i++) {
    // patch[0] is action, patch[1] is value
    var action = patches[i][0] < 0 ? "remove" : (patches[i][0] > 0 ? "add" : "keep")
    console.log("patch" + i + ": " + action + "\n" + patches[i][1].substring(0, 10))
  }
}

Giving these outputs (using substring in code above to shorten outputs):

######## Size: 65534: patches 2
patch0: remove
0
1
2
3
4

patch1: add
abcb
######## Size: 65535: patches 2
patch0: remove
0
1
2
3
4

patch1: add

######## Size: 65536: patches 2
patch0: keep
0

patch1: remove
1
2
3
4
5

######## Size: 65537: patches 3
patch0: remove
0

patch1: keep
1

patch2: remove
2
3
4
5
6

Using

$ node --version v6.3.1
cat package.json
{
  "name": "dmp_bug",
  "version": "1.0.0",
  "description": "reproduce issue with diff match patch",
  "main": "dmpbug.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "diff-match-patch": "^1.0.4"
  }
}

tkruse avatar Dec 04 '18 01:12 tkruse

The line/word patch mechanism presented here is to encode each unique line or word as a unique Unicode character. In ES5 only the first 16 bits are accessible (using String.fromCharCode).

However, from ES6, one can use String.fromCodePoint which should allow for 21 bits: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint

Try replacing String.fromCharCode with String.fromCodePoint and charCodeAt with codePointAt.

NeilFraser avatar Dec 04 '18 01:12 NeilFraser

Thanks, for the quick reply. I wonder if the library could detect an overflow of unique lines and error out instead of producing invalid patches. Feel free to close if nothing is to be done about it.

tkruse avatar Dec 04 '18 02:12 tkruse

I'll leave this issue open, with the intention of adding a check for the existence of fromCodePoint and codePointAt. This way it will still run on ES5, but will have added capability if run on ES6+.

NeilFraser avatar Dec 04 '18 02:12 NeilFraser

Also note that e.g. the java and python implementation seem to suffer the same problem, and they both have similar handling:

python2 https://github.com/google/diff-match-patch/blob/895a9512bbcee0ac5a8ffcee36062c8a79f5dcda/python2/diff_match_patch.py#L429

 if len(lineArray) == maxLines:
        # Bail out at 65535 because unichr(65536) throws.
        line = text[lineStart:]
        lineEnd = len(text)
lineArray.append(line)

java https://github.com/google/diff-match-patch/blob/895a9512bbcee0ac5a8ffcee36062c8a79f5dcda/java/src/name/fraser/neil/plaintext/diff_match_patch.java#L547

if (lineArray.size() == maxLines) {
      // Bail out at 65535 because
      // String.valueOf((char) 65536).equals(String.valueOf(((char) 0)))
      line = text.substring(lineStart);
      lineEnd = text.length();
}

Same for dart.

tkruse avatar Dec 04 '18 09:12 tkruse

@NeilFraser can the fromCodePoint and codePointAt solution be added as an option? I use this library via npm, and that sounds like a great option to have.

kevin-lindsay-1 avatar Dec 14 '20 22:12 kevin-lindsay-1

@NeilFraser Thank you for your solution. It actually works with up to 2^21 file rows. By simply replacing String.fromCharCode with String.fromCodePoint and String.charCodeAt with String.codePointAt, and changing maxLines to 2M, the magic happens, and big file comparisons are now functional.

@NeilFraser or @google-admin, considering that all browsers support ES6, could you please verify and release a fix for the rest of the world?

Note: Upon reviewing, your diff implementation appears to be the fastest on the market. Thanks again.

AvshT avatar Feb 12 '24 08:02 AvshT