diff-match-patch
diff-match-patch copied to clipboard
Javascript line diff breaks beyond 65K lines
I try using The google diff-match-path library from nodejs for line diffs: https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs. I get wrong patches when in sum the lines of both inputs goes beyond 65,536 (2^16) lines.
Is that a bug (in my code or diff-match-patch), or am I hitting a known limitation of javascript/nodejs? Anything I can do to use d-m-p with larger files?
This script reproduces the problem
var diff_match_patch = require("diff-match-patch")
// function copied from google wiki
// https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs
function diff_lineMode(text1, text2) {
var dmp = new diff_match_patch();
var a = dmp.diff_linesToChars_(text1, text2);
var lineText1 = a.chars1;
var lineText2 = a.chars2;
var lineArray = a.lineArray;
var diffs = dmp.diff_main(lineText1, lineText2, false);
dmp.diff_charsToLines_(diffs, lineArray);
return diffs;
}
// reproduce problem by diffing string with many lines to "abcd"
for (let size = 65534; size < 65538; size += 1) {
let text1 = "";
for (let i = 0; i < size; i++) {
text1 += i + "\n";
}
var patches = diff_lineMode(text1, "abcb")
console.log("######## Size: " + size + ": patches " + patches.length)
for (let i = 0; i < patches.length; i++) {
// patch[0] is action, patch[1] is value
var action = patches[i][0] < 0 ? "remove" : (patches[i][0] > 0 ? "add" : "keep")
console.log("patch" + i + ": " + action + "\n" + patches[i][1].substring(0, 10))
}
}
Giving these outputs (using substring in code above to shorten outputs):
######## Size: 65534: patches 2
patch0: remove
0
1
2
3
4
patch1: add
abcb
######## Size: 65535: patches 2
patch0: remove
0
1
2
3
4
patch1: add
######## Size: 65536: patches 2
patch0: keep
0
patch1: remove
1
2
3
4
5
######## Size: 65537: patches 3
patch0: remove
0
patch1: keep
1
patch2: remove
2
3
4
5
6
Using
$ node --version v6.3.1
cat package.json
{
"name": "dmp_bug",
"version": "1.0.0",
"description": "reproduce issue with diff match patch",
"main": "dmpbug.js",
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1"
},
"author": "",
"license": "ISC",
"dependencies": {
"diff-match-patch": "^1.0.4"
}
}
The line/word patch mechanism presented here is to encode each unique line or word as a unique Unicode character. In ES5 only the first 16 bits are accessible (using String.fromCharCode).
However, from ES6, one can use String.fromCodePoint which should allow for 21 bits: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint
Try replacing String.fromCharCode
with String.fromCodePoint
and charCodeAt
with codePointAt
.
Thanks, for the quick reply. I wonder if the library could detect an overflow of unique lines and error out instead of producing invalid patches. Feel free to close if nothing is to be done about it.
I'll leave this issue open, with the intention of adding a check for the existence of fromCodePoint and codePointAt. This way it will still run on ES5, but will have added capability if run on ES6+.
Also note that e.g. the java and python implementation seem to suffer the same problem, and they both have similar handling:
python2 https://github.com/google/diff-match-patch/blob/895a9512bbcee0ac5a8ffcee36062c8a79f5dcda/python2/diff_match_patch.py#L429
if len(lineArray) == maxLines:
# Bail out at 65535 because unichr(65536) throws.
line = text[lineStart:]
lineEnd = len(text)
lineArray.append(line)
java https://github.com/google/diff-match-patch/blob/895a9512bbcee0ac5a8ffcee36062c8a79f5dcda/java/src/name/fraser/neil/plaintext/diff_match_patch.java#L547
if (lineArray.size() == maxLines) {
// Bail out at 65535 because
// String.valueOf((char) 65536).equals(String.valueOf(((char) 0)))
line = text.substring(lineStart);
lineEnd = text.length();
}
Same for dart.
@NeilFraser can the fromCodePoint
and codePointAt
solution be added as an option? I use this library via npm, and that sounds like a great option to have.
@NeilFraser Thank you for your solution. It actually works with up to 2^21 file rows. By simply replacing String.fromCharCode with String.fromCodePoint and String.charCodeAt with String.codePointAt, and changing maxLines to 2M, the magic happens, and big file comparisons are now functional.
@NeilFraser or @google-admin, considering that all browsers support ES6, could you please verify and release a fix for the rest of the world?
Note: Upon reviewing, your diff implementation appears to be the fastest on the market. Thanks again.