No linesToWords function?

Open cliffordh opened this issue 4 years ago • 2 comments

I'm reading the line diff and word diff section of the wiki. It says to "make a copy of linesToChars" and call it "linesToWords". It would be great if this were built into the library already, but for now, is there a sample implementation of linesToWords in JavaScript?

cliffordh, Dec 07 '19 17:12

Made some progress by literally doing what the docs said... copy the linesToChars function and change the indexOf. I made the following changes:

```javascript
diff_match_patch.prototype.diff_linesToWords_ = function(text1, text2) {
  var lineArray = [];  // e.g. lineArray[4] == 'Hello\n'
  var lineHash = {};   // e.g. lineHash['Hello\n'] == 4

  // Insert a junk entry at index 0 to avoid generating a null character.
  lineArray[0] = '';

  /* NEW function */
  // Like indexOf, but searches for a regex match starting at index i.
  // Note: text.slice(i) creates a new string for the suffix on every call.
  function regexIndexOf(text, re, i) {
    var indexInSuffix = text.slice(i).search(re);
    return indexInSuffix < 0 ? indexInSuffix : indexInSuffix + i;
  }

  /**
   * Split a text into an array of strings.  Reduce the texts to a string of
   * hashes where each Unicode character represents one word (a run of
   * non-whitespace characters plus the whitespace character that ends it).
   * Modifies lineArray and lineHash through being a closure.
   * @param {string} text String to encode.
   * @return {string} Encoded string.
   * @private
   */
  function diff_linesToWordsMunge_(text) {
    var chars = '';
    // Walk the text, pulling out a substring for each word.
    // text.split() would temporarily double our memory footprint.
    var lineStart = 0;
    var lineEnd = -1;
    // Keeping our own length variable is faster than looking it up.
    var lineArrayLength = lineArray.length;
    while (lineEnd < text.length - 1) {
      // NEW: search for the next whitespace character instead of '\n'.
      lineEnd = regexIndexOf(text, /\s/, lineStart);
      if (lineEnd == -1) {
        lineEnd = text.length - 1;
      }
      var line = text.substring(lineStart, lineEnd + 1);

      if (lineHash.hasOwnProperty ? lineHash.hasOwnProperty(line) :
          (lineHash[line] !== undefined)) {
        chars += String.fromCharCode(lineHash[line]);
      } else {
        if (lineArrayLength == maxLines) {
          // Bail out at 65535 because
          // String.fromCharCode(65536) == String.fromCharCode(0)
          line = text.substring(lineStart);
          lineEnd = text.length;
        }
        chars += String.fromCharCode(lineArrayLength);
        lineHash[line] = lineArrayLength;
        lineArray[lineArrayLength++] = line;
      }
      lineStart = lineEnd + 1;
    }
    return chars;
  }

  // Allocate 2/3rds of the space for text1, the rest for text2.
  var maxLines = 40000;
  var chars1 = diff_linesToWordsMunge_(text1);
  maxLines = 65535;
  var chars2 = diff_linesToWordsMunge_(text2);
  return {chars1: chars1, chars2: chars2, lineArray: lineArray};
};
```

This correctly identifies new words. However, if two new words are side by side, they show up as ONE entry in the resulting diffs (after calling diff_main). I'd like each new word to show up as its own diff. The individual words do show up as individual elements in the lineArray... thoughts?
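For reference, here is a minimal standalone sketch of the splitting rule used above, so the tokens that end up in lineArray are easy to inspect. `splitIntoWordTokens` is a hypothetical helper name, not part of diff-match-patch, and it assumes the same rule as the modified munge function: each token ends at the next whitespace character, inclusive.

```javascript
// Hypothetical helper (not a diff-match-patch API): splits text into tokens,
// where each token runs up to and including the next whitespace character.
function splitIntoWordTokens(text) {
  var tokens = [];
  var start = 0;
  var re = /\s/g;  // global flag so exec() honours lastIndex
  while (start < text.length) {
    re.lastIndex = start;
    var match = re.exec(text);
    // No more whitespace: the rest of the text is the final token.
    var end = match ? match.index : text.length - 1;
    tokens.push(text.substring(start, end + 1));
    start = end + 1;
  }
  return tokens;
}

console.log(splitIntoWordTokens('a b  c'));
// → [ 'a ', 'b ', ' ', 'c' ]
```

Note that a run of two spaces produces a whitespace-only token, so consecutive whitespace is diffed as its own "word".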

On further research, it looks like diff_main squashes edits of the same type together. E.g. two new words side by side become one diff. Any way of keeping them separate?
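One possible workaround, sketched below under assumptions: the merging appears to come from the cleanup pass that diff_main runs (diff_cleanupMerge in the library source), so instead of calling diff_charsToLines_ on the result, each encoded character can be expanded into its own diff entry. `splitDiffsPerToken` is a hypothetical helper, not a library function. It assumes each diff can be read as `[op, text]` by index, that insertions are `1`, deletions `-1` and equalities `0`, and that `lineArray` is the table returned by the function above.

```javascript
// Hypothetical helper (not part of diff-match-patch).
// diffs: result of diff_main(chars1, chars2, false) on the ENCODED strings.
// lineArray: token table returned by diff_linesToWords_.
// Returns plain [op, text] pairs with one entry per inserted/deleted token.
function splitDiffsPerToken(diffs, lineArray) {
  var result = [];
  for (var i = 0; i < diffs.length; i++) {
    var op = diffs[i][0];
    var encoded = diffs[i][1];
    if (op === 0) {
      // Equalities stay merged: decode all tokens into one string.
      var equalText = '';
      for (var k = 0; k < encoded.length; k++) {
        equalText += lineArray[encoded.charCodeAt(k)];
      }
      result.push([op, equalText]);
    } else {
      // Insertions and deletions: one diff entry per token.
      for (var j = 0; j < encoded.length; j++) {
        result.push([op, lineArray[encoded.charCodeAt(j)]]);
      }
    }
  }
  return result;
}

// Intended usage (requires the library, so shown as comments only):
//   var dmp = new diff_match_patch();
//   var a = dmp.diff_linesToWords_(text1, text2);
//   var diffs = dmp.diff_main(a.chars1, a.chars2, false);
//   var perWord = splitDiffsPerToken(diffs, a.lineArray);
```

This leaves diff_main untouched and only changes how the encoded result is decoded, so it should not affect the diff itself.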

cliffordh, Dec 07 '19 18:12

I have the same issue in the Java code. I replaced the line that searches for the newline so that it searches for a space instead, but it does not seem to work either.

serjant, Dec 08 '19 10:12