diff-match-patch icon indicating copy to clipboard operation
diff-match-patch copied to clipboard

Use UTF-32 Aware String in JavaScript

Open sffc opened this issue 6 years ago โ€ข 14 comments

As noted in #10, standard JavaScript strings use UTF-16 encoding, which leads to false arithmetic for a tool like diff_match_patch. The only correct fixes to the problem would be to either rewrite the whole library to operate in offsets instead of char indices or to push the character encoding into an abstraction layer. This PR attempts to do the second option.

At every API boundary, the input string is checked to see if it contains supplemental code points, and if it does, the inputs are converted to utf32_string, an abstraction layer that has a String-like API but does all math in terms of code points instead of code units.

The utf32_string implementation is slower than the standard string implementation, but not prohibitively slower, and it is used only when required (when the string contains supplemental code points).

I tried my best to be consistent with the JavaScript style of the existing project, including pre-ES6 syntax. I did not add Closure type annotations.

sffc avatar May 22 '18 14:05 sffc

@NeilFraser Initial thoughts on this?

sffc avatar Jun 09 '18 05:06 sffc

@NeilFraser Hello?

sffc avatar Aug 31 '18 00:08 sffc

Hi! thank you for providing the patch. I have tested this patch and works for me when I use escaped unicode e.g. '\uD800\uDDE4' as you have on your tests. However if I paste the following code on the chrome console

 var dmp = new diff_match_patch();
 var patches = dmp.patch_make('๐Ÿ‘test', '๐Ÿ‘');
 var patchText = dmp.patch_toText(patches);
 var result = dmp.patch_fromText(patchText);
 var obj = dmp.patch_apply(result, '๐Ÿ‘test');

I get this exception:

diff_match_patch.1.0.1.js:2368 Uncaught TypeError: text.codePointAt is not a function
    at Function.diff_match_patch.utf32_string.from (diff_match_patch.1.0.1.js:2368)
    at diff_match_patch.diff_main (diff_match_patch.1.0.1.js:113)
    at diff_match_patch.patch_apply (diff_match_patch.1.0.1.js:1968)
    at <anonymous>:4:17

If you know any workaround around this issue, please could you share it here.

EmilioDiez avatar Feb 15 '19 08:02 EmilioDiez

Probably what's happening is diff_match_patch.utf32_string.from is getting passed text that is already a utf32_string. Try adding a line to the top of that function that leaves if it is already a utf32_string.

  if (text.codePoints) {
    // Argument is another utf32_string; do not re-convert
    return text;
  }

sffc avatar Feb 15 '19 11:02 sffc

Thank you very much,

The exception is gone ๐Ÿ˜„ but I still have some Unicode strings that give errors. To reproduce, paste the following code on the chrome console:

var dmp = new diff_match_patch();
var str = '๐Ÿ‘๐Ÿ‘t๐Ÿ‘'; var patches = dmp.patch_make(str, 'test๐Ÿ‘');
var patchText = dmp.patch_toText(patches);
var _patches = dmp.patch_fromText(patchText);
var obj = dmp.patch_apply(_patches, str);
console.log(obj[0]);

it will output this "mojibake"

"tes๏ฟฝt๐Ÿ‘"

EmilioDiez avatar Feb 15 '19 17:02 EmilioDiez

Hi, thanks for finding these issues. I don't plan to spend any more time on my PR since it hasn't gotten any attention in 6 months. I encourage you to copy my branch into your own fork (pull it from my fork and push it to your fork) and then add your own commits where you fix these errors. You can then open a new PR and hopefully Neil will find some time one of these days to review it.

sffc avatar Feb 15 '19 20:02 sffc

Ok, no worries ๐Ÿ˜ข I will report back if I manage to fix the issue. Thank you again for the patch.

EmilioDiez avatar Feb 16 '19 03:02 EmilioDiez

Any updates?

ndvbd avatar May 24 '21 18:05 ndvbd

@ndvbd you can check out the branch in #80. it doesn't have fixes for all languages but it does for JavaScript. It's also only patched for toDelta since that's the function where we have the issue splitting pairs (since deltas are encoded as UTF-8 with percent-encoding).

We've been using it for more than a year in Simplenote and it's working fine. Previously we had some corruption issues that popped up on account of the issue with surrogate pairs.

dmsnell avatar May 24 '21 21:05 dmsnell

@dmsnell any idea why it is not merged into main?

ndvbd avatar May 25 '21 16:05 ndvbd

Looks like there are now at least 3 independent PRs with various approaches to fixing the issue: this one, #69, and #80. @NeilFraser Can you take a look at these and perhaps merge one?

sffc avatar May 25 '21 16:05 sffc

@ndvbd I would wager that @NeilFraser is busy with other things.

@sffc this approach is fairly heavy-handed - did you run it against the performance tests? what were the costs? How do you feel about converting to UTF-32 vs. the other approaches?

Per the comments in #69 I think that approach sounds better than it is in practice. The solution in #80 only addresses the delta generation (which is "fine" because this library doesn't specify that it produces Unicode byte streams/diffs) and it only imparts a marginal performance penalty and that only when surrogate halves are split. The only thing I see it lacking is the boundary cleanup as seen in the dissimilar because that normalizes all diff outputs. I haven't done that simply because it's working stably enough in #80 and I moved on to other problems.

Your feedback on those issues is welcome. Regardless of what takes place in master I'd like to see us all merge our ideas for resolving this and other encoding-related issues in diff-match-patch.

dmsnell avatar May 25 '21 17:05 dmsnell

I'm happy if #69 is enough to solve the issue; it's certainly a lot smaller of a patch than this one. There are a lot of code paths; I think the key thing with any of these is a thorough test suite.

sffc avatar May 25 '21 18:05 sffc

I've tried all 3 solutions to this problem and yours is the only one that solved the encodingURI exception for me. I created a minimal repo which reproduces the issue: https://github.com/gamedevsam/diff-match-patch-emoji-issue

Hope it's helpful to the developers of the other 2 alternate solutions. Thank you for your contributions!

gamedevsam avatar Aug 24 '22 04:08 gamedevsam