Diff breaks unicode characters for emojis

Open orromis opened this issue 5 years ago • 5 comments

I'm working on a message log in our app. We want to show a diff of the changes any user made to text posts. Those posts may include emoji characters, but diff_match_patch replaces those characters with the � character (only when the emoji itself changed in the text).

The behaviour can be reproduced here: https://neil.fraser.name/software/diff_match_patch/demos/diff.html

Paste 😉 into one textarea and 😀 into the other, then compute the diff.

Why is this happening?

orromis avatar Feb 28 '19 15:02 orromis

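Looks like the diffing compares strings one UTF-16 code unit at a time and doesn't consider whether a unit is part of a multi-unit character, so it breaks unicode emojis into pieces when two different emojis share part of their encoding. The leftover half is not a valid character on its own, which would result in the unknown character � after all is said and done. Assuming it's conversion related.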

Looking into it out of curiosity.
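
To make the suspicion above concrete, here is a minimal sketch in plain JavaScript. It does not call diff_match_patch at all; `codeUnits` and `hasLoneSurrogate` are hypothetical helpers written only for this illustration. It shows that the two emojis from the report share their first UTF-16 code unit and differ only in the second, so any comparison that works per code unit can treat the shared half as "equal" and leave an orphaned half behind.

```javascript
// Hypothetical helper (not part of diff_match_patch):
// list the UTF-16 code units of a string as hex strings.
function codeUnits(text) {
  const units = [];
  for (let i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i).toString(16));
  }
  return units;
}

// Hypothetical helper: true if the string contains half of a surrogate pair
// without its partner (this is what ends up rendered as �).
function hasLoneSurrogate(text) {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = text.charCodeAt(i + 1); // NaN at end of string
      if (!(next >= 0xdc00 && next <= 0xdfff)) return true;
      i++; // skip the valid low surrogate
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true;
    }
  }
  return false;
}

const wink = "\u{1F609}"; // 😉
const grin = "\u{1F600}"; // 😀

console.log(wink.length);     // 2 (one emoji, two code units)
console.log(codeUnits(wink)); // [ 'd83d', 'de09' ]
console.log(codeUnits(grin)); // [ 'd83d', 'de00' ]

// What a per-code-unit comparison sees: a common first unit,
// and a changed second unit that is invalid on its own.
const commonPrefix = wink[0] === grin[0] ? wink[0] : "";
const deleted = wink.slice(commonPrefix.length);
const inserted = grin.slice(commonPrefix.length);

console.log(hasLoneSurrogate(wink));     // false
console.log(hasLoneSurrogate(deleted));  // true
console.log(hasLoneSurrogate(inserted)); // true
```

This only demonstrates the encoding; where exactly the library splits the pair is still an assumption on my part.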

mcataford avatar Mar 06 '19 19:03 mcataford

Looks like @yetanotherape has solved it for their PHP fork:

  • https://github.com/yetanotherape/diff-match-patch/issues/9
  • https://github.com/yetanotherape/diff-match-patch/commit/3e7b0241a06b20ad348c1d35f77204d02ec346bc

The two current attempts at solving it in this repo have both had complications:

  • https://github.com/google/diff-match-patch/pull/13
  • https://github.com/google/diff-match-patch/pull/69

josephrocca avatar Aug 18 '19 05:08 josephrocca

Any updates here?

ndvbd avatar May 24 '21 18:05 ndvbd

@ndvbd I ended up "solving" it by just escaping all the special unicode stuff with text = encodeURI(text) before saving the text (and using decodeURI(...) to undo it, of course). Bit of a hack but it works for my use case.
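
A rough sketch of that workaround, using only the built-in `encodeURI` and `decodeURI`. The wrapper names `toAsciiSafe` and `fromAsciiSafe` are hypothetical and exist only for this example; nothing here calls diff_match_patch itself.

```javascript
// Hypothetical wrappers around the built-ins, named for this example only.
function toAsciiSafe(text) {
  return encodeURI(text); // non-ASCII becomes %XX escapes of its UTF-8 bytes
}

function fromAsciiSafe(encoded) {
  return decodeURI(encoded);
}

const before = "Nice work \u{1F609}"; // 😉
const after = "Nice work \u{1F600}";  // 😀

const encodedBefore = toAsciiSafe(before);
const encodedAfter = toAsciiSafe(after);

console.log(encodedBefore); // Nice%20work%20%F0%9F%98%89
console.log(encodedAfter);  // Nice%20work%20%F0%9F%98%80

// The encoded text is pure ASCII, so there are no surrogate pairs to split.
const isAscii = /^[\x00-\x7F]*$/.test(encodedBefore);
console.log(isAscii); // true

// Decoding a complete encoded text gives the original back.
console.log(fromAsciiSafe(encodedBefore) === before); // true

// Caveat: a fragment that cuts through a %XX escape cannot be decoded.
let fragmentError = null;
try {
  fromAsciiSafe("%F0%9F%98%8"); // truncated escape
} catch (err) {
  fragmentError = err;
}
console.log(fragmentError instanceof URIError); // true
```

Note the caveat at the end: the two encoded strings differ only in their last character, so a character-level diff over the encoded text may well cut through a percent-escape. Decoding the whole text after the fact is safe; decoding individual diff segments is not something this sketch covers.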

josephrocca avatar May 24 '21 23:05 josephrocca

Check out #80, as referenced above @ndvbd's comment. It should handle all the surrogate pairs properly.

dmsnell avatar May 25 '21 00:05 dmsnell