
Improve word tokenization for non-Latin characters

Open jihunleekr opened this issue 4 years ago • 1 comments

Diff.diffWords does not work correctly on text in non-Latin scripts such as Korean.

jihunleekr avatar Sep 19 '21 09:09 jihunleekr

If I understand correctly, the problem is that right now we treat all CJK characters as if they were punctuation marks / word breaks, and the fix here treats them as letters instead. But:

  • the fix here also messes with combining diacritics in ways that seem to me to break existing working behaviour for languages with accents
  • the fix also changes other aspects of the logic of tokenize beyond which characters are treated as letters vs word breaks, and I can't figure out why
  • the fix doesn't help us with Japanese or Chinese, since those languages don't use spaces (and need a fundamentally different tokenization algorithm like the one provided by Intl.Segmenter). That doesn't by itself make this a bad idea, but it makes me wonder if we ought to be making a more radical change...
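
To make the two options concrete, here is a rough sketch. It is not the code from this PR and not jsdiff's actual tokenizer; both function names are hypothetical. The first helper treats letters of any script, plus combining marks, as word characters via Unicode property escapes, which is enough for space-separated scripts like Korean. The second defers to Intl.Segmenter with word granularity, which can find boundaries in text written without spaces. It assumes a runtime that ships Intl.Segmenter with full ICU data.

```javascript
// Hypothetical helper, not part of jsdiff: split text into runs of word
// characters, runs of whitespace, and single other characters.
// \p{L} = letters in any script, \p{M} = combining marks, \p{N} = numbers.
function tokenizeByUnicodeClass(text) {
  const pattern = /[\p{L}\p{M}\p{N}_]+|\s+|[^\p{L}\p{M}\p{N}_\s]/gu;
  return text.match(pattern) || [];
}

// Hypothetical helper, not part of jsdiff: let Intl.Segmenter decide where
// the word boundaries are. Each yielded item has a `segment` string; the
// segments concatenate back to the original input.
function tokenizeWithSegmenter(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

// Korean uses spaces, so classifying Hangul as letters is sufficient.
console.log(tokenizeByUnicodeClass('안녕하세요 세계'));
// → [ '안녕하세요', ' ', '세계' ]

// A decomposed accent (e + U+0301) stays attached to its word because
// combining marks are in the word-character class.
console.log(tokenizeByUnicodeClass('cafe\u0301 noir'));

// Japanese has no spaces; the exact split depends on the runtime's ICU data.
console.log(tokenizeWithSegmenter('今日は晴れです', 'ja'));
```

The regex approach keeps the existing "words separated by whitespace and punctuation" model and only widens what counts as a letter, so a whole Japanese or Chinese sentence would still come out as a single token. The segmenter approach changes the model itself, which is the more radical change mentioned above.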

I'll come back to this in due course. Would love to get your input in the meantime, @jihunleekr, but I understand if in the 2 years since you opened this PR you've lost interest!

ExplodingCabbage avatar Dec 18 '23 13:12 ExplodingCabbage