jsdiff Improve word tokenization for non-Latin characters

Open jihunleekr opened this issue 4 years ago • 1 comment

Diff.diffWords does not work correctly on text in non-Latin scripts such as Korean.

Sep 19 '21 09:09 jihunleekr

If I understand right, the problem is that right now we treat all CJK characters as if they were punctuation marks / word breaks, and the fix here treats them as letters instead. But:

- the fix here also messes with combining diacritics in ways that seem to me to break existing working behaviour for languages with accents
- the fix also changes other aspects of the logic of tokenize beyond which characters are treated as letters vs word breaks, and I can't figure out why
- the fix doesn't help us with Japanese or Chinese since those languages don't use spaces (and need a fundamentally different tokenization algorithm like the one provided by Intl.Segmenter). Doesn't by itself make doing this a bad idea, but makes me wonder if we ought to be making a more radical change...
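
To make the points above concrete, here is a rough sketch of the three approaches being compared. None of this is jsdiff's actual tokenizer: the names `ASCII_TOKENS`, `UNICODE_TOKENS`, `tokenize` and `segmentWords` are hypothetical, and the regexes are simplified stand-ins for "treat CJK as word breaks" versus "treat CJK (and combining marks) as letters". Only `Intl.Segmenter` is a real built-in, and the segments it returns for Japanese or Chinese depend on the runtime's ICU data, so no output is shown for that case.

```javascript
// Hypothetical illustration only; this is NOT jsdiff's real tokenizer.

// Stand-in for "non-Latin characters are word breaks":
// only ASCII letters, digits and underscore can form a multi-character word.
const ASCII_TOKENS = /[A-Za-z0-9_]+|\s+|[\s\S]/gu;

// Stand-in for "treat them as letters": any Unicode letter (\p{L}),
// combining mark (\p{M}) or number (\p{N}) can be part of a word.
// Including \p{M} keeps decomposed accents attached to their base letter.
const UNICODE_TOKENS = /[\p{L}\p{M}\p{N}_]+|\s+|[\s\S]/gu;

// Split text into word runs, whitespace runs, and single leftover characters.
function tokenize(pattern, text) {
  return text.match(pattern) ?? [];
}

// Word segmentation via the built-in Intl.Segmenter, which does not rely on
// spaces and so can be applied to Japanese or Chinese text.
function segmentWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// Korean uses spaces, so the Unicode-aware regex is enough to find words:
tokenize(ASCII_TOKENS, '안녕 세계');   // ['안', '녕', ' ', '세', '계']
tokenize(UNICODE_TOKENS, '안녕 세계'); // ['안녕', ' ', '세계']

// A decomposed accent (e + U+0301) stays inside the word because of \p{M}:
tokenize(UNICODE_TOKENS, 'cafe\u0301 au lait'); // ['cafe\u0301', ' ', 'au', ' ', 'lait']

// Japanese has no spaces, so the regex sees one long "word";
// Intl.Segmenter splits it further (exact segments depend on the runtime).
tokenize(UNICODE_TOKENS, '吾輩は猫である'); // ['吾輩は猫である']
segmentWords('吾輩は猫である', 'ja');
```

The regex variants only change which characters count as letters, which is why they help with Korean but not with languages written without spaces; the `Intl.Segmenter` variant is the "more radical change", since it replaces the splitting algorithm itself.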

I'll come back to this in due course. Would love to get your input in the meantime, @jihunleekr, but I understand if in the 2 years since you opened this PR you've lost interest!

Dec 18 '23 13:12 ExplodingCabbage
