jsdiff
Improve word tokenization for non-Latin characters
`Diff.diffWords` does not work correctly on non-Latin text like Korean.
If I understand right, the problem is that right now we treat all CJK characters as if they were punctuation marks / word breaks, and the fix here treats them as letters instead. But:
- the fix here also messes with combining diacritics in ways that seem to me to break existing working behaviour for languages with accents
- the fix also changes other aspects of the logic of `tokenize` beyond which characters are treated as letters vs word breaks, and I can't figure out why
- the fix doesn't help us with Japanese or Chinese, since those languages don't use spaces (and need a fundamentally different tokenization algorithm, like the one provided by `Intl.Segmenter`). That doesn't by itself make doing this a bad idea, but it makes me wonder if we ought to be making a more radical change...
I'll come back to this in due course. Would love to get your input in the meantime, @jihunleekr, but I understand if in the 2 years since you opened this PR you've lost interest!