grapheme-splitter
अनुच्छेद => अ नु च्छे द
अनुच्छेद should return the 4 strings ["अ", "नु", "च्छे", "द"], not the 5 strings ["अ","नु","च्","छे","द"]. Basically, the split should match how the cursor moves through the string: it skips over 4 characters, or graphemes to be more accurate.
prob a dupe of #22. there's a bit more detail in #25.
@kotpal I have the same issue. Have you managed to find a workaround?
Yes indeed, @Aditya-ds-1806. I found nota/split-graphemes on the same day I created this issue. At first it also wasn't handling Indic languages properly.
But after I filed an issue there, the problem was fixed a couple of months later.
I stuck with split-graphemes and have used it for all my linguistic projects since - and highly recommend it.
There is also https://github.com/flmnt/graphemer
@papb, Graphemer doesn't split Indic graphemes the way I want, which is the way Notepad and many other complex-script-aware programs behave. When you use the cursor to navigate across the word, they treat च्छे as one character/grapheme rather than as "च्" and "छे".
The Graphemer README says: splitter.splitGraphemes('अनुच्छेद'); // returns ["अ","नु","च्","छे","द"]
However, splitting अनुच्छेद into graphemes should return ["अ", "नु", "च्छे", "द"] - which split-graphemes properly handles.
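For anyone who wants to see the difference in code, here is a minimal sketch of the cursor-style grouping described above. It is not how split-graphemes, Graphemer, or grapheme-splitter are implemented; the function name splitDevanagariClusters is made up for illustration. It assumes only two rules: combining marks attach to the preceding character, and a consonant that follows a virama (U+094D) stays in the same cluster. It ignores ZWJ/ZWNJ, emoji, and other scripts, so treat it as an illustration rather than a replacement for a library.

```javascript
// Hypothetical helper, for illustration only: groups Devanagari text into
// cursor-style clusters by keeping conjuncts (consonant + virama + consonant) together.
const VIRAMA = '\u094D'; // Devanagari sign virama (halant)
const MARK = /\p{M}/u;   // any Unicode combining mark (vowel signs, virama, etc.)

function splitDevanagariClusters(text) {
  const clusters = [];
  let current = '';
  for (const ch of text) { // iterates by code point
    const joinsPrevious =
      current !== '' && (MARK.test(ch) || current.endsWith(VIRAMA));
    if (joinsPrevious) {
      current += ch; // mark, or a consonant joined by a preceding virama
    } else {
      if (current !== '') clusters.push(current);
      current = ch; // start a new cluster
    }
  }
  if (current !== '') clusters.push(current);
  return clusters;
}

console.log(splitDevanagariClusters('अनुच्छेद'));
// → [ 'अ', 'नु', 'च्छे', 'द' ]
```

Dropping the current.endsWith(VIRAMA) check makes the same loop break after the virama, which gives the 5-way split ["अ","नु","च्","छे","द"] quoted from the Graphemer README.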