
अनुच्छेद => अ नु च्छे द

Open kotpal opened this issue 6 years ago • 5 comments

अनुच्छेद should return the 4 strings ["अ", "नु", "च्छे", "द"] and not the 5 strings ["अ","नु","च्","छे","द"]. This matches how the cursor moves through the string: it steps over 4 characters, or graphemes to be more accurate.

kotpal avatar Jan 29 '19 18:01 kotpal

prob a dupe of #22. there's a bit more detail in #25.

vapier avatar Feb 07 '19 05:02 vapier

@kotpal I have the same issue. Have you managed to find a workaround?

Aditya-ds-1806 avatar Feb 27 '21 16:02 Aditya-ds-1806

Yes indeed, @Aditya-ds-1806. I found nota/split-graphemes on the same day I created this issue. At the time, it also wasn't handling Indic languages properly.

But after I opened an issue there, the problem was fixed a couple of months later.

I stuck with split-graphemes and used it for all my linguistic projects henceforth - and highly recommend it.

kotpal avatar Feb 27 '21 19:02 kotpal

There is also https://github.com/flmnt/graphemer

papb avatar Mar 01 '21 22:03 papb

@papb, Graphemer doesn't split Indic graphemes the way I want, which is the way Notepad and many other complex-script-aware programs do. When you use the cursor to navigate across the word, those programs treat च्छे as one character/grapheme rather than as "च्" and "छे".

Readme for Graphemer says: splitter.splitGraphemes('अनुच्छेद'); // returns ["अ","नु","च्","छे","द"]

However, splitting अनुच्छेद into graphemes should return ["अ", "नु", "च्छे", "द"] - which split-graphemes properly handles.
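For anyone stuck with a splitter that returns the 5-element result, here is a minimal sketch of a post-processing workaround. It assumes the only problem is that a cluster ending in the Devanagari virama (U+094D) gets separated from the consonant that follows it. `mergeViramaClusters` is a hypothetical helper written for this thread, not part of grapheme-splitter, Graphemer, or split-graphemes, and it is not claimed to be how split-graphemes fixed the issue.

```javascript
// Hypothetical helper: not an API of any of the libraries mentioned above.
// Assumption: a cluster ending in the Devanagari virama (U+094D) belongs
// with the cluster that follows it, forming one conjunct.
const VIRAMA = "\u094D";

function mergeViramaClusters(clusters) {
  const merged = [];
  let pending = "";
  for (const cluster of clusters) {
    pending += cluster;
    // Keep accumulating while the text so far ends in a virama.
    if (!pending.endsWith(VIRAMA)) {
      merged.push(pending);
      pending = "";
    }
  }
  // A word-final virama has nothing to join with; keep it as its own cluster.
  if (pending) merged.push(pending);
  return merged;
}

// The 5-element split quoted from the Graphemer readme above.
const base = ["अ", "नु", "च्", "छे", "द"];
console.log(mergeViramaClusters(base)); // [ 'अ', 'नु', 'च्छे', 'द' ]
```

This only covers Devanagari; other Indic scripts have their own virama code points and would need the same check extended to them.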

kotpal avatar Mar 01 '21 23:03 kotpal