str_split not splitting correctly on Unicode character

Open alexanderbeatson opened this issue 1 year ago • 2 comments

I am trying to split Burmese Unicode text with stringr::str_split(), but it does not return the correct values.

str_split("စမ်းသပ်မှု", "")[[1]]

It returns:

[1] "စ" "မ်" "း" "သ" "ပ်" "မှု"

If I use the built-in strsplit(), strsplit("စမ်းသပ်မှု", "")[[1]], it splits at the character level:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"

I found that str_split() treats the empty string "" as a regex, but stringr::str_split() returns neither a character-level nor a syllable-level split. The syllable-level split would be:

[1] "စမ်း" "သပ်" "မှု"

So, I don't think it is actually a feature like Issue:88

For further study, could someone point me to where this splitting comes from? I found that other services, such as Google, also use this incorrect splitting. TIA.

alexanderbeatson avatar Mar 29 '24 06:03 alexanderbeatson

... and what would be the correct result?

gagolews avatar Apr 02 '24 13:04 gagolews

Correct return should be:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"

alexanderbeatson avatar Apr 04 '24 06:04 alexanderbeatson

All I know about Burmese is what I've just read on Wikipedia, but it sounds like you're looking to break the string up into individual code points, not characters (which, because Burmese is an abugida, not an alphabet, represent syllables, not individual vowels and consonants).

I don't see an obvious way to do this with stringi, but @gagolews might.
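
To make the code point versus character distinction concrete, here is a minimal base R sketch (no stringi involved) that lists the code points of the string from the report. It relies only on `utf8ToInt()` and `intToUtf8()`, and it assumes the session runs in a UTF-8 locale; the variable names are illustrative.

```r
x <- "စမ်းသပ်မှု"

# Integer code points of the string, one per Unicode scalar value
cp <- utf8ToInt(x)

# Turn each code point back into its own one-element string
code_points <- intToUtf8(cp, multiple = TRUE)

# Label each piece with its U+XXXX value for inspection
labelled <- setNames(code_points, sprintf("U+%04X", cp))
print(labelled)

# Number of code points; the strsplit() output quoted earlier has 10 elements
length(code_points)
```

Combining marks such as "်" show up as separate elements here, whereas a character-level (grapheme cluster) split keeps them attached to their base consonant.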

hadley avatar Jul 15 '24 21:07 hadley

@hadley Thank you for raising the point. Burmese is indeed an abugida.

I understand that every pseudo-alphabetic writing system has its own structure, which can be confusing, and that there may even be competing conventions for how to break text down.

Let me explain in detail how the phrase "စမ်းသပ်မှု" (meaning "testing" or "test") breaks down:

  • "စမ်းသပ်မှု" is a single word
  • It contains 3 distinct syllables: ["စမ်း", "သပ်", "မှု"]

str_split() breaks the syllables into (grammatically) illegal groups. For example, it breaks "စမ်း" into ["စ", "မ်", "း"], where "မ်" and "း" cannot grammatically stand alone.

I am a native Burmese NLP researcher and I believe I could help with this implementation. I recently developed bursyl, a regex-based Burmese syllabification algorithm (with very strict grammatical rules, which can be adjusted). Could it potentially be implemented in stringi for splitting Burmese text, @gagolews?
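
As a rough illustration of what regex-based syllable splitting can look like, here is a deliberately simplified base R sketch. It is not bursyl's rule set, which this thread does not show. It rests on one assumed heuristic: a syllable starts at a consonant (U+1000 to U+1021) that is neither stacked under a virama (U+1039) nor closed by an asat (U+103A) or virama. Independent vowels, digits, punctuation, and kinzi are not handled, and a UTF-8 locale is assumed.

```r
x <- "စမ်းသပ်မှု"

# Simplified rule (an assumption, not bursyl's actual rules):
# a syllable starts at a consonant (U+1000..U+1021) that
#   - is not preceded by virama U+1039 (not a stacked consonant), and
#   - is not followed by asat U+103A or virama U+1039 (not a final).
syllable_start <- "(?<!\u1039)([\u1000-\u1021])(?![\u103A\u1039])"

# Mark every syllable start with a newline, then split on that marker
marked <- gsub(syllable_start, "\n\\1", x, perl = TRUE)
syllables <- strsplit(sub("^\n", "", marked), "\n", fixed = TRUE)[[1]]

print(syllables)
```

For the example word, this heuristic should reproduce the three syllables listed above; real text needs a much stricter rule set.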

alexanderbeatson avatar Jul 16 '24 06:07 alexanderbeatson

On a side note, https://unicode-org.github.io/icu/userguide/boundaryanalysis/ says that:

*Dictionary-Based BreakIterator

Some languages are written without spaces, and word and line breaking requires more than rules over character sequences. ICU provides dictionary support for word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese.

Use of the dictionaries is automatic when text in one of the dictionary languages is encountered. There is no separate API, and no extra programming steps required by applications making use of the dictionaries.*
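
A minimal sketch of how this could be tried from R, assuming the stringi package is installed and that its ICU build ships the Burmese dictionary data. `stri_split_boundaries()` with `type = "word"` asks ICU's word break iterator for boundaries, and the character-level split is included for comparison. No expected output is shown because the result depends on the ICU version and data in use.

```r
library(stringi)

x <- "စမ်းသပ်မှု"

# Word boundaries: ICU applies its dictionary on its own for Burmese text
words <- stri_split_boundaries(x, type = "word")[[1]]

# Character boundaries (grapheme clusters), for comparison
chars <- stri_split_boundaries(x, type = "character")[[1]]

print(words)
print(chars)
```

In stringr, `str_split(x, boundary("word"))` should go through the same break iterator, though by default it may drop pieces that ICU does not classify as words.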

gagolews avatar Jul 16 '24 07:07 gagolews