text-icu
Getting the size of a grapheme cluster
I'd like to get the size of a grapheme cluster (from a value of type Text). Is there a function in the library that can help me with it? If not, is it in the scope of the library to provide one?
I'm not even sure what "size of a grapheme cluster" means.
There are various ways to normalize text (compose/decompose character sequences); see https://hackage.haskell.org/package/text-icu-0.8.0.3/docs/Data-Text-ICU-Normalize2.html
Maybe unorm2_composePair() can help to compose those clusters and get their size.
> I'm not even sure what "size of a grapheme cluster" means.
It's the operation that gives the length in graphemes, not code points. For example, the length of this grapheme cluster: "🤦🏼♂️" is 1.
This is an interesting problem; there's a short read about it here: https://tonsky.me/blog/unicode/
@Kleidukos In Agda we use cluster counting, as linked below. Is that what you are looking for?
https://github.com/agda/agda/blob/4c5501e369b63ff3eabdbb3217db59904baf0e78/src/full/Agda/Interaction/Highlighting/LaTeX/Base.hs#L708-L716
length . ICU.breaks (ICU.breakCharacter ICU.Root)
Oh yeah definitely! I'm quite surprised it's not offered by the library directly. Thanks @andreasabel!
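For anyone landing here later, here is a minimal sketch of the one-liner above wrapped into standalone helpers. The names `graphemeLength` and `graphemeClusters` are made up for illustration and are not part of text-icu; the sketch assumes a text-icu version where `Data.Text.ICU` exports the pure `breaks`, `breakCharacter`, `brkBreak`, and the `Root` locale.

```haskell
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- | Number of grapheme clusters (user-perceived characters) in a Text,
-- found with ICU's character break iterator under the root locale.
-- Hypothetical helper name, not provided by text-icu itself.
graphemeLength :: Text -> Int
graphemeLength = length . ICU.breaks (ICU.breakCharacter ICU.Root)

-- | The clusters themselves, one Text per grapheme cluster.
-- Also a hypothetical helper name.
graphemeClusters :: Text -> [Text]
graphemeClusters = map ICU.brkBreak . ICU.breaks (ICU.breakCharacter ICU.Root)
```

In GHCi, `T.length (T.pack "e\x0301")` counts two code points, while `graphemeLength (T.pack "e\x0301")` counts one cluster, since the combining acute accent attaches to the preceding letter. Results for newer emoji sequences may depend on the Unicode version of the ICU library you link against.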