text-icu

Getting the size of a grapheme cluster

Open Kleidukos opened this issue 2 years ago • 4 comments

I'd like to get the size of a grapheme cluster (from a value of type Text). Is there a function in the library that can help me with it? If not, is it in the scope of the library to provide one?

Kleidukos, Oct 03 '23 08:10

I'm not even sure what the "size of a grapheme cluster" means.

There are various ways to normalize text (compose/decompose grapheme clusters) https://hackage.haskell.org/package/text-icu-0.8.0.3/docs/Data-Text-ICU-Normalize2.html

Maybe unorm2_composePair() can help to compose those clusters and get their size.

vshabanov, Oct 04 '23 22:10

> I'm not even sure what the "size of a grapheme cluster" means.

It's the operation that gives the length of a text in grapheme clusters, not code points. For example, the length of this grapheme cluster: "🤦🏼‍♂️" is 1.

This is an interesting problem; there's a short read about it here: https://tonsky.me/blog/unicode/

Kleidukos, Oct 05 '23 13:10

@Kleidukos In Agda we use cluster counting as linked below. Is that what you are looking for? https://github.com/agda/agda/blob/4c5501e369b63ff3eabdbb3217db59904baf0e78/src/full/Agda/Interaction/Highlighting/LaTeX/Base.hs#L708-L716

length . ICU.breaks (ICU.breakCharacter ICU.Root)
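For anyone landing here later, the one-liner above can be wrapped into a small standalone program. This is only a minimal sketch, assuming `Data.Text.ICU` is imported qualified as `ICU`. The name `graphemeLength` is made up for illustration and is not something text-icu exports. The exact cluster boundaries depend on the Unicode data of the ICU version you link against.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- | Hypothetical helper (not part of text-icu): count the grapheme
-- clusters in a 'Text' by splitting it with ICU's character break
-- iterator under the root locale and counting the resulting pieces.
graphemeLength :: Text -> Int
graphemeLength = length . ICU.breaks (ICU.breakCharacter ICU.Root)

main :: IO ()
main = do
  -- 'e' followed by U+0301 COMBINING ACUTE ACCENT:
  -- two code points that form a single grapheme cluster.
  let eAcute = "e\x0301" :: Text
  print (T.length eAcute)        -- 2 (code points)
  print (graphemeLength eAcute)  -- 1 (grapheme cluster)

  -- The facepalm emoji from this thread, spelled out as its five
  -- code points (base emoji, skin tone, ZWJ, male sign, variation selector).
  let facepalm = "\x1F926\x1F3FC\x200D\x2642\xFE0F" :: Text
  print (T.length facepalm)        -- 5 (code points)
  -- Expected to be a single cluster with an ICU recent enough to know
  -- emoji ZWJ sequences; older ICU data may split it differently.
  print (graphemeLength facepalm)
```

`T.length` counts code points, which is why it reports 2 and 5 here, while the break iterator groups them into user-perceived characters.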

andreasabel, Oct 07 '23 14:10

Oh yeah definitely! I'm quite surprised it's not offered by the library directly. Thanks @andreasabel!

Kleidukos, Oct 07 '23 14:10