rgbds Handle UTF-8 string case conversion properly

STRUPR and STRLWR do not handle non-ASCII text properly, as in the STRUPR call below:

issotm@sheik-kitty ~/rgbds% cat test/asm/string.asm
    PRINTT STRCAT("Left", "right\n")
    PRINTT STRUPR("Garçon, café, s'il vous plaît !\n")
    PRINTT STRLWR("\"Hello!\" 「今日は！」\n")
issotm@sheik-kitty ~/rgbds% ./rgbasm test/asm/string.asm
Leftright
GARçON, CAFé, S'IL VOUS PLAîT !
"hello!" 「今日は！」
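
In the output above, the ASCII letters are converted while ç, é and î are left alone. As an illustration of what it takes to go one step further, here is a minimal C sketch that also uppercases the Latin-1 Supplement letters U+00E0..U+00FE directly on their UTF-8 bytes. The function name `upper_latin1_utf8` is hypothetical and this is not RGBDS code; it only covers one small block of Unicode, and anything outside that block (for example ÿ or ß) is left untouched.

```c
/* Hypothetical helper, not part of RGBDS.
 * Uppercases, in place, the ASCII letters a-z and the Latin-1 Supplement
 * letters U+00E0..U+00FE (except U+00F7, the division sign) in a
 * NUL-terminated UTF-8 string. All other bytes are left unchanged. */
static void upper_latin1_utf8(char *s)
{
    unsigned char *p = (unsigned char *)s;

    while (*p) {
        if (*p >= 'a' && *p <= 'z') {
            /* Plain ASCII: lowercase and uppercase differ by 0x20. */
            *p -= 0x20;
            p++;
        } else if (p[0] == 0xC3 && p[1] >= 0xA0 && p[1] <= 0xBE && p[1] != 0xB7) {
            /* U+00E0..U+00FE are encoded as C3 A0..C3 BE; their uppercase
             * forms U+00C0..U+00DE are C3 80..C3 9E (e.g. C3 A7 -> C3 87). */
            p[1] -= 0x20;
            p += 2;
        } else {
            /* Any other byte, including other multi-byte sequences. */
            p++;
        }
    }
}
```

With this sketch, the STRUPR test string would come out as `GARÇON, CAFÉ, S'IL VOUS PLAÎT !`, but every additional script or special case would need more hand-written ranges or a data table.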

Processing Unicode correctly beyond UTF-8 encoding/decoding is difficult, so it would probably be best to use an external library for this. 0.4.3 / 0.5.0 already changed dependencies (Yacc → Bison), so this is probably a good opportunity. Two questions, then:

Which library should we use, or should we roll our own? The Unicode consortium FAQ recommends ICU.
Should we directly link against it (handled by compiler, cross-platform, no extra complexity), or dynamically load it (dependency optional)?

Dec 12 '20 14:12 ISSOtm

ICU's license is a bit special, so you might want to consider making it an optional component, which would allow you to not distribute it with RGBDS (even in binary form).

Dec 12 '20 21:12 aaaaaa123456789

The ICU library (libicudata.a, libicui18n.a, libicuio.a, libicutest.a, libicutu.a, libicuuc.a) is around 38MB and libunistring (libunistring.a) is around 2MB, which is unacceptable for static linking. Both take many minutes to compile even on a good computer and require a lot of dependencies, including Python for ICU. On the other hand libgrapheme (libgrapheme.a) only weighs in at around 40K and is compiled (including Unicode data parsing) in fractions of a second, requiring nothing but a C99 compiler and make(1).

While ICU and libunistring offer a lot of functions and the weight mostly comes from locale-data provided by the Unicode standard, which is applied implementation-specifically (!) for some things, the same standard always defines a sane 'default' behaviour as an alternative in such cases that is satisfying in 99% of the cases and which you can rely on.

-- https://libs.suckless.org/libgrapheme/

Dec 22 '21 20:12 Rangi42

If the only thing we need more Unicode handling for is case conversion, https://github.com/rust-lang/rust/blob/master/library/core/src/unicode/unicode_data.rs looks portable without needing an entire ICU library.

Oct 29 '23 17:10 Rangi42

Honestly I don't think we want to depend on a version of the Unicode Standard, and given RGBASM's existing ASCII reliance, I'm of the opinion that we should define the case conversion functions to only work on ASCII?

Oct 29 '23 18:10 ISSOtm

Yeah, that would be sensible.

Oct 29 '23 18:10 Rangi42

I'm happy with that approach.

Oct 29 '23 18:10 aaaaaa123456789
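
The ASCII-only behaviour agreed on above could be sketched in C roughly as follows. The function names are hypothetical and this is not the actual RGBDS implementation; the point is only that explicit range checks on `a`-`z` and `A`-`Z` make the result independent of the locale and of any Unicode version, and pass every non-ASCII UTF-8 byte through unchanged.

```c
/* Hypothetical helpers, not the actual RGBDS implementation.
 * Only the 26 ASCII letters are ever changed; every other byte,
 * including all bytes of multi-byte UTF-8 sequences, is kept as-is. */
static char ascii_upper(char c)
{
    return (c >= 'a' && c <= 'z') ? (char)(c - ('a' - 'A')) : c;
}

static char ascii_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? (char)(c + ('a' - 'A')) : c;
}

/* Uppercase a NUL-terminated string in place, ASCII letters only. */
static void str_upper_ascii(char *s)
{
    for (; *s; s++)
        *s = ascii_upper(*s);
}

/* Lowercase a NUL-terminated string in place, ASCII letters only. */
static void str_lower_ascii(char *s)
{
    for (; *s; s++)
        *s = ascii_lower(*s);
}
```

Under this definition, the output shown in the original report (`GARçON, CAFé, S'IL VOUS PLAîT !`) is the specified result rather than a bug.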
