rgbds
Handle UTF-8 string case conversion properly
STRUPR and STRLWR do not handle non-ASCII text properly, as in the STRUPR call below:
issotm@sheik-kitty ~/rgbds% cat test/asm/string.asm
PRINTT STRCAT("Left", "right\n")
PRINTT STRUPR("Garçon, café, s'il vous plaît !\n")
PRINTT STRLWR("\"Hello!\" 「今日は!」\n")
issotm@sheik-kitty ~/rgbds% ./rgbasm test/asm/string.asm
Leftright
GARçON, CAFé, S'IL VOUS PLAîT !
"hello!" 「今日は!」
Processing Unicode correctly beyond UTF-8 encoding/decoding is difficult, so it would probably be best to use an external library for this. 0.4.3 / 0.5.0 already changed dependencies (Yacc → Bison), so this is probably a good opportunity to add one. Two questions, then:
- Which library should we use, or should we roll our own? The Unicode consortium FAQ recommends ICU.
- Should we directly link against it (handled by compiler, cross-platform, no extra complexity), or dynamically load it (dependency optional)?
ICU's license is a bit special, so you might want to consider making it an optional component, which would allow you to not distribute it with RGBDS (even in binary form).
The ICU library (libicudata.a, libicui18n.a, libicuio.a, libicutest.a, libicutu.a, libicuuc.a) is around 38MB and libunistring (libunistring.a) is around 2MB, which is unacceptable for static linking. Both take many minutes to compile even on a good computer and require a lot of dependencies, including Python for ICU. On the other hand libgrapheme (libgrapheme.a) only weighs in at around 40K and is compiled (including Unicode data parsing) in fractions of a second, requiring nothing but a C99 compiler and make(1).
While ICU and libunistring offer a lot of functions and the weight mostly comes from locale-data provided by the Unicode standard, which is applied implementation-specifically (!) for some things, the same standard always defines a sane 'default' behaviour as an alternative in such cases that is satisfying in 99% of the cases and which you can rely on.
-- https://libs.suckless.org/libgrapheme/
If the only thing we need more Unicode handling for is case conversion, https://github.com/rust-lang/rust/blob/master/library/core/src/unicode/unicode_data.rs looks portable without needing an entire ICU library.
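For illustration, a table-driven conversion of this kind could look roughly like the C sketch below. It assumes the input has already been decoded from UTF-8 into code points, and it only handles simple one-to-one mappings (so cases like ß → SS are out of scope). The struct, table, and function names are hypothetical, and the three table entries are just the letters from the example above; a real table would have to be generated from the Unicode data files.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One simple (one-to-one) uppercase mapping. Hypothetical type, not from RGBDS. */
struct CaseMapping {
	uint32_t lower;
	uint32_t upper;
};

/* Illustrative excerpt only, sorted by `lower` so it can be binary-searched. */
static const struct CaseMapping upperTable[] = {
	{0x00E7, 0x00C7}, /* ç -> Ç */
	{0x00E9, 0x00C9}, /* é -> É */
	{0x00EE, 0x00CE}, /* î -> Î */
};

/* Comparison callback for bsearch: key is a code point, elem is a table entry. */
static int compareMapping(const void *key, const void *elem)
{
	uint32_t cp = *(const uint32_t *)key;
	const struct CaseMapping *mapping = elem;

	return (cp > mapping->lower) - (cp < mapping->lower);
}

/* Returns the uppercase form of `cp`, or `cp` itself if no mapping is known. */
static uint32_t codepointToUpper(uint32_t cp)
{
	assert(cp <= 0x10FFFF); /* Must be within the Unicode code point range */

	/* ASCII fast path, no table lookup needed */
	if (cp >= 'a' && cp <= 'z')
		return cp - 'a' + 'A';

	const struct CaseMapping *mapping =
		bsearch(&cp, upperTable, sizeof(upperTable) / sizeof(*upperTable),
			sizeof(*upperTable), compareMapping);

	return mapping ? mapping->upper : cp;
}
```

Even in this reduced form, the table has to track some version of the Unicode data, and the caller still needs UTF-8 decoding and re-encoding around it.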
Honestly, I don't think we want to depend on a version of the Unicode Standard. Given RGBASM's existing ASCII reliance, I'm of the opinion that we should define the case conversion functions to only work on ASCII.
Yeah, that would be sensible.
I'm happy with that approach.
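As a minimal sketch of what "only work on ASCII" could mean in practice, assuming strings are stored as NUL-terminated UTF-8: only the bytes for a-z and A-Z are changed, and every other byte, including all bytes of multi-byte UTF-8 sequences, is passed through untouched. The function names are hypothetical and not taken from the RGBDS source.

```c
#include <assert.h>
#include <stddef.h>

/* Uppercases ASCII letters in place; all other bytes are left as-is. */
static void asciiToUpper(char *str)
{
	assert(str != NULL);

	for (; *str; str++) {
		/* Explicit range check, so the result does not depend on the C locale */
		if (*str >= 'a' && *str <= 'z')
			*str = *str - 'a' + 'A';
	}
}

/* Lowercases ASCII letters in place; all other bytes are left as-is. */
static void asciiToLower(char *str)
{
	assert(str != NULL);

	for (; *str; str++) {
		if (*str >= 'A' && *str <= 'Z')
			*str = *str - 'A' + 'a';
	}
}
```

Because every byte of a multi-byte UTF-8 sequence is 0x80 or above, none of them can fall in the ASCII letter ranges, so non-ASCII characters survive intact. With this definition, the "GARçON, CAFé" output shown at the top is the documented behaviour rather than a bug.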