SDL_ttf
"Unicode" means "UCS-2"
It appears that when we say "Unicode" in SDL_ttf, we mean UCS-2 encoding (each char is 16 bits).
This covers the Basic Multilingual Plane, which handles an enormous amount of human language, but it does not cover the entirety of Unicode...and while probably no one cares about, I don't know, Klingon, the limitation means it can't do emoji glyphs, which people care about a lot.
The doesn't-break-ABI solution here is to say the "UNICODE" functions take UTF-16 encoding, which is an extension of UCS-2...most characters are the same, but there are some magic extension bits that make some codepoints take a two-value 16-bit sequence (a "surrogate pair"), which gets you access to values > 0xFFFF. This is what win32 ended up doing, in WinXP or so, so all their Unicode functions didn't change but could handle the higher values when they show up in a string. UTF-16 is kind of the worst of all worlds: variable size like UTF-8 but wastes bits like UCS-4...but it gets the job done in a backwards-compatible way.
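For illustration, the surrogate-pair math is small; this is just the standard UTF-16 encoding rule, not SDL_ttf code:

#include <stdint.h>

/* Standard UTF-16 rule (not SDL_ttf code): codepoints above 0xFFFF are
   split into a high/low surrogate pair (two 16-bit values); everything
   at or below 0xFFFF outside the surrogate range is stored as-is. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t) cp;                 /* same as plain UCS-2 */
        return 1;
    }
    cp -= 0x10000;
    out[0] = 0xD800 | (uint16_t) (cp >> 10);    /* high (lead) surrogate */
    out[1] = 0xDC00 | (uint16_t) (cp & 0x3FF);  /* low (trail) surrogate */
    return 2;
}

For example, U+1F600 becomes the pair 0xD83D 0xDE00.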
If we want to break ABI, change the Unicode functions to take a Uint32 instead of a Uint16 (UCS-4 encoding)...each codepoint takes 32 bits and we're good to go.
Otherwise, probably look for STR_UNICODE in the source code and see where it gets used, and clean out UCS-2isms.
(If we do nothing, these higher codepoint values are available to apps if they encode their strings in UTF-8, since that can already represent those values.)
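For example, assuming the loaded font actually contains an emoji glyph, something like this already works through the existing UTF-8 entry points:

#include <SDL_ttf.h>

/* U+1F600 encoded as the UTF-8 bytes F0 9F 98 80; no API change needed.
   (The font is assumed to contain a glyph for it.) */
static SDL_Surface *render_emoji(TTF_Font *font)
{
    SDL_Color white = { 255, 255, 255, 255 };
    return TTF_RenderUTF8_Blended(font, "\xF0\x9F\x98\x80", white);
}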
STR_UNICODE is used to indicate that the user is passing a UNICODE input parameter (so UCS-2), which is then converted to the internal string format (UTF-8) by calling UCS2_to_UTF8().
So the "doesn't-break-ABI" route would be to replace UCS2_to_UTF8() with a "UTF16_to_UTF8()" function ... (and likewise UCS2_to_UTF8_len())
https://github.com/libsdl-org/SDL_ttf/blob/main/SDL_ttf.c#L2772
/* Gets the number of bytes needed to convert a UCS2 string to UTF-8 */
static size_t UCS2_to_UTF8_len(const Uint16 *text)
{
    size_t bytes = 1;
    while (*text) {
        Uint16 ch = *text++;
        if (ch <= 0x7F) {
            bytes += 1;
        } else if (ch <= 0x7FF) {
            bytes += 2;
        } else {
            bytes += 3;
        }
    }
    return bytes;
}
https://github.com/libsdl-org/SDL_ttf/blob/main/SDL_ttf.c#L2804
/* Convert a UCS-2 string to a UTF-8 string */
static void UCS2_to_UTF8(const Uint16 *src, Uint8 *dst)
{
    SDL_bool swapped = TTF_byteswapped;
    while (*src) {
        Uint16 ch = *src++;
        if (ch == UNICODE_BOM_NATIVE) {
            swapped = SDL_FALSE;
            continue;
        }
        if (ch == UNICODE_BOM_SWAPPED) {
            swapped = SDL_TRUE;
            continue;
        }
        if (swapped) {
            ch = SDL_Swap16(ch);
        }
        if (ch <= 0x7F) {
            *dst++ = (Uint8) ch;
        } else if (ch <= 0x7FF) {
            *dst++ = 0xC0 | (Uint8) ((ch >> 6) & 0x1F);
            *dst++ = 0x80 | (Uint8) (ch & 0x3F);
        } else {
            *dst++ = 0xE0 | (Uint8) ((ch >> 12) & 0x0F);
            *dst++ = 0x80 | (Uint8) ((ch >> 6) & 0x3F);
            *dst++ = 0x80 | (Uint8) (ch & 0x3F);
        }
    }
    *dst = '\0';
}
The ABI-breaking route is probably the same idea: use UTF32_to_UTF8 functions and change the API prototypes to take Uint32 * instead of Uint16 *.
@slouken Shouldn't
        if (ch == UNICODE_BOM_NATIVE) {
            swapped = SDL_FALSE;
            continue;
        }
        if (ch == UNICODE_BOM_SWAPPED) {
            swapped = SDL_TRUE;
            continue;
        }
        if (swapped) {
            ch = SDL_Swap16(ch);
        }
also be present in UCS2_to_UTF8_len()?
Yep!
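Presumably the fix would just mirror the loop in UCS2_to_UTF8(); a sketch, not a committed patch:

/* Sketch: UCS2_to_UTF8_len() with the same BOM/byte-swap handling as
   UCS2_to_UTF8(), so byte-swapped input is measured correctly. */
static size_t UCS2_to_UTF8_len(const Uint16 *text)
{
    SDL_bool swapped = TTF_byteswapped;
    size_t bytes = 1;
    while (*text) {
        Uint16 ch = *text++;
        if (ch == UNICODE_BOM_NATIVE) {
            swapped = SDL_FALSE;
            continue;
        }
        if (ch == UNICODE_BOM_SWAPPED) {
            swapped = SDL_TRUE;
            continue;
        }
        if (swapped) {
            ch = SDL_Swap16(ch);
        }
        if (ch <= 0x7F) {
            bytes += 1;
        } else if (ch <= 0x7FF) {
            bytes += 2;
        } else {
            bytes += 3;
        }
    }
    return bytes;
}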
If we want to break ABI, change the Unicode functions to take a Uint32 instead of a Uint16 (UCS-4 encoding)...each codepoint takes 32 bits and we're good to go.
This seems like a bad reason to break ABI. Environments that care about preserving ABI (like the Steam Runtime) would have to continue to ship the old SONAME in parallel with the new one forever, except the old SONAME would no longer be receiving bug fixes, which seems bad...
If you want UCS-4 support, I'd suggest having a new family of functions like TTF_RenderUCS4_Solid() which convert UCS-4 to UTF-8.
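A hypothetical TTF_RenderUCS4_Solid() (the name above is a suggestion, not an existing function) could be a thin wrapper: convert the UCS-4 input to UTF-8 with something like the sketch below, then hand the result to the existing UTF-8 path.

/* Sketch: encode one UCS-4 codepoint as 1-4 UTF-8 bytes, returning the
   number of bytes written. A hypothetical TTF_RenderUCS4_Solid() could
   loop over its input with this and then call TTF_RenderUTF8_Solid(). */
static size_t UCS4_to_UTF8_char(Uint32 ch, Uint8 *dst)
{
    if (ch <= 0x7F) {
        dst[0] = (Uint8) ch;
        return 1;
    } else if (ch <= 0x7FF) {
        dst[0] = 0xC0 | (Uint8) (ch >> 6);
        dst[1] = 0x80 | (Uint8) (ch & 0x3F);
        return 2;
    } else if (ch <= 0xFFFF) {
        dst[0] = 0xE0 | (Uint8) (ch >> 12);
        dst[1] = 0x80 | (Uint8) ((ch >> 6) & 0x3F);
        dst[2] = 0x80 | (Uint8) (ch & 0x3F);
        return 3;
    } else {
        dst[0] = 0xF0 | (Uint8) ((ch >> 18) & 0x07);
        dst[1] = 0x80 | (Uint8) ((ch >> 12) & 0x3F);
        dst[2] = 0x80 | (Uint8) ((ch >> 6) & 0x3F);
        dst[3] = 0x80 | (Uint8) (ch & 0x3F);
        return 4;
    }
}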
Expanding the UNICODE functions to re-interpret their parameter as UTF-16 instead of UCS-2 also seems a reasonable route to take. This is a compatible change, because "surrogates" (the escape characters used to encode non-BMP codepoints in UTF-16) are technically not considered to be valid UCS-2 anyway.
There are basically three strategies for dealing with Unicode:
- standardize on UTF-8 and convert everything else to that (GTK, Harfbuzz, modern Linux in general, macOS, Rust)
- standardize on UTF-16 (or historically UCS-2) and convert everything else to that (Qt, Windows, Java)
- standardize on UCS-4 and convert everything else to that (rarely done)
SDL_ttf already converts its inputs to UTF-8 and works with UTF-8 internally, so it's basically already using the GTK/Linux/macOS/Rust strategy - which happens to be the one I prefer, because UTF-8 is fully backwards-compatible with ASCII, encodes "mostly-ASCII" text efficiently, is endian-neutral, and is overwhelmingly popular on the web.
UTF-16 combines the disadvantages of UCS-4 with the disadvantages of UTF-8, and I suspect nobody would be using it if Windows and Java hadn't needed an exit strategy from UCS-2.
UCS-4 is superficially appealing because each codepoint is fixed-byte-width, but things like combining characters and emoji modifiers mean that a codepoint isn't the same as a glyph, so counting codepoints is usually not actually the right thing to do.
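A concrete example of that (standard Unicode behaviour, nothing SDL_ttf-specific):

/* Each of these renders as one user-visible glyph/cluster, but contains a
   different number of codepoints, so codepoint counts don't map to glyphs. */
const char *e_precomposed = "\xC3\xA9";          /* U+00E9          -> 1 codepoint  */
const char *e_decomposed  = "\x65\xCC\x81";      /* U+0065 U+0301   -> 2 codepoints */
const char *thumbs_up     = "\xF0\x9F\x91\x8D"
                            "\xF0\x9F\x8F\xBD";  /* U+1F44D U+1F3FD -> 2 codepoints */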
There are already UCS4 versions of the API functions, and the next ABI break will remove everything but UTF-8 support, since that's what most people are already using and it's trivial to convert from UCS*/UTF* to that.
I'm not opposed to upgrading UCS2 to UTF-16, but otherwise I don't think we'll make any changes here.
I don't think we have a UCS4 version of the API! (And I don't think we should add one.) We've got: UTF8, UNICODE (so UCS2), and TEXT (Latin-1).
If someone has good knowledge of UCS-2/UTF-16, it sounds like a 10-20 line patch modifying the two functions above?
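To give an idea of the size, a rough sketch of what that UTF-16 upgrade to UCS2_to_UTF8() could look like (UCS2_to_UTF8_len() would need the matching change, counting 4 bytes per surrogate pair instead of 3 + 3):

/* Sketch only: same shape as UCS2_to_UTF8(), but a high/low surrogate
   pair is combined into a single codepoint and emitted as 4 UTF-8 bytes. */
static void UTF16_to_UTF8(const Uint16 *src, Uint8 *dst)
{
    SDL_bool swapped = TTF_byteswapped;
    while (*src) {
        Uint32 ch = *src++;
        if (ch == UNICODE_BOM_NATIVE) {
            swapped = SDL_FALSE;
            continue;
        }
        if (ch == UNICODE_BOM_SWAPPED) {
            swapped = SDL_TRUE;
            continue;
        }
        if (swapped) {
            ch = SDL_Swap16((Uint16) ch);
        }
        if (ch >= 0xD800 && ch <= 0xDBFF && *src) {
            Uint16 low = *src;
            if (swapped) {
                low = SDL_Swap16(low);
            }
            if (low >= 0xDC00 && low <= 0xDFFF) {
                ++src;
                ch = 0x10000 + (((ch - 0xD800) << 10) | (Uint32) (low - 0xDC00));
            }
        }
        if (ch <= 0x7F) {
            *dst++ = (Uint8) ch;
        } else if (ch <= 0x7FF) {
            *dst++ = 0xC0 | (Uint8) ((ch >> 6) & 0x1F);
            *dst++ = 0x80 | (Uint8) (ch & 0x3F);
        } else if (ch <= 0xFFFF) {
            *dst++ = 0xE0 | (Uint8) ((ch >> 12) & 0x0F);
            *dst++ = 0x80 | (Uint8) ((ch >> 6) & 0x3F);
            *dst++ = 0x80 | (Uint8) (ch & 0x3F);
        } else {
            *dst++ = 0xF0 | (Uint8) ((ch >> 18) & 0x07);
            *dst++ = 0x80 | (Uint8) ((ch >> 12) & 0x3F);
            *dst++ = 0x80 | (Uint8) ((ch >> 6) & 0x3F);
            *dst++ = 0x80 | (Uint8) (ch & 0x3F);
        }
    }
    *dst = '\0';
}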
the next ABI break will remove everything but UTF-8 support
If that's the case, then perhaps just mark the non-UTF-8 APIs as deprecated and don't otherwise change them? It doesn't seem particularly useful to add UCS-4 or UTF-16 support if it's just going to be removed again.