fluent-bit icon indicating copy to clipboard operation
fluent-bit copied to clipboard

Incorrect parsing of escaped characters from higher unicode planes in a JSON string

Open vit-zikmund opened this issue 1 year ago • 2 comments

Bug Report

Describe the bug All characters in a JSON string are by its specification Unicode and all can be escaped using the \u#### notation. This works only for codepoints in the Basic Multilingual Plane (U+0000 - U+FFFF), higher unicode planes, like the emojis are specified to be encoded as a utf-16 surrogate pair, e.g. \ud83e\udd17, the utf-16 surrogate pair for the "hugging face" emoji 🤗 U+1F917.

This escaped surrogate pair needs to be parsed as a single character, while fluent-bit parses them as two standalone unicode codepoints (i.e. U+D83E and U+DD17), which are in fact forbidden to appear in a correct Unicode string.

To Reproduce Setup a simple stdin to stdout pipeline, pass {"text": "\ud83e\udd17"} to stdin. Out comes a mangled

[{"date":1733946314.083699,"text":"������"}]

Expected behavior The output message should be:

[{"date":1733946229.895005,"text":"🤗"}]

Your Environment

  • Version used: 3.2.2
  • Configuration: -i stdin -o stdout
  • Environment name and version (e.g. Kubernetes? What version?): docker image, 3.2.2
  • Server type and version: n/a
  • Operating System and version: Fedora Linux 40
  • Filters and plugins: none

Additional context This is mangling python json module dumped data (its default is to use the escapes, so the string is actually ASCII) in our destination log database.

There's a workaround to make the dumper use Unicode strings, which don't trigger the problem in fluent-bit.

vit-zikmund avatar Dec 11 '24 20:12 vit-zikmund

FYI, the reason there are 6 replacement characters (�) in place of the 🤗 emoji is because its standalone high surrogate U+d83e is utf-8 encoded as ed a0 be (3 bytes, that don't map to a valid character) and the low surrogate U+dd17 is encoded as ed b4 97 (the other 3 invalid bytes).

vit-zikmund avatar Dec 12 '24 08:12 vit-zikmund

assigned @cosmo0920

edsiper avatar Dec 17 '24 16:12 edsiper

I realized I was referring to an obsoleted JSON RFC, however, nothing changed in the latest regarding Unicode escaping.

I also did a little code search and found the likely place of the problem - u8_read_escape_sequence:

    else if (str[0] == 'u') {
        while (i < size && hex_digit(str[i]) && dno < 4) {
            digs[dno++] = str[i++];
        }
        if (dno > 0) {
            ch = strtol(digs, NULL, 16);
        }
    }

The referenced code above simply converts one \uXXXX escape string to a unicode character (ch) without checking if it's a surrogate pair member. This check needs to be added.

Interestingly enough, just below the referenced code there's a logic parsing \Uxxxxxxxx, which is an escape sequence used in many programming languages and data exchange formats like C/C++, Python or YAML, but that's kinda it. JSON is a part of a different bunch (with Java and C#) that are using those awful surrogate pairs to achieve the same goal. This code can stay there, because it doesn't really hurt anything ;)

The following code could serve as a fix inspiration. As I'm far from being fluent-bit-fluent, I haven't properly tackled the several error cases, just returned -1 for show (wasn't sure if flb_error would be a good fit there):

static bool is_high_surrogate(uint32_t ch) {
    return ch >= 0xD800 && ch <= 0xDBFF;
}

static bool is_low_surrogate(uint32_t ch) {
    return ch >= 0xDC00 && ch <= 0xDFFF;
}

static uint32_t combine_surrogates(uint32_t high, uint32_t low) {
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00));

static int u8_read_escape_sequence(const char *str, int size, uint32_t *dest)
// ....
    if (str[0] == 'u') {
        // Parse the first 4 hex digits
        while (i < size && hex_digit(str[i]) && dno < 4) {
            digs[dno++] = str[i++];
        }
        if (dno == 4) {
            ch = strtol(digs, NULL, 16);
            if (is_low_surrogate(ch)) {
                // Invalid: low surrogate without preceding high surrogate
                return -1;
            } else if (is_high_surrogate(ch)) {
                // Handle surrogate pair
                if (i + 6 < size && str[i] == '\\' && str[i + 1] == 'u') {
                    dno = 0;
                    i += 2; // Skip "\u"
                    while (i < size && hex_digit(str[i]) && dno < 4) {
                        digs[dno++] = str[i++];
                    }
                    if (dno == 4) {
                        uint32_t low = strtol(digs, NULL, 16);
                        if (is_low_surrogate(low)) {
                            ch = combine_surrogates(ch, low);
                        } else {
                            // Invalid: high surrogate not followed by low surrogate
                            return -1;
                        }
                    } else {
                        // Incomplete low surrogate
                        return -1;
                    }
                } else {
                    // Invalid: high surrogate not followed by \u
                    return -1;
                }
            }
        } else {
            // Incomplete \u escape sequence
            return -1;
        }
// ...

In case of those errors one could set ch to the Unicode Replacement Character to not die entirely.

vit-zikmund avatar Jan 03 '25 16:01 vit-zikmund