Incorrect parsing of escaped characters from higher unicode planes in a JSON string
Bug Report
Describe the bug
All characters in a JSON string are by its specification Unicode and all can be escaped using the \u#### notation. This works only for codepoints in the Basic Multilingual Plane (U+0000 - U+FFFF), higher unicode planes, like the emojis are specified to be encoded as a utf-16 surrogate pair, e.g. \ud83e\udd17, the utf-16 surrogate pair for the "hugging face" emoji 🤗 U+1F917.
This escaped surrogate pair needs to be parsed as a single character, while fluent-bit parses them as two standalone unicode codepoints (i.e. U+D83E and U+DD17), which are in fact forbidden to appear in a correct Unicode string.
To Reproduce
Setup a simple stdin to stdout pipeline, pass {"text": "\ud83e\udd17"} to stdin.
Out comes a mangled
[{"date":1733946314.083699,"text":"������"}]
Expected behavior The output message should be:
[{"date":1733946229.895005,"text":"🤗"}]
Your Environment
- Version used: 3.2.2
- Configuration:
-i stdin -o stdout - Environment name and version (e.g. Kubernetes? What version?): docker image, 3.2.2
- Server type and version: n/a
- Operating System and version: Fedora Linux 40
- Filters and plugins: none
Additional context
This is mangling python json module dumped data (its default is to use the escapes, so the string is actually ASCII) in our destination log database.
There's a workaround to make the dumper use Unicode strings, which don't trigger the problem in fluent-bit.
FYI, the reason there are 6 replacement characters (�) in place of the 🤗 emoji is because its standalone high surrogate U+d83e is utf-8 encoded as ed a0 be (3 bytes, that don't map to a valid character) and the low surrogate U+dd17 is encoded as ed b4 97 (the other 3 invalid bytes).
assigned @cosmo0920
I realized I was referring to an obsoleted JSON RFC, however, nothing changed in the latest regarding Unicode escaping.
I also did a little code search and found the likely place of the problem - u8_read_escape_sequence:
else if (str[0] == 'u') {
while (i < size && hex_digit(str[i]) && dno < 4) {
digs[dno++] = str[i++];
}
if (dno > 0) {
ch = strtol(digs, NULL, 16);
}
}
The referenced code above simply converts one \uXXXX escape string to a unicode character (ch) without checking if it's a surrogate pair member. This check needs to be added.
Interestingly enough, just below the referenced code there's a logic parsing \Uxxxxxxxx, which is an escape sequence used in many programming languages and data exchange formats like C/C++, Python or YAML, but that's kinda it. JSON is a part of a different bunch (with Java and C#) that are using those awful surrogate pairs to achieve the same goal. This code can stay there, because it doesn't really hurt anything ;)
The following code could serve as a fix inspiration. As I'm far from being fluent-bit-fluent, I haven't properly tackled the several error cases, just returned -1 for show (wasn't sure if flb_error would be a good fit there):
static bool is_high_surrogate(uint32_t ch) {
return ch >= 0xD800 && ch <= 0xDBFF;
}
static bool is_low_surrogate(uint32_t ch) {
return ch >= 0xDC00 && ch <= 0xDFFF;
}
static uint32_t combine_surrogates(uint32_t high, uint32_t low) {
return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00));
static int u8_read_escape_sequence(const char *str, int size, uint32_t *dest)
// ....
if (str[0] == 'u') {
// Parse the first 4 hex digits
while (i < size && hex_digit(str[i]) && dno < 4) {
digs[dno++] = str[i++];
}
if (dno == 4) {
ch = strtol(digs, NULL, 16);
if (is_low_surrogate(ch)) {
// Invalid: low surrogate without preceding high surrogate
return -1;
} else if (is_high_surrogate(ch)) {
// Handle surrogate pair
if (i + 6 < size && str[i] == '\\' && str[i + 1] == 'u') {
dno = 0;
i += 2; // Skip "\u"
while (i < size && hex_digit(str[i]) && dno < 4) {
digs[dno++] = str[i++];
}
if (dno == 4) {
uint32_t low = strtol(digs, NULL, 16);
if (is_low_surrogate(low)) {
ch = combine_surrogates(ch, low);
} else {
// Invalid: high surrogate not followed by low surrogate
return -1;
}
} else {
// Incomplete low surrogate
return -1;
}
} else {
// Invalid: high surrogate not followed by \u
return -1;
}
}
} else {
// Incomplete \u escape sequence
return -1;
}
// ...
In case of those errors one could set ch to the Unicode Replacement Character to not die entirely.