Illegal Unicode Escaping
This is kind of a "gotcha" type issue but I think my reasoning is sound.
nanojson parses raw (unescaped) characters correctly, and goes to great
lengths to make sure the UTF-8 is valid, in the consumeTokenStringUtf8Char
method. But when characters are written as \uXXXX escapes,
it uses the following simplified logic:
```java
int escaped = 0;
for (int i = 0; i < 4; i++) {
    escaped <<= 4;
    int digit = buffer[index++];
    if (digit >= '0' && digit <= '9') {
        escaped |= (digit - '0');
    } else if (digit >= 'A' && digit <= 'F') {
        escaped |= (digit - 'A') + 10;
    } else if (digit >= 'a' && digit <= 'f') {
        escaped |= (digit - 'a') + 10;
    } else {
        throw createParseException(null, "Expected unicode hex escape character: "
                + (char) digit + " (" + digit + ")", false);
    }
}
reusableBuffer.append((char) escaped);
```
reusableBuffer.append((char) escaped);
This is not sufficient, because there are illegal combinations of escaped Unicode characters (as well as illegal single escapes), which I will now describe.
The JSON spec states that any character may be escaped. The method to escape characters outside of the Basic Multilingual Plane (U+0000 through U+FFFF) is as follows:
> To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
In other words, characters above U+FFFF are encoded as surrogate pairs. Thus, the following JSON: "\uD800\uDC00" should decode to the Unicode character: U+10000.
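For reference, a surrogate pair maps back to its code point with the standard UTF-16 arithmetic. A minimal sketch (`decodePair` is an illustrative name; `Character.toCodePoint` performs the same computation):

```java
public class SurrogateDecode {
    // Combine a high/low surrogate pair into the code point it encodes.
    static int decodePair(char hi, char lo) {
        // Same arithmetic as Character.toCodePoint(hi, lo)
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }

    public static void main(String[] args) {
        // "\uD800\uDC00" decodes to U+10000
        System.out.println(Integer.toHexString(decodePair('\uD800', '\uDC00'))); // 10000
    }
}
```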
The following JSON strings:

1. `"\uD800"`
2. `"\uD800\uCC00"`

should fail parsing because the high surrogate (U+D800) is:

1. not followed by another escaped character
2. not followed by a low surrogate
In other words, the following tests should pass:

```java
@Test
public void testFailBustedString8() {
    try {
        // High surrogate not followed by another escaped character
        JsonParser.any().from("\"\\uD800\"");
        fail();
    } catch (JsonParserException e) {
        testException(e, 1, 7); // 7 may be the wrong column, but is irrelevant for the issue
    }
}

@Test
public void testFailBustedString9() {
    try {
        // High surrogate not followed by a low surrogate
        JsonParser.any().from("\"\\uD800\\uCC00\"");
        fail();
    } catch (JsonParserException e) {
        testException(e, 1, 7); // 7 may be the wrong column, but is irrelevant for the issue
    }
}
```
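One way to reject both cases is to validate surrogate pairing over the decoded text. A sketch of the check only (the `isWellFormedUtf16` helper is hypothetical, not nanojson code; in the parser itself the same condition could be tested incrementally as each escape is appended):

```java
public class SurrogateCheck {
    // Hypothetical helper: true iff every high surrogate is immediately
    // followed by a low surrogate, and no low surrogate appears unpaired.
    static boolean isWellFormedUtf16(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // case 1 and case 2 from above
                }
                i++; // skip the paired low surrogate
            } else if (Character.isLowSurrogate(c)) {
                return false; // unpaired low surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("\uD800\uDC00")); // true
        System.out.println(isWellFormedUtf16("\uD800"));       // false
        System.out.println(isWellFormedUtf16("\uD800\uCC00")); // false
    }
}
```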
Thanks for your time, let me know if anything is not clear.
Great explanation. You are correct: the parser probably should not accept these. However, since JavaScript's JSON methods do stringify and parse illegal surrogates (at least Firefox did in my quick test), this should probably be gated behind a parser flag that is not set by default.
Ideally the parser and writer would have a "strict UTF-16 surrogate" mode, e.g.:
`JsonWriter.strictSurrogateSupport().indent("  ") ...`
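As a side note on why a lenient default is lossy on output: Java itself cannot encode a lone surrogate to UTF-8, and `String.getBytes` substitutes the encoder's replacement byte `'?'`. A small demo of the platform behavior (not nanojson code):

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogateDemo {
    public static void main(String[] args) {
        // A lone high surrogate is representable in Java's UTF-16 strings...
        String lone = "\uD800";
        // ...but has no valid UTF-8 encoding; getBytes() substitutes '?' (0x3F).
        byte[] utf8 = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length + " " + Integer.toHexString(utf8[0] & 0xFF)); // 1 3f
    }
}
```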