jackson-core icon indicating copy to clipboard operation
jackson-core copied to clipboard

Poor support for UTF-8 without CANONICALIZE_FIELD_NAMES and non UTF-8 encoded byte payloads

Open fabienrenaud opened this issue 5 years ago • 1 comments

In ByteSourceJsonBootstrapper#L247:

    public JsonParser constructParser(ObjectReadContext readCtxt,
            int streamReadFeatures, int formatReadFeatures,
            ByteQuadsCanonicalizer rootByteSymbols, CharsToNameCanonicalizer rootCharSymbols,
            int factoryFeatures) throws IOException
    {
        JsonEncoding enc = detectEncoding();

        if (enc == JsonEncoding.UTF8) {
            /* and without canonicalization, byte-based approach is not performant; just use std UTF-8 reader
             * (which is ok for larger input; not so hot for smaller; but this is not a common case)
             */
            if (JsonFactory.Feature.CANONICALIZE_FIELD_NAMES.enabledIn(factoryFeatures)) {
                ByteQuadsCanonicalizer can = rootByteSymbols.makeChild(factoryFeatures);
                return new UTF8StreamJsonParser(readCtxt, _context,
                        streamReadFeatures, formatReadFeatures, _in, can,
                        _inputBuffer, _inputPtr, _inputEnd, _bufferRecyclable);
            }
        }
        return new ReaderBasedJsonParser(readCtxt, _context, streamReadFeatures, formatReadFeatures,
                constructReader(),
                rootCharSymbols.makeChild(factoryFeatures));
    }

The if block is clearly an optimization and skipped when the CANONICALIZE_FIELD_NAMES feature is disabled.

This means existing tests should pass when this block is skipped. However, commenting said block causes 9 test failures (7 seem legit):

> mvn test
...
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR]   LocationInArrayTest.testOffsetInArraysBytes:12->_testOffsetInArrays:29->_assertLocation:61 expected:<2> but was:<-1>
[ERROR]   LocationInObjectTest.testOffsetWithObjectFieldsUsingUTF8:18 expected:<1> but was:<-1>
[ERROR]   LocationOffsetsTest.testOffsetWithInputOffset:68 expected:<0> but was:<-1>
[ERROR]   LocationOffsetsTest.testSimpleInitialOffsets:41 expected:<0> but was:<-1>
[ERROR]   TestArrayParsing.testInvalidExtraComma:70->BaseTest.verifyException:481 Expected an exception with one of substrings ([expected a value]): got one with message "Unexpected character (']' (code 93)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: (ByteArrayInputStream); line: 1, column: 8]"
[ERROR]   TestParserClosing.testReleaseContentBytes:110 expected:<6> but was:<-1>
[ERROR]   JsonParserTest.testUtf8BOMHandling:443 expected:<0> but was:<-1>
[ERROR]   SymbolTableMergingTest.testByteSymbolsWithClose:33->_testWithClose:87->_getParser:106 expected:<class com.fasterxml.jackson.core.json.UTF8StreamJsonParser> but was:<class com.fasterxml.jackson.core.json.ReaderBasedJsonParser>
[ERROR]   SymbolTableMergingTest.testByteSymbolsWithEOF:39->_getParser:106 expected:<class com.fasterxml.jackson.core.json.UTF8StreamJsonParser> but was:<class com.fasterxml.jackson.core.json.ReaderBasedJsonParser>
[ERROR] Errors:
[ERROR]   JsonParserTest.testBytesAsSource:418 » JsonParse Illegal character ((CTRL-CHAR...
[ERROR]   TestSymbolTables.testByteBasedSymbolTable:149->_findSymbols:178 ClassCast com....
[INFO]
[ERROR] Tests run: 870, Failures: 9, Errors: 2, Skipped: 0

Or, rather than removing the entire block, the JsonFactory in test classes such as LocationOffsetsTest, BaseTest etc. can be altered to disable the canonicalize_field_names feature:

    final JsonFactory JSON_F = new JsonFactory(new JsonFactoryBuilder()
        .disable(TokenStreamFactory.Feature.CANONICALIZE_FIELD_NAMES));

fabienrenaud avatar Apr 24 '19 06:04 fabienrenaud

Thank you for reporting this. Looks like there are different root causes for test failures; both paths should be checked probably.

Tricky one is the one wrt byte vs char offsets: tricky not because it'd be difficult to change test to match assumptions, but rather that behavior is actually bit weird from user perspective -- if you feed byte-based source, it'd seem reasonable to get byte-based offsets.

I'll have to think what a good solution would be. I don't think much can be done for 2.9 (except possibly changing tests).

cowtowncoder avatar Apr 24 '19 23:04 cowtowncoder