jackson-core
jackson-core copied to clipboard
Poor support for UTF-8 without CANONICALIZE_FIELD_NAMES and non UTF-8 encoded byte payloads
In ByteSourceJsonBootstrapper#L247:
public JsonParser constructParser(ObjectReadContext readCtxt,
int streamReadFeatures, int formatReadFeatures,
ByteQuadsCanonicalizer rootByteSymbols, CharsToNameCanonicalizer rootCharSymbols,
int factoryFeatures) throws IOException
{
JsonEncoding enc = detectEncoding();
if (enc == JsonEncoding.UTF8) {
/* and without canonicalization, byte-based approach is not performant; just use std UTF-8 reader
* (which is ok for larger input; not so hot for smaller; but this is not a common case)
*/
if (JsonFactory.Feature.CANONICALIZE_FIELD_NAMES.enabledIn(factoryFeatures)) {
ByteQuadsCanonicalizer can = rootByteSymbols.makeChild(factoryFeatures);
return new UTF8StreamJsonParser(readCtxt, _context,
streamReadFeatures, formatReadFeatures, _in, can,
_inputBuffer, _inputPtr, _inputEnd, _bufferRecyclable);
}
}
return new ReaderBasedJsonParser(readCtxt, _context, streamReadFeatures, formatReadFeatures,
constructReader(),
rootCharSymbols.makeChild(factoryFeatures));
}
The if
block is clearly an optimization and skipped when the CANONICALIZE_FIELD_NAMES
feature is disabled.
This means existing tests should pass when this block is skipped. However, commenting said block causes 9 test failures (7 seem legit):
> mvn test
...
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR] LocationInArrayTest.testOffsetInArraysBytes:12->_testOffsetInArrays:29->_assertLocation:61 expected:<2> but was:<-1>
[ERROR] LocationInObjectTest.testOffsetWithObjectFieldsUsingUTF8:18 expected:<1> but was:<-1>
[ERROR] LocationOffsetsTest.testOffsetWithInputOffset:68 expected:<0> but was:<-1>
[ERROR] LocationOffsetsTest.testSimpleInitialOffsets:41 expected:<0> but was:<-1>
[ERROR] TestArrayParsing.testInvalidExtraComma:70->BaseTest.verifyException:481 Expected an exception with one of substrings ([expected a value]): got one with message "Unexpected character (']' (code 93)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: (ByteArrayInputStream); line: 1, column: 8]"
[ERROR] TestParserClosing.testReleaseContentBytes:110 expected:<6> but was:<-1>
[ERROR] JsonParserTest.testUtf8BOMHandling:443 expected:<0> but was:<-1>
[ERROR] SymbolTableMergingTest.testByteSymbolsWithClose:33->_testWithClose:87->_getParser:106 expected:<class com.fasterxml.jackson.core.json.UTF8StreamJsonParser> but was:<class com.fasterxml.jackson.core.json.ReaderBasedJsonParser>
[ERROR] SymbolTableMergingTest.testByteSymbolsWithEOF:39->_getParser:106 expected:<class com.fasterxml.jackson.core.json.UTF8StreamJsonParser> but was:<class com.fasterxml.jackson.core.json.ReaderBasedJsonParser>
[ERROR] Errors:
[ERROR] JsonParserTest.testBytesAsSource:418 » JsonParse Illegal character ((CTRL-CHAR...
[ERROR] TestSymbolTables.testByteBasedSymbolTable:149->_findSymbols:178 ClassCast com....
[INFO]
[ERROR] Tests run: 870, Failures: 9, Errors: 2, Skipped: 0
Or, rather than removing the entire block, the JsonFactory
in test classes such as LocationOffsetsTest
, BaseTest
etc. can be altered to disable the canonicalize_field_names
feature:
final JsonFactory JSON_F = new JsonFactory(new JsonFactoryBuilder()
.disable(TokenStreamFactory.Feature.CANONICALIZE_FIELD_NAMES));
Thank you for reporting this. Looks like there are different root causes for test failures; both paths should be checked probably.
Tricky one is the one wrt byte vs char offsets: tricky not because it'd be difficult to change test to match assumptions, but rather that behavior is actually bit weird from user perspective -- if you feed byte-based source, it'd seem reasonable to get byte-based offsets.
I'll have to think what a good solution would be. I don't think much can be done for 2.9 (except possibly changing tests).