jackson-core
Can't tell non-blocking parser what charset to use for decoding input
I'm using Jackson's non-blocking parser to implement a `BodySubscriber` for use with Java's non-blocking HTTP client. The parser is created by `JsonFactory#createNonBlockingByteArrayParser()` using the factory instance associated with the `ObjectMapper`. It works like a charm, but it seems to assume UTF-8 by default, and there is no way to tell it to use a different encoding (such as a non-UTF-8 encoding specified by the response headers).

I figured it might auto-detect the response body's encoding, as is the case with other parsers, but it turns out that it assumes all input is UTF-8. For example, this snippet crashes:
```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.async.ByteArrayFeeder;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.json.JsonMapper;

ObjectMapper mapper = new JsonMapper();
byte[] jsonBytes = "{\"Psst!\": \"I'm not UTF-8\"}".getBytes(StandardCharsets.UTF_16);
JsonParser asyncParser = mapper.getFactory().createNonBlockingByteArrayParser();
ByteArrayFeeder feeder = (ByteArrayFeeder) asyncParser.getNonBlockingInputFeeder();
feeder.feedInput(jsonBytes, 0, jsonBytes.length);
feeder.endOfInput();
Map<String, String> map = mapper.readValue(asyncParser, new TypeReference<>() {});
System.out.println(map);
```
It works fine if the JSON string is encoded with UTF-8.
Yes, you cannot specify other encodings, so it only works for UTF-8 and 7-bit ASCII (since that is a subset). This is a fundamental limitation, and it is unlikely that implementations for other encodings will be added. If support were to be added, it would likely require a version that handles byte-to-character decoding separately from tokenization, and that would be a full rewrite.

So: the non-blocking parser will only work on UTF-8 input. I should probably mention this better in the Javadocs.
I see...
I think in my case, then, I should use the non-blocking parser only if the response charset is UTF-8 or a subset of it, and otherwise fall back to loading the response as a string and deserializing from there. I agree that the Javadocs should mention this to clear up the confusion.
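An alternative to falling back to string-based deserialization would be to transcode the response bytes to UTF-8 up front, so the feeder only ever sees UTF-8. A minimal sketch (not part of Jackson's API; the `toUtf8` helper name is my own):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8Fallback {
    // Hypothetical helper: re-encode the payload so the non-blocking
    // parser's feeder only ever receives UTF-8 bytes.
    static byte[] toUtf8(byte[] body, Charset responseCharset) {
        if (StandardCharsets.UTF_8.equals(responseCharset)
                || StandardCharsets.US_ASCII.equals(responseCharset)) {
            return body; // already fine: UTF-8, or ASCII (a subset of UTF-8)
        }
        // Decode with the charset from the response headers, re-encode as UTF-8.
        return new String(body, responseCharset).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf16 = "{\"Psst!\": \"I'm not UTF-8\"}".getBytes(StandardCharsets.UTF_16);
        byte[] utf8 = toUtf8(utf16, StandardCharsets.UTF_16);
        // Round-trips the snippet from above into bytes the feeder can accept.
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

The cost is one extra decode/encode pass over the body, which defeats some of the streaming benefit, so it only makes sense for the (hopefully rare) non-UTF-8 responses.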
Right. The vast majority of JSON really should be UTF-8, especially considering that the only officially legal charsets are UTF-8, UTF-16 and UTF-32 (as per the original JSON specification). But there are so many broken systems that emit other encodings (ISO-8859-x) that... it is frustrating. And since a JSON document itself has no mechanism for declaring its encoding -- unlike XML, which has this capability! -- such documents are no longer stand-alone. But if you stick to the standard supported encodings, auto-detection does work (UTF-16 and UTF-32 can be auto-detected, distinct from UTF-8; Latin-1 and others can not).
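That auto-detection works because the original JSON spec (RFC 4627, section 3) guarantees the first two characters of a JSON text are ASCII, so the pattern of zero bytes among the first four bytes distinguishes UTF-8/16/32. A rough stdlib-only sketch of that heuristic (Jackson's actual detection also honors BOMs, which this omits):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class JsonCharsetDetect {
    // Null-byte heuristic from RFC 4627 section 3: since the first two
    // characters are ASCII, zero bytes among the first four bytes reveal
    // the Unicode encoding form and endianness.
    static Charset detect(byte[] b) {
        if (b.length < 4) return StandardCharsets.UTF_8;
        boolean z0 = b[0] == 0, z1 = b[1] == 0, z2 = b[2] == 0, z3 = b[3] == 0;
        if (z0 && z1 && z2 && !z3) return Charset.forName("UTF-32BE");
        if (!z0 && z1 && z2 && z3) return Charset.forName("UTF-32LE");
        if (z0 && !z1 && z2 && !z3) return StandardCharsets.UTF_16BE;
        if (!z0 && z1 && !z2 && z3) return StandardCharsets.UTF_16LE;
        // No zero bytes: UTF-8 -- indistinguishable from Latin-1 and friends,
        // which is exactly why those cannot be auto-detected.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(detect("{\"a\":1}".getBytes(StandardCharsets.UTF_8)));
        System.out.println(detect("{\"a\":1}".getBytes(StandardCharsets.UTF_16BE)));
    }
}
```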