
Can't tell non-blocking parser what charset to use for decoding input

Open mizosoft opened this issue 4 years ago • 3 comments

I'm using Jackson's non-blocking parser to implement a BodySubscriber for use with Java's non-blocking HTTP client. The parser is created by JsonFactory#createNonBlockingByteArrayParser() using the factory instance associated with the ObjectMapper. It's working like a charm, but it seems to assume UTF-8, and there is no way to tell it to use another encoding (such as one specified by the response headers).

I figured it might auto-detect the response body's encoding, as is the case with other parsers, but it turns out that it assumes all input is UTF-8. For example, this snippet crashes:

    ObjectMapper mapper = new JsonMapper();
    // UTF-16 bytes: the non-blocking parser chokes on these
    byte[] jsonBytes = "{\"Psst!\": \"I'm not UTF-8\"}".getBytes(StandardCharsets.UTF_16);
    JsonParser asyncParser = mapper.getFactory().createNonBlockingByteArrayParser();
    ByteArrayFeeder feeder = (ByteArrayFeeder) asyncParser.getNonBlockingInputFeeder();
    feeder.feedInput(jsonBytes, 0, jsonBytes.length);
    feeder.endOfInput();
    Map<String, String> map = mapper.readValue(asyncParser, new TypeReference<>() {}); // throws here
    System.out.println(map);

It works fine if the JSON string is encoded with UTF-8.
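For contrast, the same flow succeeds once the bytes are UTF-8. A minimal self-contained sketch (the class name and `parse()` helper are mine; the Jackson calls are the same ones used in the snippet above):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.async.ByteArrayFeeder;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.json.JsonMapper;

public class Utf8AsyncExample {
    static Map<String, String> parse() throws Exception {
        ObjectMapper mapper = new JsonMapper();
        // Same document shape, but encoded as UTF-8: the async parser accepts it.
        byte[] jsonBytes = "{\"Psst!\": \"I am UTF-8\"}".getBytes(StandardCharsets.UTF_8);
        JsonParser asyncParser = mapper.getFactory().createNonBlockingByteArrayParser();
        ByteArrayFeeder feeder = (ByteArrayFeeder) asyncParser.getNonBlockingInputFeeder();
        feeder.feedInput(jsonBytes, 0, jsonBytes.length);
        feeder.endOfInput();
        return mapper.readValue(asyncParser, new TypeReference<Map<String, String>>() {});
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse()); // {Psst!=I am UTF-8}
    }
}
```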

mizosoft avatar Feb 06 '20 03:02 mizosoft

Yes, you cannot specify other encodings, so the non-blocking parser only works for UTF-8 and 7-bit ASCII (which is a subset of UTF-8). This is a fundamental limitation, and it is unlikely that implementations for other encodings will be added. Supporting them would likely require a version that handles byte-to-character decoding separately from tokenization, and that would be a full rewrite.

So: the non-blocking parser will only work on UTF-8 input. I should probably state this more clearly in the Javadocs.

cowtowncoder avatar Feb 06 '20 18:02 cowtowncoder

I see...

I think in my case, then, I should use the non-blocking parser only if the response charset is UTF-8 (or a subset of it), and otherwise fall back to loading the response as a string and deserializing from there. I agree that the Javadocs should mention this to clear up the confusion.
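In code, that decision can be a simple charset check before choosing a parser (a sketch; `useAsyncParser` is a hypothetical helper of mine, not part of Jackson or the HTTP client API):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParserChoice {
    // The async parser only understands UTF-8, so pick it only when the
    // response charset is UTF-8 or its 7-bit ASCII subset.
    static boolean useAsyncParser(Charset responseCharset) {
        return StandardCharsets.UTF_8.equals(responseCharset)
                || StandardCharsets.US_ASCII.equals(responseCharset);
    }

    public static void main(String[] args) {
        // UTF-8 / ASCII bodies can be fed to the non-blocking parser directly...
        System.out.println(useAsyncParser(StandardCharsets.UTF_8));  // true
        // ...anything else should be decoded to a String first and handed
        // to the regular blocking readValue(String, ...) overloads.
        System.out.println(useAsyncParser(StandardCharsets.UTF_16)); // false
    }
}
```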

mizosoft avatar Feb 07 '20 03:02 mizosoft

Right. The vast majority of JSON really should be UTF-8, especially considering that the only officially legal charsets are UTF-8, UTF-16 and UTF-32 (as per the original JSON specification). But there are so many broken systems that emit other encodings (ISO-8859-x) that... it is frustrating. And since a JSON document itself has no mechanism for declaring its encoding -- unlike XML, which has this capability! -- such documents are not stand-alone any more. But if you stick to the standard encodings, auto-detection does work: UTF-16 and UTF-32 can be auto-detected and distinguished from UTF-8; Latin-1 and others cannot.
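The detection is possible because a standard JSON document necessarily begins with ASCII characters, so the pattern of zero bytes among the first four octets identifies the encoding. A simplified sketch of that heuristic (modeled on the table in RFC 4627, not on Jackson's actual bootstrapping code; BOM handling is omitted):

```java
import java.nio.charset.StandardCharsets;

public class JsonEncodingSniffer {
    // Simplified RFC 4627 heuristic: the first two characters of a standard
    // JSON text are ASCII, so zero bytes in the first four octets reveal
    // UTF-16/UTF-32. ISO-8859-x is indistinguishable from UTF-8 this way.
    static String sniff(byte[] b) {
        if (b.length >= 4) {
            if (b[0] == 0 && b[1] == 0 && b[2] == 0) return "UTF-32BE"; // 00 00 00 xx
            if (b[1] == 0 && b[2] == 0 && b[3] == 0) return "UTF-32LE"; // xx 00 00 00
        }
        if (b.length >= 2) {
            if (b[0] == 0) return "UTF-16BE"; // 00 xx
            if (b[1] == 0) return "UTF-16LE"; // xx 00
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(sniff("{}".getBytes(StandardCharsets.UTF_8)));    // UTF-8
        System.out.println(sniff("{}".getBytes(StandardCharsets.UTF_16BE))); // UTF-16BE
        System.out.println(sniff("{}".getBytes(StandardCharsets.UTF_16LE))); // UTF-16LE
    }
}
```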

cowtowncoder avatar Feb 07 '20 17:02 cowtowncoder