
Optimize `JsonNodeDeserialization` wrt recursion

Open · cowtowncoder opened this issue 2 years ago • 9 comments

(note: cleaved off of #2816, used to be bundled)

The current implementation of `JsonNode` deserialization is expensive for deeply nested Object and Array values because it uses recursion: for each small additional nesting level -- for arrays, just 2 bytes to encode `[` and `]` -- a new stack frame gets created. In practical terms this means it is possible to exhaust the JVM stack with a document whose nesting is on the order of ten thousand(s) of levels, depending on settings.

It should be possible, however, to replace the basic recursion with iteration, at the very least significantly reducing the amplification and preventing the cheapest potential DoS attacks.
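
A minimal sketch of the iterative approach, maintaining an explicit stack of container nodes instead of relying on the call stack (illustrative only, not the actual jackson-databind implementation):

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ContainerNode;
import com.fasterxml.jackson.databind.node.JsonNodeFactory;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

public class IterativeTreeReader {
    // Builds a JsonNode tree without recursion: nesting depth only grows the
    // explicit Deque, not the JVM call stack.
    public static JsonNode read(String json) throws IOException {
        JsonNodeFactory f = JsonNodeFactory.instance;
        Deque<ContainerNode<?>> stack = new ArrayDeque<>();
        JsonNode root = null;
        try (JsonParser p = new JsonFactory().createParser(json)) {
            for (JsonToken t = p.nextToken(); t != null; t = p.nextToken()) {
                JsonNode node;
                switch (t) {
                    case START_OBJECT: node = f.objectNode(); break;
                    case START_ARRAY: node = f.arrayNode(); break;
                    case END_OBJECT:
                    case END_ARRAY: stack.pop(); continue; // container complete
                    case FIELD_NAME: continue; // read via p.currentName() below
                    case VALUE_STRING: node = f.textNode(p.getText()); break;
                    case VALUE_NUMBER_INT: node = f.numberNode(p.getLongValue()); break;
                    case VALUE_NUMBER_FLOAT: node = f.numberNode(p.getDoubleValue()); break;
                    case VALUE_TRUE: node = f.booleanNode(true); break;
                    case VALUE_FALSE: node = f.booleanNode(false); break;
                    default: node = f.nullNode(); break; // VALUE_NULL etc.
                }
                if (root == null) {
                    root = node;
                } else if (stack.peek() instanceof ObjectNode) {
                    ((ObjectNode) stack.peek()).set(p.currentName(), node);
                } else {
                    ((ArrayNode) stack.peek()).add(node);
                }
                if (node instanceof ContainerNode) {
                    stack.push((ContainerNode<?>) node); // descend iteratively
                }
            }
        }
        return root;
    }
}
```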

cowtowncoder avatar Feb 13 '22 22:02 cowtowncoder

Note: the implementation was merged from branch `exp/iterative-jsonnode-2.13` (hash f93fd41028b6efcc7c41401dd271aa7d81da6cf3 ?) and included in release 2.13.0 -- this issue was added retroactively for tracking purposes.

cowtowncoder avatar Feb 13 '22 22:02 cowtowncoder

The same issue occurs when using e.g. `ObjectMapper.readTree(InputStream in)` where `in` is a JSON document consisting of plain opening and closing square brackets, e.g. `"[[[[[]]]]]"`, but with a depth of 50 million nested arrays. It takes a very long time to finish deserialization or throw an exception (with 50 million I actually never reached the end).
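
A reproduction along these lines (a hypothetical sketch, not necessarily the exact code used; lower `depth` to see it finish):

```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class DeepArrayRepro {
    public static void main(String[] args) throws Exception {
        int depth = 50_000_000; // the depth described above
        String json = "[".repeat(depth) + "]".repeat(depth); // Java 11+ String.repeat
        InputStream in = new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8));
        new ObjectMapper().readTree(in); // extremely slow at this depth
    }
}
```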

Looks like the recursion in `BaseNodeDeserializer._deserializeContainerNoRecursion` is the reason for this. Reproducible with jackson-databind 2.13.2.2.

I assume this is related to this issue, or should I open a new bug for it? Will the provided fix cover that part as well?

deniz-husaj avatar Apr 04 '22 10:04 deniz-husaj

@denizhusaj #3416 (which fixed the CVE from #2816) specifically fixed the StackOverflowError that deeply nested JSON can cause. This is no longer an issue for JsonNode, because in 2.13.0 the JsonNode implementation was changed to be iterative.

However, even the iterative implementation can still take a while if you feed it 50M arrays. It is simply a lot of tokens :)

yawkat avatar Apr 04 '22 10:04 yawkat

@yawkat So there are no plans to forbid deeply nested JSON Arrays?

But somehow I still get a StackOverflowError with deeply nested JSON Objects, for input streams like `{"abc0": {"abc1": {"abc2": {"abc3": {"abc4": {}}}}}}` but with a depth of e.g. 50,000. Is this expected? Same scenario as in my comment before.
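
Input of that shape can be generated like this (a hypothetical sketch, not necessarily the exact code used):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DeepObjectRepro {
    public static void main(String[] args) throws Exception {
        int depth = 50_000;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("{\"abc").append(i).append("\": "); // {"abc0": {"abc1": ...
        }
        sb.append("{}");
        for (int i = 0; i < depth; i++) {
            sb.append('}');
        }
        JsonNode node = new ObjectMapper().readTree(sb.toString()); // call under test
    }
}
```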

deniz-husaj avatar Apr 04 '22 11:04 deniz-husaj

@denizhusaj I cannot reproduce that issue. I tried a nested JsonNode object with 50,000 levels like in your example. It parsed just fine.

Maybe your error comes from JsonNode.toString? That will still fail with deep nesting; however, that is more of a debugging method.

yawkat avatar Apr 04 '22 11:04 yawkat

@yawkat yes, sorry, you are right: it comes from the toString()...

Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate com.fasterxml.jackson.databind.node.ObjectNode.toString()

But regarding deeply nested JSON arrays, will there be no depth limit?

deniz-husaj avatar Apr 04 '22 12:04 deniz-husaj

I can't say; that's Tatu's decision.

However, I'm not convinced a depth limit would completely solve "long documents take a long time to parse". In general you can allocate a lot of objects without very deep JSON, e.g. `[[[... 4000 levels ...]],[[... 4000 levels ...]], ... thousands of repetitions ...]`. This will not run into depth limits, but will still be fairly slow to parse (simply because there are lots of tokens); see the sketch below.
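
Such a document could be generated like this (a hypothetical sketch, assuming Java 11+ for `String.repeat`):

```java
class WideDocumentSketch {
    // Generates thousands of 4000-level nested arrays inside one outer array:
    // depth stays at levels + 1, but token count is repetitions * levels * 2 + 2.
    static String wideDocument(int repetitions, int levels) {
        StringBuilder sb = new StringBuilder("[");
        for (int r = 0; r < repetitions; r++) {
            if (r > 0) sb.append(',');
            sb.append("[".repeat(levels)).append("]".repeat(levels));
        }
        return sb.append(']').toString();
    }
}
```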

There is one problem that is unique to deeply nested JSON in particular (as opposed to other ways of producing many tokens): this line limits each expansion of the ContainerStack to at most 4000 additional elements, which means you can get a fairly large allocation for every 4000 tokens. However, there are always at most two of these arrays alive at a time, so this should not lead to excessive memory use and should not be a security risk. It does, however, reduce parsing performance for that particular document.
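
The growth policy being described behaves roughly as follows (an illustrative assumption about the strategy, not jackson-databind's actual code):

```java
import com.fasterxml.jackson.databind.JsonNode;
import java.util.Arrays;

class ContainerStackSketch {
    // Doubles the stack until the increment reaches 4000 slots, then grows by
    // at most 4000 at a time; each step copies the whole, increasingly large,
    // array, and old + new briefly coexist (the "two arrays alive" above).
    static JsonNode[] grow(JsonNode[] stack) {
        int len = stack.length;
        int newLen = len + Math.min(len, 4000); // growth increment capped at 4000
        return Arrays.copyOf(stack, newLen);
    }
}
```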

yawkat avatar Apr 04 '22 12:04 yawkat

@denizhusaj Could you file a separate issue for ObjectNode.toString() please? That sounds like a side issue worth addressing.

As to plans: yes, there is a plan, but it'd be via the lower-level streaming API:

https://github.com/FasterXML/jackson-core/issues/637

since handling it separately in every deserializer means more work and more configuration surface, this aspect (maximum input document size/complexity limits) seems better addressed with more general functionality. That said, I have not started work here; while conceptually simple, an actual high-performance implementation is not trivial, and there is a bit of API work to consider as well (wrt how to pass such configured limit settings). API/performance comes into play when passing limits down to the parser/generator input/output stack implementations; there is currently no way to pass such info.
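
(For reference, the kind of general limits facility described here later shipped as StreamReadConstraints in jackson-core 2.15, well after this comment; a sketch of configuring a depth limit, assuming 2.15+:)

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.StreamReadConstraints;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DepthLimitedMapper {
    // Configures a maximum nesting depth at the streaming level; databind
    // deserializers, including the JsonNode one, then inherit the limit.
    static ObjectMapper create() {
        JsonFactory factory = JsonFactory.builder()
                .streamReadConstraints(StreamReadConstraints.builder()
                        .maxNestingDepth(1_000) // deeper documents are rejected
                        .build())
                .build();
        return new ObjectMapper(factory);
    }
}
```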

cowtowncoder avatar Apr 04 '22 18:04 cowtowncoder

@cowtowncoder yes sure https://github.com/FasterXML/jackson-databind/issues/3447

deniz-husaj avatar Apr 05 '22 08:04 deniz-husaj