
Streaming parsing of JSON array in Spring WebClient

Open HaloFour opened this issue 5 years ago • 10 comments

I found #21862 which is pretty close to my request but closed.

I am currently using Spring WebClient with Spring Boot 2.2.6 and Spring Framework 5.2.5 to write a service that sits in front of a number of other upstream services and transforms their responses for public consumption. Some of these services respond with very large JSON payloads that are little more than an array of entities wrapped in a JSON document, usually with no other properties:

{
    "responseRoot": {
        "entities": [
            { "id": "1" },
            { "id": "2" },
            { "id": "n" }
        ]
    }
}

There could be many thousands of entities in this nested array and the entire payload can be tens of MBs. I want to be able to read in these entities through a Flux<T> so that I can transform them individually and write them out to the client without having to deserialize all of them into memory. This doesn't appear to be something that Spring WebFlux supports out of the box.

I'm currently exploring writing my own BodyExtractor which reuses some of the code in Jackson2Tokenizer to try to support this. My plan is to accept a JsonPointer to the location of the array and then parse asynchronously until I find that array, then to buffer the tokens for each array element to deserialize them.

var flux = client.get()
    .uri(uri)
    .exchange()
    .flatMapMany(r ->
        r.body(new StreamingBodyExtractor(JsonPointer.compile("/responseRoot/entities")))
    );
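The plan above (walk the document until the pointed-to array is reached, then buffer one element at a time) can be sketched without any library. This is only an illustration under simplifying assumptions: ArrayElementSplitter is a hypothetical name, the path is passed as a plain list of object keys rather than a Jackson JsonPointer, input is read from a blocking Reader instead of Jackson's non-blocking tokenizer, and keys are compared without unescaping. Each element is handed to the caller as raw JSON text, ready to be deserialized individually.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of Spring or Jackson.
class ArrayElementSplitter {

    // One open container on the way down from the document root.
    private static final class Frame {
        final boolean isObject;
        boolean expectKey; // true while the next string in an object is a key
        String key;        // most recent key seen in this object
        Frame(boolean isObject) { this.isObject = isObject; this.expectKey = isObject; }
    }

    // Scans the document and passes each element of the array at `path` to `sink`.
    static void forEachElement(Reader in, List<String> path, Consumer<String> sink) throws IOException {
        List<Frame> stack = new ArrayList<>();
        int c;
        while ((c = in.read()) != -1) {
            Frame top = stack.isEmpty() ? null : stack.get(stack.size() - 1);
            switch ((char) c) {
                case '"':
                    String s = readString(in);
                    if (top != null && top.isObject && top.expectKey) { top.key = s; top.expectKey = false; }
                    break;
                case '{': stack.add(new Frame(true)); break;
                case '[':
                    if (atTarget(stack, path)) { emitElements(in, sink); return; }
                    stack.add(new Frame(false));
                    break;
                case '}': case ']': stack.remove(stack.size() - 1); break;
                case ',': if (top != null && top.isObject) top.expectKey = true; break;
                default: break; // scalars, colons and whitespace outside the target are skipped
            }
        }
    }

    // True when the enclosing objects' current keys spell out the requested path.
    private static boolean atTarget(List<Frame> stack, List<String> path) {
        if (stack.size() != path.size()) return false;
        for (int i = 0; i < path.size(); i++) {
            Frame f = stack.get(i);
            if (!f.isObject || !path.get(i).equals(f.key)) return false;
        }
        return true;
    }

    // Reads up to the closing quote; escape sequences are kept as-is (not unescaped).
    private static String readString(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '"') {
            sb.append((char) c);
            if (c == '\\' && (c = in.read()) != -1) sb.append((char) c);
        }
        return sb.toString();
    }

    // Buffers only the current element, so memory use is bounded by the largest element.
    private static void emitElements(Reader in, Consumer<String> sink) throws IOException {
        StringBuilder buf = new StringBuilder();
        int depth = 0;
        boolean inString = false;
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (inString) {
                buf.append(ch);
                if (ch == '\\') { int n = in.read(); if (n != -1) buf.append((char) n); }
                else if (ch == '"') inString = false;
                continue;
            }
            if (ch == ']' && depth == 0) { flush(buf, sink); return; }
            if (ch == ',' && depth == 0) { flush(buf, sink); continue; }
            if (ch == '"') inString = true;
            else if (ch == '{' || ch == '[') depth++;
            else if (ch == '}' || ch == ']') depth--;
            buf.append(ch);
        }
    }

    private static void flush(StringBuilder buf, Consumer<String> sink) {
        String element = buf.toString().trim();
        if (!element.isEmpty()) sink.accept(element);
        buf.setLength(0);
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"responseRoot\":{\"entities\":[{\"id\":\"1\"},{\"id\":\"2\"}]}}";
        forEachElement(new StringReader(json), List.of("responseRoot", "entities"), System.out::println);
    }
}
```

A reactive version would presumably emit each buffered element into a Flux sink instead of a Consumer, and would have to cope with elements split across DataBuffer boundaries, which is the part Jackson's non-blocking parser takes care of.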

Before I go too far down this path I was curious if this was functionality that Spring would be interested in supporting out of the box.

Similarly, I was curious whether a WebFlux controller could stream out a response from a Flux<T>, with the streamed elements wrapped in a JSON array and possibly in a root JSON document as well.
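The framing this would need on the output side is small: write a fixed prefix, then the elements separated by commas, then a fixed suffix, without ever holding more than one element. The sketch below shows only that framing with plain Java. JsonArrayFramer and writeWrapped are hypothetical names, and the elements are assumed to be already serialized to JSON text. In WebFlux the same three-part shape would presumably be a concatenation of a prefix chunk, the encoded elements, and a suffix chunk.

```java
import java.io.IOException;
import java.util.Iterator;

// Hypothetical helper for illustration; not part of Spring.
class JsonArrayFramer {

    // Writes prefix + comma-separated elements + suffix, one element at a time.
    static void writeWrapped(Iterator<String> elements, String prefix, String suffix, Appendable out)
            throws IOException {
        out.append(prefix);
        boolean first = true;
        while (elements.hasNext()) {
            if (!first) {
                out.append(','); // separator goes before every element except the first
            }
            out.append(elements.next());
            first = false;
        }
        out.append(suffix);
    }
}
```

For the document shape in this issue, the prefix would be {"responseRoot":{"entities":[ and the suffix ]}}.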

HaloFour avatar Apr 21 '20 14:04 HaloFour

Here's a very quick-and-dirty BodyExtractor implementation:

https://gist.github.com/HaloFour/ce3063d4e693b495e3c194cbb2f66686

The actual token parsing could certainly be cleaned up but it gets the job done at least to the extent that existing integration tests in the project are passing.

HaloFour avatar Apr 23 '20 14:04 HaloFour

Also, not to pile up additional requests in a single issue, but I didn't see a way to use a BodyExtractor with retrieve() which would force me to manually interpret the HTTP status error codes. Is there a reason WebClient.ResponseSpec doesn't include a method that accepts a BodyExtractor?

HaloFour avatar Apr 23 '20 14:04 HaloFour

@HaloFour thanks for the proposal. This looks feasible and probably worth doing, but I'm mainly wondering what a more general solution would look like and how much more general it needs to be.

For example, take the case of multiple arrays, as in #21862. We could accept multiple JSON pointers, but it's less obvious how to represent the output, which logically is Flux<T1>, Flux<T2>, etc. but needs to be exposed sequentially, i.e. as Flux<Flux<?>>, which is not great for generics. It might as well be Flux<Object>, where the application has to check the Object type and downcast accordingly. An even more challenging question is what to do if you want to extract the surrounding Object structure, as in #25472.
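To make the Flux<Object> concern concrete, here is a plain-Java sketch of what the consuming side would look like when elements from two arrays arrive in one untyped sequence. User, Order and MixedStreamDemo are hypothetical names, and an Iterable<Object> stands in for the Flux<Object>.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only.
class MixedStreamDemo {

    // Element types of two different arrays in the same JSON document.
    static final class User { final String id; User(String id) { this.id = id; } }
    static final class Order { final int total; Order(int total) { this.total = total; } }

    // With an untyped sequence, every element has to be inspected and downcast by the caller.
    static List<String> describe(Iterable<Object> elements) {
        List<String> out = new ArrayList<>();
        for (Object element : elements) {
            if (element instanceof User) {
                out.add("user " + ((User) element).id);
            } else if (element instanceof Order) {
                out.add("order " + ((Order) element).total);
            } else {
                // The compiler cannot rule this branch out, which is the generics problem in a nutshell.
                throw new IllegalStateException("Unexpected element type: " + element.getClass());
            }
        }
        return out;
    }
}
```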

rstoyanchev avatar Oct 05 '20 11:10 rstoyanchev

Thanks for this, @HaloFour! This looks like what I was looking for (hence #25472). I'll give your Gist a try.

@rstoyanchev (just reiterating from #25472) I think it makes sense to focus on the most common case of a single array of a single type of object in the JSON response. The semantics of anything else, as you explain, become very hairy very quickly, and the applicability seems low for most real-world scenarios (imho).

fransflippo avatar Oct 05 '20 18:10 fransflippo

Thanks for taking a look! Here's a newer Gist based on the code that we're currently using in production.

HaloFour avatar Oct 05 '20 18:10 HaloFour

Yes, it makes sense to do something that would solve many cases. That said, other possible cases are not hard to find. Take #21862 for example, or even Elasticsearch: isn't it sometimes necessary to access something else besides the hits, like "search_after"?

rstoyanchev avatar Oct 06 '20 15:10 rstoyanchev

Going back to the original question: with the new API, how exactly do we extract the entities under responseRoot?

joedevgee avatar Jul 12 '22 20:07 joedevgee

toEntityFlux(streamingBodyExtractor.toFlux(MyClass.class, JsonPointer.compile("/pathToArray"))) worked for me. This seems very useful. Any chance this BodyExtractor can be added to Spring?

nilsga avatar Nov 14 '23 10:11 nilsga

For the original use case of JSON-pointing to an array in order to stream-parse it, I think it would be better to delegate that responsibility to Jackson and probably just offer a lightweight BodyExtractor adapter in Framework.

Unfortunately, even though in Jackson-Core there is a FilteringParserDelegate which can accept a JsonPointerBasedFilter, this doesn't work for async parsers for now (see https://github.com/FasterXML/jackson-core/issues/1144)...

@HaloFour maybe there's an opportunity to contribute something there?

simonbasle avatar Dec 05 '23 14:12 simonbasle

Sure, I can take a look at that.

HaloFour avatar Dec 05 '23 15:12 HaloFour

> For the original use case of JSON-pointing to an array in order to stream-parse it, I think it would be better to delegate that responsibility to Jackson and probably just offer a lightweight BodyExtractor adapter in Framework.
>
> Unfortunately, even though in Jackson-Core there is a FilteringParserDelegate which can accept a JsonPointerBasedFilter, this doesn't work for async parsers for now (see FasterXML/jackson-core#1144)...
>
> @HaloFour maybe there's an opportunity to contribute something there?

Just FYI, that issue is closed, but they didn't actually implement support for filtering during async parsing; they just throw an exception instead of staying in the infinite loop.

asardaes avatar Jul 23 '25 08:07 asardaes