req icon indicating copy to clipboard operation
req copied to clipboard

Streaming decompression?

Open adamwight opened this issue 1 month ago • 8 comments

I was surprised to find that the decompress_body step seems to skip decompression when request.into is specified. My use case is to stream a gzipped file through decompression and then into text processing logic, so my expectation was that into: IO.stream() would result in a stream of decompressed text. Instead, it results in an error:

** (ArgumentError) argument error
    (stdlib 6.2.2) io.erl:203: :io.put_chars(:standard_io, <<31, 139, 8, 0, 0, 9, 110, 136, 0, 255, 236, 86, 109, 111, 219, 54, 16, 206, 231, 253, 10, 142, 64, 191, 233, 133, 146, 223, 53, 24, 67, 154, 13, 216, 128, 109, 109, 151, 98, 64, 91, 21, 6, 77, 158, 98, 46, 18, 169, 145, ...>>)
    (req 0.5.16) lib/req/finch.ex:146: anonymous fn/3 in Req.Finch.finch_stream_into_collectable/5
    (finch 0.20.0) lib/finch.ex:377: anonymous fn/3 in Finch.stream/5
    (finch 0.20.0) lib/finch/http1/conn.ex:347: Finch.HTTP1.Conn.receive_response/8
    (finch 0.20.0) lib/finch/http1/conn.ex:131: Finch.HTTP1.Conn.request/8
    (finch 0.20.0) lib/finch/http1/pool.ex:71: anonymous fn/10 in Finch.HTTP1.Pool.request/6
    (nimble_pool 1.1.0) lib/nimble_pool.ex:462: NimblePool.checkout!/4
    iex:4: (file)

This matches the documented behavior which is great, but maybe I misunderstand the point of this limitation. Is streaming decompression simply a lacking feature, or is there a good reason to avoid it?

adamwight avatar Nov 18 '25 07:11 adamwight

I started working around this by writing streaming decompression as an :into function, but now I've also found that the compress step skips the accept-encoding header when streaming is in use. This makes sense as a default, but perhaps it should be overridable?

adamwight avatar Nov 18 '25 07:11 adamwight

Yes, compress/1 would not set accept-encoding in that case so you would need to send that yourself.

I believe I couldn’t figure out how to do streaming with decompression so that’s why that’s not supported.

wojtekmach avatar Nov 18 '25 08:11 wojtekmach

Thanks for the explanation! In my application I might try something with low-level :zlib commands, if it look reusable I can post the patch here.

adamwight avatar Nov 18 '25 09:11 adamwight

Now I'm thinking that decompression should always be streamed in order to utilize network and processor in parallel. That becomes a significant speed-up when fetching large, compressed files, so we don't have to wait for a monolithic decompression after the file is transferred.

The basic code to inflate a stream is straightforward, although some compression libraries might not include a streaming interface so these would fall back to collect the data and decompress at the end. For gzip streaming looks something like this:

# Initialization
z = :zlib.open()
:ok = :zlib.inflateInit(z, 16 + 15, :reset)
# Read loop
out = :zlib.inflate(z, in)
...
# Cleanup
:ok = :zlib.inflateEnd(z)
:ok = :zlib.close(z)

But I'm not sure how to fit this into the "step" abstraction. So far, response steps are written to assume a complete response.body, and :into support is something separate which calls Finch.stream in a tight loop inside the adapter.

One workaround within the current architecture might be to implement decompression by wrapping the caller's :into (or nil/identity) in a replacement :into to transform the stream, but then we're just shoving a different abstraction into a random place, I feel. This would be better done as a full Plug so that initialization and cleanup is more explicit, anyway. Maybe I'm just stuck in a pre-req way of thinking?

(This whole area seems to be a work in progress in Finch as well, its docs mention that streaming lacks back-pressure. I wonder if a Broadway or GenStage producer could be provided and wired through Req up to the caller when :into is compatible?)

adamwight avatar Nov 18 '25 18:11 adamwight

Right, streaming response body doesn't fit into the steps as they are right now. Or to be more precise it doesn't fit into all of them. For example, it's perfectly fine to have a redirect/1 or retry/1 step which just needs http status and headers.

An obvious solution would be to have some kind of resp.body which is a stream, an Enumerable. And the decompress/1 "does Stream.map on it", and so does decode_body/1. But such resp.body is not possible today without adding a separate process which I'd rather avoid.

About wrapping caller :into, that's kinda how https://hexdocs.pm/req/Req.Steps.html#checksum/1 is implemented today. The implementation has limitations (does not support into: :self) and in general I'm not super proud of how it all turned out but here we are. The way it's architected right now is we technically can have transformations on the streaming response body but we need to know those transformations before we start streaming (since we use Finch.stream which takes a function what to do with those chunks, a "push" model if you will) so by definition these cannot be "response steps". I haven't gone very far with it but one idea was to indeed implement decompress/1 and decode_body/1 as request steps, which all they do is register what to do with streaming response body. But then should probably move away from response steps (and error steps I guess) and move towards just a single type of steps that can compose transformations on resp.body and/or have lifecycle hooks akin to https://hexdocs.pm/plug/Plug.Conn.html#register_before_send/2.

About Finch, just to be precise, I'd differentiate between streaming (https://hexdocs.pm/finch/Finch.html#stream/5) which has back-pressure (it uses Mint.HTTP.recv under the hood, at least on HTTP/1) and async (https://hexdocs.pm/finch/Finch.html#async_request/3) which hasn't.

But yeah, in any case, my goal with Req was to implement as much as possible, ideally everything, as steps, and since I can't square streaming and decompression, this particular feature is not yet implemented.

wojtekmach avatar Nov 18 '25 22:11 wojtekmach

my goal with Req was to implement as much as possible, ideally everything, as steps

I like this and it was part of why I switched my consumer from Tesla to Req, it immediately improved both the application logic and tests. And I agree with going further in this direction if possible—when I squint at the question it really feels like the :into mechanism could be extracted into its own step and hidden from checksum, decompression, etc, in the way you suggested, I'll give that a try. If you have an experimental branch already sitting around, please do mention.

Simplifying request and response steps sounds great, I especially like the idea of turning run_request(%{current_request_steps: []} = request) into a normal step adapter(request) which is making changes inside of the request to transform incoming parameters into a live connection feeding data chunks into an "into". But it seems like the steps would still exhibit a request/response dichotomy... it's something to play around with.

About Finch, just to be precise [...]

Thanks for the correction, I understand a bit better now and it's a key distinction for how the streaming will be wired back up through Req. My naïve approach would be to consistently wrap "into" as a function which always points towards the caller but this only provides backpressure when the function is pipelining through a single process, for example. If this function were to send messages asynchronously (ie. into: :self) then the incoming queue could grow without limit. I guess it could be enough to ensure that the Req default is providing backpressure and concurrency of 1 and then the consumer can decide whether to take over that responsibility (eg. wire to a pooled processor, do something clever for full HTTP/2, etc.) or not.

adamwight avatar Nov 20 '25 06:11 adamwight

Even if the difference between request and response steps goes away, we would still need a way to order certain steps. Maybe an after option: Req.Request.attach_step(request, my_app: &MyApp.step/1, after: :adapter).

Collapsing error steps into the same abstract seems almost possible as well. Maybe it's okay to run through all of the "response" steps even when an error occurs? It looks like the separate error flow is coupled to a "retry" concept, the big affordance is that the error step can cause run_response to start over again. Perhaps in the more consolidated mechanism a retry step would tail-recurse into the entire request stack again.

adamwight avatar Nov 20 '25 08:11 adamwight

it immediately improved both the application logic and tests

+569 −1188 love to see it!

when I squint at the question it really feels like the :into mechanism could be extracted into its own step

I don't think so but I'd love to be proven wrong! In theory Req supports other adapters in Finch but in practice, it only does so unless you use :into and into: :self in particular; there's a contract but it's not documented and well tested with an alternative implementation. That is to say, :into <> run_finch/1 are very much tied but I concede separating them out would almost definitely be beneficial.

Just to set expectations, even though there are limitations, it's unlikely I'll make fundamental changes around steps and I intend to ship Req v1.0 with the existing architecture. However, if there's a clear reason to pivot, that's why there hasn't been a v1 yet.

wojtekmach avatar Nov 20 '25 10:11 wojtekmach