httpx
Respect chunked boundaries with response.iter_raw()
Currently we don't make any particular claims about how much data will be returned by `response.iter_raw()`, or how the data ends up being chunked by default.
It might be neat to provide some tighter behaviour here. For example, we could ensure that responses using chunked encoding always stream the data chunk-by-chunk, rather than in arbitrarily sized amounts.
h11 does give us enough information to implement that, and to ensure that the byte stream returned from httpcore corresponds to the chunk boundaries: https://h11.readthedocs.io/en/latest/api.html?highlight=lowercase#chunked-transfer-encoding-delimiters
Alongside this we could also provide some controls on httpcore for the size of the socket read buffer, which has an implication on the typical size of chunks you'd get from `iter_raw()` on non-chunked responses.
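To make "chunk boundaries" concrete: with `Transfer-Encoding: chunked`, each chunk is sent as `<hex size>\r\n<data>\r\n`, terminated by a zero-length chunk. A minimal hand-rolled decoder over made-up data (ignoring chunk extensions and trailers for simplicity):

```python
# A minimal sketch of the chunked transfer-encoding wire format,
# decoding a made-up body by hand.
raw = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"

def decode_chunked(data):
    chunks = []
    while True:
        # Each chunk starts with its size in hex, followed by CRLF.
        size_line, _, data = data.partition(b"\r\n")
        size = int(size_line, 16)
        if size == 0:
            # A zero-length chunk terminates the body.
            break
        chunks.append(data[:size])
        data = data[size + 2:]  # Skip the chunk data and its trailing CRLF.
    return chunks

print(decode_chunked(raw))  # [b'Wiki', b'pedia']
```

Those per-chunk boundaries are what the server chose when framing the response, and they're what `iter_raw()` could faithfully reproduce.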
Would just like to say that I would love this feature. Let me know if you would like some help with implementing it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Still valid thx, @stale bot.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Here's the relevant method in httpcore that deals with yielding the body content.
```python
def _receive_response_body(self, request: Request) -> Iterator[bytes]:
    timeouts = request.extensions.get("timeout", {})
    timeout = timeouts.get("read", None)

    while True:
        event = self._receive_event(timeout=timeout)
        if isinstance(event, h11.Data):
            yield bytes(event.data)
        elif isinstance(event, (h11.EndOfMessage, h11.PAUSED)):
            break
```
We can adapt it to make sure that it strictly yields complete chunks.
```python
def _receive_response_body(self, request: Request) -> Iterator[bytes]:
    timeouts = request.extensions.get("timeout", {})
    timeout = timeouts.get("read", None)

    chunk_buffer = []
    while True:
        event = self._receive_event(timeout=timeout)
        if isinstance(event, h11.Data):
            if event.chunk_start and not event.chunk_end:
                # If we get the start of a chunk, then switch to chunk buffering.
                chunk_buffer = [bytes(event.data)]
            elif chunk_buffer:
                # If we're buffering, then append the chunk.
                chunk_buffer.append(bytes(event.data))
                if event.chunk_end:
                    # Once the chunk is complete, yield it and clear our buffering state.
                    yield b''.join(chunk_buffer)
                    chunk_buffer = []
            else:
                # This is our usual case.
                yield bytes(event.data)
        elif isinstance(event, (h11.EndOfMessage, h11.PAUSED)):
            break
```
Or perhaps like this...
```python
def _receive_response_body(self, request: Request) -> Iterator[bytes]:
    timeouts = request.extensions.get("timeout", {})
    timeout = timeouts.get("read", None)

    # We could always just yield `event.data` when we see h11.Data events,
    # but we're careful to ensure that if `Transfer-Encoding: chunked`
    # is being used, then we yield data on a chunk-by-chunk basis.
    chunk_buffering = False
    chunk_buffer = []
    while True:
        event = self._receive_event(timeout=timeout)
        if isinstance(event, h11.Data):
            if event.chunk_start and not event.chunk_end:
                # If we get an opened but not closed chunk then we
                # need to switch to buffering.
                chunk_buffering = True
            if chunk_buffering:
                # If we're buffering, then append the chunk.
                chunk_buffer.append(bytes(event.data))
                if event.chunk_end:
                    # Once the chunk is complete, yield it and clear our buffering state.
                    yield b''.join(chunk_buffer)
                    chunk_buffer = []
                    chunk_buffering = False
            else:
                # This is our usual case.
                yield bytes(event.data)
        elif isinstance(event, (h11.EndOfMessage, h11.PAUSED)):
            break
```
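The buffering logic can be exercised in isolation with a stand-in for `h11.Data` events. `FakeData` here is a hypothetical test double carrying the same `chunk_start` / `chunk_end` attributes, not part of h11 itself:

```python
from dataclasses import dataclass

@dataclass
class FakeData:
    # Mimics the attributes of h11.Data that the buffering logic reads.
    data: bytes
    chunk_start: bool = True
    chunk_end: bool = True

def rejoin_chunks(events):
    """Yield one bytes object per logical chunk, mirroring the
    buffering approach sketched above."""
    chunk_buffering = False
    chunk_buffer = []
    for event in events:
        if event.chunk_start and not event.chunk_end:
            # An opened-but-not-closed chunk: switch to buffering.
            chunk_buffering = True
        if chunk_buffering:
            chunk_buffer.append(bytes(event.data))
            if event.chunk_end:
                # Chunk complete: yield it and reset the buffering state.
                yield b''.join(chunk_buffer)
                chunk_buffer = []
                chunk_buffering = False
        else:
            # The usual case: a complete chunk in a single event.
            yield bytes(event.data)

# One 1024-byte chunk arriving split across two Data events,
# followed by a chunk that arrives whole.
events = [
    FakeData(b"x" * 1007, chunk_start=True, chunk_end=False),
    FakeData(b"x" * 17, chunk_start=False, chunk_end=True),
    FakeData(b"y" * 1024),
]
print([len(part) for part in rejoin_chunks(events)])  # [1024, 1024]
```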
I can observe the difference in behaviour with...
```python
import httpcore

url = "http://www.httpwatch.com/httpgallery/chunked/chunkedimage.aspx?0.729209319487347"
with httpcore.stream("GET", url) as response:
    for part in response.stream:
        print(len(part))
```
Without the buffering...
```
$ python ./example.py
1007   <-- This is a chunk split across two `h11.Data` events...
17     <-- ...and this is the remainder of the same chunk.
1024
1024
1024
...
1024
1024
885
```
With the buffering...
```
$ python ./example.py
1024   <-- The 1007 + 17 byte parts are joined into a single chunk here.
1024
1024
1024
1024
...
1024
1024
885
```
It's aesthetically pleasing to deal with this. Is it worth the fix, though? 🤷‍♂️
Okay. Changing to this behaviour would require a fix in h11, which has a bug in its chunked boundary API, as explained here... https://github.com/encode/httpcore/pull/595
That's pretty okay, actually, because the HTTP spec says "don't use this, don't rely on this". So... let's just not. But...
If anyone does care about this then... you have the pointer towards where a fix in h11 would be needed, and I'd be up for helping if required. If you want to actively work on that, let me know and we'll consider reopening this ticket. Otherwise, it's a no from me.