tornado HTTP client gives incorrect Content-Length for automatically decompressed responses

The application below has two handlers, the first of which serves a gzipped tarball at /hello.tar.gz, and the second of which tries to proxy /hello2.tar.gz to the first. But the second actually raises tornado.httputil.HTTPOutputError: Tried to write more data than Content-Length.

The reason is that the Tornado HTTP client has decompressed the Content-Encoding: gzip without adjusting the Content-Length field, resulting in an invalid response. A workaround is setting decompress_response=False.

When decompress_response is on, Tornado should adjust or remove the Content-Length field of decompressed responses, similarly to how it already removes the Content-Encoding field.

I would also argue that decompress_response is unexpected behavior that should really be off by default, even if it’s fixed, as it changes the semantic meaning of the response: Content-Encoding is not Transfer-Encoding.

import tarfile
import io
import tornado.ioloop
import tornado.web
import tornado.httpclient

class TgzHandler(tornado.web.RequestHandler):
    def get(self):
        b = io.BytesIO()
        with tarfile.open(fileobj=b, mode="w:gz") as tar:
            contents = b"Hello, world!\n"
            info = tarfile.TarInfo("hello.txt")
            info.size = len(contents)
            tar.addfile(info, io.BytesIO(contents))
        self.set_header("Content-Type", "application/x-tar")
        self.set_header("Content-Encoding", "gzip")
        self.write(b.getvalue())

class ProxyHandler(tornado.web.RequestHandler):
    async def get(self):
        resp = await tornado.httpclient.AsyncHTTPClient().fetch(
            "http://localhost:8888/hello.tar.gz"
        )
        for k, v in resp.headers.get_all():
            self.add_header(k, v)
        self.write(resp.body)

app = tornado.web.Application(
    [(r"/hello.tar.gz", TgzHandler), (r"/hello2.tar.gz", ProxyHandler)]
)
app.listen(8888)
tornado.ioloop.IOLoop.current().start()

Sep 24 '19 22:09 andersk

Content-Encoding is not Transfer-Encoding.

That's true in principle, but that ship has sailed. Content-Encoding: gzip is used everywhere as if it's a transfer-encoding, and processing it transparently is more in line with (most) users' expectations.

We probably should rename the received Content-Length header (if any) the same way we rename the Content-Encoding header. But note that this is already implementation-dependent (curl_httpclient doesn't do it) and we don't remove Transfer-Encoding: chunked from the headers when processing it. If your TgzHandler used chunked encoding, the ProxyHandler would pass on the transfer-encoding header but write a non-chunked response, breaking everything (I think; haven't tried it). Proxy-style usage will always have to be careful about exactly which headers it passes through from the upstream request.

Sep 27 '19 14:09 bdarnell

If your TgzHandler used chunked encoding, the ProxyHandler would pass on the transfer-encoding header but write a non-chunked response, breaking everything (I think; haven't tried it).

That’s true. Perhaps RequestHandler should automatically clear Transfer-Encoding when writing a non-chunked response, for the same reason that it already automatically sets Transfer-Encoding: chunked when writing a chunked response.

But Content-Length shouldn’t, in general, just be ignored. If you imagine a streaming version of ProxyHandler, you certainly want it to pass on Content-Length if it’s available, and that is indeed what happens naturally:

class ProxyHandler(tornado.web.RequestHandler):
    def initialize(self):
        self.headers = None
        self.headers_added = False

    def header_callback(self, line):
        if self.headers is None:
            start_line = tornado.httputil.parse_response_start_line(line)
            self.set_status(start_line.code, start_line.reason)
            self.headers = tornado.httputil.HTTPHeaders()
        else:
            self.headers.parse_line(line)

    def add_headers(self):
        if not self.headers_added:
            for name, value in self.headers.get_all():
                self.clear_header(name)
            for name, value in self.headers.get_all():
                if name != "Transfer-Encoding":
                    self.add_header(name, value)
            self.headers_added = True

    def streaming_callback(self, chunk):
        self.add_headers()
        self.write(chunk)
        self.flush()

    async def get(self):
        await tornado.httpclient.AsyncHTTPClient().fetch(
            "http://localhost:8888/hello.tar.gz",
            header_callback=self.header_callback,
            streaming_callback=self.streaming_callback,
            # decompress_response=False,
        )
        self.add_headers()

This correctly passes on Content-Length if available and Transfer-Encoding: chunked if not—unless we were given an incorrect Content-Length because of decompress_response.

Sep 27 '19 22:09 andersk

Perhaps RequestHandler should automatically clear Transfer-Encoding when writing a non-chunked response, for the same reason that it already automatically sets Transfer-Encoding: chunked when writing a chunked response.

The idea is that applications should be free to implement transfer-encodings themselves, without Tornado's knowledge or interference. If you pass a Transfer-Encoding header (or a Content-Encoding or Content-Type for that matter), you're expected to produce a response that conforms to that encoding/type.

It seems a similar argument applies to Transfer-Encoding: we're not giving the application a body representation that includes the chunked framing, so when we decode chunked encoding we should remove that from the headers we deliver to the application (perhaps storing it in a renamed header so there's some record of the transformation happening).

Oct 21 '19 02:10 bdarnell

tornado tornado copied to clipboard

HTTP client gives incorrect Content-Length for automatically decompressed responses

tornado
tornado copied to clipboard