
aiohttp decompresses 'Content-Encoding: gzip' responses

Open JustAnotherArchivist opened this issue 4 years ago • 17 comments

Long story (not so) short

aiohttp automatically decompresses responses with a Content-Encoding header. I believe this is incorrect – while decompression may of course be useful, it should not be on by default.

The Content-Encoding header is similar to the Transfer-Encoding one, and the two are often confused. The difference is that TE is about the data transfer, i.e. the encoding for a particular connection and client, whereas CE is about the data itself. CE exists to allow specifying this encoding without losing the MIME type of the actual data.

As an example, an ideal response for a .tar.gz file download would have a Content-Type: application/x-tar header and a Content-Encoding: gzip one (rather than e.g. Content-Type: application/x-gzip without Content-Encoding). When downloading such a file, one would still expect to get the .tar.gz file, not an uncompressed .tar. This is why I think that aiohttp should not automatically decompress on a Content-Encoding header.

Steps to reproduce

Make a request for a file that has been gzipped, and read from the ClientResponse object (e.g. to write it to a file).
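
For example, something along these lines (the URL is just a placeholder for any resource served with Content-Encoding: gzip):

import asyncio
import aiohttp

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get("http://example.org/f.tar.gz") as resp:
            body = await resp.read()  # already decompressed by aiohttp
            with open("f.tar.gz", "wb") as f:
                f.write(body)  # actually writes the bare .tar bytes

asyncio.run(main())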

Expected behaviour

Get the gzipped file.

Actual behaviour

Returns the decompressed file.

Your environment

All versions are affected (oldest checked: 2.3.10, but presumably much older than that).

Additional notes

curl also used to have this bug in their HTTP/2 implementation. I haven't searched in detail whether it existed in the HTTP/1.1 one as well, but current versions do not decompress such responses.

JustAnotherArchivist avatar Dec 20 '19 10:12 JustAnotherArchivist

see https://github.com/aio-libs/aiohttp/issues/1992

thehesiod avatar Feb 14 '20 00:02 thehesiod

I'm against, because it predates RFC. I don't see any real case where we should not decompress automatically. The Content-Encoding header is not actually related to the data contents; it is connected with the way data is transferred over HTTP.

HTTP servers that serve data that is supposed to be used in compressed form should not set Content-Encoding: gzip.

I'm closing the issue. Please reopen if you disagree.

socketpair avatar Feb 14 '20 13:02 socketpair

There is a real case: downloading a gzip file and wanting to store the gzip file as-is. I think this is already supported, though, if you follow the issue I linked to.

thehesiod avatar Feb 15 '20 06:02 thehesiod

@socketpair I'm not sure what you mean by "it predating RFC". Content-Encoding was specified in HTTP/1.1 in 1999 (here), and as explained in the original issue, it is not the same as Transfer-Encoding (here). The latter is for "the way data is transferred over HTTP", but Content-Encoding is about the actual data itself. Roy T. Fielding, one of the authors of RFC 2616, also clarified this a few times (e.g. here). Unfortunately, over the last two decades, the headers have been misused and their use misadvised countless times, which is what causes the current mess. Nevertheless, CE is for the data, TE is for the transfer, and the proper headers for a .tar.gz, which you would expect to be written to disk as a compressed file, would include CE. If you use aiohttp for such a download, you'd instead only get the .tar.

Please note that I'm not against having an option to decompress bodies with a compression CE since there absolutely are valid use cases for that. I'm arguing that this decompression should not happen by default.

@thehesiod Without having tested it, according to the comments in your PR #2110, that disables decoding of both content and transfer encoding. While I'm sure there are use cases for that, in my opinion, only content encoding should be disabled.

JustAnotherArchivist avatar Feb 17 '20 14:02 JustAnotherArchivist

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding

The Content-Encoding entity header is used to compress the media-type. When present, its value indicates which encodings were applied to the entity-body. It lets the client know how to decode in order to obtain the media-type referenced by the Content-Type header.

If someone wants to serve .tar.gz, he MUST NOT use "Content-Encoding: gzip".

For example: https://github.com/aio-libs/aiohttp/archive/master.tar.gz

$ curl -L -v -v --compressed https://github.com/aio-libs/aiohttp/archive/master.tar.gz -o/dev/null 2>&1 | egrep -i 'content-|encoding'
.......
> Accept-Encoding: deflate, gzip
< Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; sandbox
< Vary: Authorization,Accept-Encoding
< X-Content-Type-Options: nosniff
< Content-Type: application/x-gzip
< Content-Disposition: attachment; filename=aiohttp-master.tar.gz
< Content-Length: 833426

I would revert https://github.com/aio-libs/aiohttp/pull/2110 . I don't see any reason to have such an option. If the S3 server (or its settings) is wrong, why should we care?

For example, this is a hack: https://www.rightbrainnetworks.com/2013/03/21/serving-compressed-gzipped-static-files-from-amazon-s3-or-cloudfront/

Here you upload pre-gzipped contents, instruct S3 to add the header, and then expect a browser to decompress. It works; everything is OK.

But if you want to download the gzipped contents (you possibly need this for the S3 protocol), you have to force the server not to send the headers you set up before, because those headers are for end users, not for S3-aware programs.

S3-aware programs should use ?response-content-encoding=identity or so.

https://botocore.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html

socketpair avatar Feb 17 '20 15:02 socketpair

No, serving a .tar.gz with Content-Type: application/x-tar and Content-Encoding: gzip is perfectly valid. Here's a little minimal web server I wrote a couple months ago for testing purposes. Downloading http://127.0.0.1:8080/f.tar.gz with Firefox, wget, or curl correctly produces a compressed file (not a valid tar though because I was too lazy for that).
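
For reference, a minimal sketch of a server along those lines (not the linked Gist; just an illustration using Python's standard library):

import gzip
import io
import tarfile
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Build a tiny tar archive in memory, then gzip it.
        tar_buf = io.BytesIO()
        with tarfile.open(fileobj=tar_buf, mode="w") as tar:
            data = b"hello\n"
            info = tarfile.TarInfo(name="hello.txt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
        body = gzip.compress(tar_buf.getvalue())

        # The data itself is a tar file; gzip is a content coding on top.
        self.send_response(200)
        self.send_header("Content-Type", "application/x-tar")
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()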

From section 7.2.1 of RFC 2616 (emphasis mine):

Content-Type specifies the media type of the underlying data. Content-Encoding may be used to indicate any additional content codings applied to the data, usually for the purpose of data compression, that are a property of the requested resource.

Serving it with Content-Type: application/x-gzip, like GitHub does, is not a violation of the specification (nor would serving it as application/octet-stream be), but it means a loss of header information that the body is a tar file.

JustAnotherArchivist avatar Feb 17 '20 16:02 JustAnotherArchivist

@JustAnotherArchivist I've updated the comment, sorry for the long delay.

socketpair avatar Feb 17 '20 16:02 socketpair

I think that S3 usage is okay, actually. The browser decompresses it if the file is used in a <link rel="stylesheet"> tag because it needs the actual text/css data. This is also covered in the RFC in section 14.11: "Typically, the entity-body is stored with this encoding and is only decoded before rendering or analogous usage." The same thing would apply for compressed images embedded in a page. I added an example of this to the same Gist linked above. Requesting http://127.0.0.1:8080/index.html in Firefox brings up a page with a bright-green background, even though the CSS file is served with gzip CE. If you directly download style.css with wget or curl, however, you get the compressed file, as expected.

JustAnotherArchivist avatar Feb 17 '20 16:02 JustAnotherArchivist

Regarding curl, it has --compressed; this option makes it understand Content-Encoding the same way a browser does. The same goes for wget, with --compression=auto.

Well, we may argue back and forth. aiohttp has an auto_decompress option. If you are still talking about changing its default value, I'm still against that.
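
For illustration (reusing the f.tar.gz URL from the test server above), a client that opts out of the decompression might look like this:

import asyncio
import aiohttp

async def fetch_raw(url: str) -> bytes:
    # auto_decompress=False leaves the body exactly as sent on the wire,
    # i.e. still gzipped when the response has Content-Encoding: gzip.
    async with aiohttp.ClientSession(auto_decompress=False) as session:
        async with session.get(url) as resp:
            return await resp.read()

data = asyncio.run(fetch_raw("http://127.0.0.1:8080/f.tar.gz"))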

P.S.

The reason why I reopened the issue: I figured out that browsers (at least Firefox) DO NOT decompress such responses when downloading a URL (as opposed to rendering it).

@asvetlov I don't know what to do. Both points of view have pros and cons. The RFC is actually not clear enough.

socketpair avatar Feb 17 '20 16:02 socketpair

@socketpair I agree with the original bug reporter, he is correct. This is a bug. You should fix it.

This is a bug impacting Funtoo Linux due to http://cdn.postfix.johnriley.me/mirrors/postfix-release/official/postfix-3.6.1.tar.gz. This is a CDN serving a gzip-compressed tarball. The artifact being served over HTTP is "postfix-3.6.1.tar.gz". Relevant headers are:

Content-Encoding: gzip
Content-Length: 4748754
Content-Type: application/x-tar

aiohttp returns a 21 MB file, uncompressed. In our case, getting the exact binary data of the postfix-3.6.1.tar.gz file is essential -- gzip here is not just a transport feature (aka Transfer-Encoding) to reduce bandwidth. We are requesting the contents of "the gzip-compressed file", and we need it to match what is on the origin server byte for byte, because we must compute a SHA512 digest of this file. So we need the original compressed version -- we can't recompress it ourselves, because our gzip might use different settings, produce a different binary, and thus yield a different SHA512.
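
To see why recompressing locally cannot reproduce the digest, a quick sketch:

import gzip
import hashlib

payload = b"example tar contents " * 1024

# Two gzip streams of the same data made with different settings are
# different byte sequences, hence different SHA512 digests.
a = gzip.compress(payload, compresslevel=9, mtime=0)
b = gzip.compress(payload, compresslevel=1, mtime=0)
print(hashlib.sha512(a).hexdigest() == hashlib.sha512(b).hexdigest())  # False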

By default, you should not transparently decompress based on Content-Encoding. CE is for data, TE is for transfer. End of story. If you need to support this, add a feature for auto-decompression of content-encoded gzip; it could be useful in some cases, but it should not be the default behavior.

We must respect the distinct meaning of CE and TE, and the concepts of "content" and "transfer". When I download, I want the "content".

Upstream bug: https://bugs.funtoo.org/browse/FL-8477

danielrobbins avatar Jun 15 '21 16:06 danielrobbins

Looking into this more, it seems the reality of how HTTP servers handle Content-Encoding does not always match what we would hope, or align with the HTTP specs.

A workaround for our specific issue was to specify "Accept-Encoding: identity" in the request, which told the CDN to send the file as-is, without any Content-Encoding (it is already compressed on disk, so this is fine), thus avoiding the issue.
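
A sketch of that workaround with aiohttp (overriding the Accept-Encoding header aiohttp sends by default):

import asyncio
import aiohttp

async def download(url: str, dest: str) -> None:
    # "identity" asks the server to apply no content coding at all,
    # so the bytes arrive exactly as stored on the origin.
    headers = {"Accept-Encoding": "identity"}
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as resp:
            with open(dest, "wb") as f:
                f.write(await resp.read())

asyncio.run(download(
    "http://cdn.postfix.johnriley.me/mirrors/postfix-release/official/postfix-3.6.1.tar.gz",
    "postfix-3.6.1.tar.gz",
))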

danielrobbins avatar Jun 21 '21 06:06 danielrobbins

I've used the following workaround on a server that wrongly guesses the content encoding via mimetypes.guess_type():

from aiohttp import web

app = web.Application()

async def on_response_prepare(request, response):
    # Drop a Content-Encoding header that was set even though the
    # body was never actually compressed.
    if not response.compression:
        response.headers.popall("Content-Encoding", None)

app.on_response_prepare.append(on_response_prepare)

wodny avatar Sep 16 '21 12:09 wodny

I think we cannot change our defaults, but we should provide an option to disable the decoding on demand. What I missed is: is the question about the client, the server, or both? The client already has the auto_decompress option; it would be nice to accept this flag in client.request() as well. Regarding the server, we need some adjustments, for example supporting a request.auto_decompress property and request.clone(auto_decompress=False).
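
A rough sketch of how that proposed server-side API might be used in a handler (note that request.auto_decompress and the clone() flag are proposals here, not existing APIs):

from aiohttp import web

async def handler(request: web.Request) -> web.Response:
    # Proposed API, does not exist yet: re-read the body without
    # transparent decompression, keeping the wire bytes intact.
    raw = request.clone(auto_decompress=False)
    compressed = await raw.read()
    return web.Response(text=f"received {len(compressed)} compressed bytes")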

asvetlov avatar Oct 30 '21 16:10 asvetlov

@wodny from my understanding, we are talking about decompressing incoming data. But in your snippet, you are tweaking the outgoing response.

asvetlov avatar Oct 30 '21 16:10 asvetlov

@wodny from my understanding, we are talking about decompressing incoming data. But in your snippet, you are tweaking the outgoing response.

I probably should have elaborated more on that. My setup was symmetric, i.e. aiohttp on both the server and the client. Instead of modifying all the clients, I modified the server (as a workaround), which affected all the clients. If the server were built with another library, an analogous solution would probably exist.

wodny avatar Oct 30 '21 17:10 wodny

There is seemingly a ton of ambiguity even in the spec (RFC 2616) itself.

Transfer-Encoding

The Transfer-Encoding general-header field indicates what (if any) type of transformation has been applied to the message body in order to safely transfer it between the sender and the recipient. This differs from the content-coding in that the transfer-coding is a property of the message, not of the entity.

....

The Internet Assigned Numbers Authority (IANA) acts as a registry for transfer-coding value tokens. Initially, the registry contains the following tokens: "chunked" (section 3.6.1), "identity" (section 3.6.2), "gzip" (section 3.5), "compress" (section 3.5), and "deflate" (section 3.5).

New transfer-coding value tokens SHOULD be registered in the same way as new content-coding value tokens (section 3.5).

A server which receives an entity-body with a transfer-coding it does not understand SHOULD return 501 (Unimplemented), and close the connection. A server MUST NOT send transfer-codings to an HTTP/1.0 client.

https://datatracker.ietf.org/doc/html/rfc2616#section-14.41

OK, so transfer-codings and content-codings are different, independent things, except that those links in the second half there go to the content-codings section.

14.3 Accept-Encoding

The Accept-Encoding request-header field is similar to Accept, but restricts the content-codings (section 3.5) that are acceptable in the response.

https://datatracker.ietf.org/doc/html/rfc2616#section-14.3

So if transfer codings are not the same as content codings, Accept-Encoding is used for negotiating the content coding, and no equivalent exists for negotiating the transfer coding... then Transfer-Encoding is basically only useful for structural aspects of the message which you can assume are supported on either side based on the HTTP protocol version in use, because there is no negotiation mechanism. And since the spec rarely changes, the small set of allowable Transfer-Encoding values never changes.

So something like Transfer-Encoding: chunked makes sense, but using it for transparent compression very much does not, because there is no way to negotiate. And it's unclear why gzip would be necessary for safe transfer between the sender and the recipient in any case.

And HTTP/2 kills off Transfer-Encoding completely: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding https://www.rfc-editor.org/rfc/rfc7540#section-8.1.2.2

So as a practical matter nobody ever supported Transfer-Encoding according to the spec, and as a theoretical matter it's not all that difficult to see why. In any case, all practical resources, the MDN docs in particular, use Content-Encoding this way.

dralley avatar Apr 18 '23 20:04 dralley

IMO, the problem and solution as written in the OP here are not correct. The OP makes it seem like a client problem, but there is nothing wrong with decompressing a representation by default. And as pointed out, there is the auto_decompress flag to reverse that.

However, I do think it points to a change that should be made to the server. While there is understandably great freedom in the specs as to how a request should translate to a response, the response is supposed to be some "representation" of the request.

Currently, if the request is for a tarball, e.g. /example.tar.gz, the server returns a Content-Type of application/x-tar with a gzip Content-Encoding. In other words, it behaves the same as if /example.tar were the request. Since the type is supposed to refer to a representation of the request, this is not correct. It should instead be application/gzip with no encoding (as no additional encoding was applied to the representation). Most clients, including browsers, would then just download the tarball as is without decompression as expected.
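
Spelled out as response headers, the two cases for GET /example.tar.gz would look like this (illustrative, not captured from a real server):

# What the server currently sends (a representation of /example.tar):
Content-Type: application/x-tar
Content-Encoding: gzip

# What it should send (no additional coding applied to the representation):
Content-Type: application/gzip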

Note this is exactly the response you get when downloading a repo tarball from GitHub for example. As another data point, this is also the behavior in django (see https://github.com/django/django/pull/16642).

steverep avatar May 13 '24 16:05 steverep