'Corrupt deflate stream' error on some websites (curl and Firefox are fine)

Open Shnatsel opened this issue 3 years ago • 16 comments

Some websites, such as hajime.us, fail to load using attohttpc: Io Error: corrupt deflate stream. They load fine using Firefox and the curl command-line tool.

Tested using this code. Test tool output from all affected websites: attohttpc-deflate-corrupt-stream.tar.gz

40 websites out of the top million from the Feb 3 Tranco list are affected.

I suspect this is an issue with the underlying DEFLATE implementation, but assistance in isolating the failure (e.g. dumping the DEFLATE stream so I could report a bug against miniz_oxide) would be appreciated.

Shnatsel avatar Feb 05 '21 15:02 Shnatsel

If I understand things correctly, you should be able to get the compressed response using allow_compression(false) while manually inserting the relevant header using header("Accept-Encoding", "gzip, deflate").

adamreichold avatar Feb 05 '21 20:02 adamreichold

I tried http://landolts.com and I get a corrupt deflate stream error with the rust_backend as well as the miniz-sys features of the flate2 crate. Using the zlib feature does not yield a decompression error but an unexpected EOF instead.

In all cases, the actual body seems to be decompressed completely. I therefore wonder whether the content length reported by the server is correct...

adamreichold avatar Feb 05 '21 20:02 adamreichold

Here is the body of the above request attached: body.gz

It came with the headers:

server: "nginx"
date: "Fri, 05 Feb 2021 20:28:22 GMT"
content-type: "text/html; charset=UTF-8"
content-length: "6068"
connection: "close"
x-powered-by: "PHP/7.0.0p1"
set-cookie: "PHPSESSID=ac27hsia4s1obmtvrk6jetrf40; path=/"
set-cookie: "mobile=false; path=/"
set-cookie: "user-agent=330cf4ec2a9149ebd093962feb701e34; path=/"
expires: "Mon, 26 Jul 1997 05:00:00 GMT"
cache-control: "no-store, no-cache, must-revalidate"
cache-control: "post-check=0, pre-check=0"
pragma: "no-cache"
last-modified: "Fri, 05 Feb 2021 20:28:22 GMT"
content-encoding: "gzip"
vary: "Accept-Encoding"

gzip does not seem to like it either:

> zcat body.gz
...
gzip: body.gz: unexpected end of file

but that also suggests that the unexpected EOF I got using the zlib feature is just its way of saying corrupt deflate stream...

adamreichold avatar Feb 05 '21 20:02 adamreichold

From reading into cURL's source, my initial guess would be that its handling of expected but ignored trailer bytes in https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L135 might make the difference...

adamreichold avatar Feb 05 '21 20:02 adamreichold

And cURL seems to ignore an error condition which I do not understand yet: https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L221

adamreichold avatar Feb 05 '21 20:02 adamreichold

I wonder if we could bypass this issue by simply sending Accept-Encoding: gzip instead of Accept-Encoding: gzip, deflate. In practice I think deflate is almost never used by websites. And chances are that gzip is supported if deflate also is. I think reqwest only supports gzip as well.

Worst case, gzip is not supported and the content is sent in plain.

sbstp avatar Feb 06 '21 15:02 sbstp

I think we might be running into this kind of error.

sbstp avatar Feb 06 '21 15:02 sbstp

I've also seen 20 "invalid gzip header" errors in the top 1M. Here's the data: invalid-gzip-header.tar.gz

That does sound like the issues with "deflate" encoding that the article talks about.
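For triaging bodies like the ones in these archives, a look at the first few bytes can hint at which container a server actually sent. The following is a minimal sketch, not code from attohttpc or flate2: `Container` and `sniff_container` are hypothetical names, and the checks assume the gzip magic bytes from RFC 1952 and the zlib header check from RFC 1950. It is only a heuristic, since a raw DEFLATE stream can happen to start with bytes that pass the zlib check.

```rust
/// Best-effort guess at the container format of a compressed HTTP body.
/// (Hypothetical helper type for illustration only.)
#[derive(Debug, PartialEq, Eq)]
enum Container {
    /// Starts with the gzip magic bytes and the deflate method byte.
    Gzip,
    /// First two bytes form a plausible zlib header.
    Zlib,
    /// Anything else: possibly raw DEFLATE, possibly not compressed at all.
    Unknown,
}

/// Inspects only the first bytes of `body`; it never decompresses anything.
fn sniff_container(body: &[u8]) -> Container {
    // RFC 1952: ID1 = 0x1f, ID2 = 0x8b, CM = 8 (deflate).
    if body.starts_with(&[0x1f, 0x8b, 0x08]) {
        return Container::Gzip;
    }
    if body.len() >= 2 {
        let (cmf, flg) = (body[0], body[1]);
        // RFC 1950: low nibble of CMF is the method (8 = deflate),
        // high nibble is the window size exponent (at most 7).
        let method_ok = cmf & 0x0f == 8;
        let window_ok = cmf >> 4 <= 7;
        // RFC 1950: CMF * 256 + FLG must be a multiple of 31.
        let check_ok = (u16::from(cmf) * 256 + u16::from(flg)) % 31 == 0;
        if method_ok && window_ok && check_ok {
            return Container::Zlib;
        }
    }
    Container::Unknown
}
```

A body served with content-encoding: gzip that comes back as `Unknown` or `Zlib` from this check would be consistent with an "invalid gzip header" error, though it does not prove what the server intended.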

Shnatsel avatar Feb 06 '21 15:02 Shnatsel

I wonder if we could bypass this issue by simply sending Accept-Encoding: gzip instead of Accept-Encoding: gzip, deflate.

What I do not yet understand is how this relates to my tests against http://landolts.com: The server is nginx, i.e. not a Microsoft implementation, and the headers indicate that the result is gzip-encoded. Do you think the header is incorrect and this is a deflate stream nonetheless?

The error message comes from https://github.com/rust-lang/flate2-rs/blob/90d9e5ed866742ce8b3946d156830e300d1e5aab/src/zio.rs#L152 and this code is generic w.r.t. gzip or deflate headers, so I don't think it refers to the actual format in use.

adamreichold avatar Feb 06 '21 18:02 adamreichold

I tried playing with the Accept-Encoding header that we send to landolts.com, and the error occurs if we have gzip in the accepted encodings, but not with deflate or identity. So it seems like their server configuration might be broken: the gzip they are sending is not really gzip.

sbstp avatar Feb 06 '21 20:02 sbstp

the gzip they are sending is not really gzip.

While I agree in principle, the observation that both cURL and Firefox are able to handle this suggests there are workarounds. In particular, even we and flate2 decompress basically everything and only fail at EOF. Judging from the cURL code, there is quite a bit of variability in how gzip is implemented in the wild.

adamreichold avatar Feb 06 '21 20:02 adamreichold

For what it's worth, I did a test with reqwest, and it seems like it also has this problem. It would be neat to get to the bottom of this and fix it across the ecosystem.

use reqwest::blocking::Client;

fn main() -> Result<(), reqwest::Error> {
    let client = Client::new();
    let req = client
        .get("http://landolts.com")
        .header("Accept-Encoding", "gzip")
        .build()?;
    println!("{:?}", req.headers());
    let resp = client.execute(req)?;
    println!("{}", resp.text()?);
    Ok(())
}
{"accept-encoding": "gzip"}
Error: reqwest::Error { kind: Decode, source: Custom { kind: UnexpectedEof, error: "unexpected end of file" } }

sbstp avatar Feb 06 '21 20:02 sbstp

I think we might be able to find some information in this Stack Overflow answer by Mark Adler.

sbstp avatar Feb 06 '21 21:02 sbstp

I have the same test code implemented for 4 clients and growing in https://github.com/Shnatsel/rust-http-clients-smoke-test, it might come in handy for comparing behavior between clients.

Shnatsel avatar Feb 06 '21 21:02 Shnatsel

My current guess is that flate2 expects the stream to end as described in https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L344, i.e. with a CRC and a size field, whereas cURL tries to read the trailer, but only errs if there is extra data, not if part of the trailer is missing: https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L135
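To make the "CRC and a size field" concrete: per RFC 1952, a gzip member ends with eight bytes, a CRC-32 of the uncompressed data followed by the uncompressed size modulo 2^32, both little-endian. The sketch below only shows how those fields would be read from the tail of a complete member. `GzipTrailer` and `read_trailer` are hypothetical names, and the function assumes the input really is one complete member; if the server cut the trailer off, the last eight bytes are DEFLATE data and the values are meaningless.

```rust
/// The two trailer fields of a gzip member (RFC 1952), both little-endian.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq, Eq)]
struct GzipTrailer {
    /// CRC-32 of the uncompressed data.
    crc32: u32,
    /// Length of the uncompressed data modulo 2^32 (ISIZE in the RFC).
    input_size: u32,
}

/// Interprets the last eight bytes of `member` as a gzip trailer.
/// Returns `None` if the input is too short to hold the fixed 10-byte
/// header plus the 8-byte trailer.
fn read_trailer(member: &[u8]) -> Option<GzipTrailer> {
    const HEADER_LEN: usize = 10;
    const TRAILER_LEN: usize = 8;
    if member.len() < HEADER_LEN + TRAILER_LEN {
        return None;
    }
    let t = &member[member.len() - TRAILER_LEN..];
    Some(GzipTrailer {
        crc32: u32::from_le_bytes([t[0], t[1], t[2], t[3]]),
        input_size: u32::from_le_bytes([t[4], t[5], t[6], t[7]]),
    })
}

/// Compares the trailer's size field with the number of bytes a decoder
/// actually produced; a mismatch hints at a missing or damaged trailer.
fn size_matches(trailer: &GzipTrailer, decompressed_len: u64) -> bool {
    u64::from(trailer.input_size) == decompressed_len % (1u64 << 32)
}
```

One way to use this on a saved body would be to decompress it leniently, count the output bytes, and check `size_matches`; a mismatch would support the guess that the trailer is missing rather than the payload being damaged.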

Admittedly, I am not very confident in my reading of the cURL code. But at least missing CRC and size information would explain why the body is completely decompressed and only then an error is raised. It would also make sense to e.g. give flate2 a flag that makes its processing more lenient w.r.t. this redundant information.
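Until such a flag exists, a caller could approximate the lenient behaviour on its own side. The following is a rough sketch using only the standard library: `LenientEof` is a hypothetical wrapper around any decoder implementing `Read`, and `FailsAtEnd` is a test stand-in for a decoder that yields the whole payload and then fails. It assumes the decoder reports the missing trailer as `ErrorKind::UnexpectedEof`; which kind a given flate2 backend actually reports is not confirmed here, and the thread shows both "corrupt deflate stream" and "unexpected end of file".

```rust
use std::io::{self, ErrorKind, Read};

/// Treats an `UnexpectedEof` error as a normal end of stream once at
/// least one byte has been produced. (Hypothetical wrapper, not a flate2 API.)
struct LenientEof<R> {
    inner: R,
    produced: u64,
}

impl<R: Read> LenientEof<R> {
    fn new(inner: R) -> Self {
        Self { inner, produced: 0 }
    }
}

impl<R: Read> Read for LenientEof<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        match self.inner.read(buf) {
            Ok(n) => {
                self.produced += n as u64;
                Ok(n)
            }
            // Swallow the error only if some output was already delivered.
            Err(e) if e.kind() == ErrorKind::UnexpectedEof && self.produced > 0 => Ok(0),
            Err(e) => Err(e),
        }
    }
}

/// Test stand-in for a decoder: yields `data`, then fails with `UnexpectedEof`.
struct FailsAtEnd<'a> {
    data: &'a [u8],
}

impl Read for FailsAtEnd<'_> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.data.is_empty() {
            return Err(io::Error::new(ErrorKind::UnexpectedEof, "unexpected end of file"));
        }
        // Reading from a byte slice advances the slice past the copied bytes.
        self.data.read(buf)
    }
}
```

The obvious downside is that this also hides genuine truncation of the payload, so it trades correctness checks for compatibility, which seems to be the same trade cURL makes.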

adamreichold avatar Feb 06 '21 21:02 adamreichold

Golang's http library has this issue as well. Looks like curl is one of the few places that figured it out.

sbstp avatar Feb 07 '21 05:02 sbstp