
Interoperability issues with broken gzip responses

Open majelbstoat opened this issue 4 years ago • 12 comments

If a server is broken and returns an invalid or missing content-length, colly reports "unexpected EOF", despite having received data.

Repro:

  1. Use colly to visit https://regexr.com/39p0t
  2. res.ContentLength is reported as -1.
  3. An EOF error is returned from httpBackend.Do, even though the body has all the content.

(This is an io.ErrUnexpectedEOF, not a regular io.EOF, which is why ioutil.ReadAll returns it as an error.)

I would expect colly to return the Response in this case.

majelbstoat avatar Jun 15 '20 12:06 majelbstoat
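
A minimal, self-contained sketch of the repro (assuming the gocolly/colly v2 import path; the comments restate what is reported above rather than anything new):

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	c.OnResponse(func(r *colly.Response) {
		// Per the report above, this is never reached for the URL below:
		// the body read fails inside httpBackend.Do first.
		fmt.Printf("OnResponse: %d bytes\n", len(r.Body))
	})

	c.OnError(func(r *colly.Response, err error) {
		// Reported as "unexpected EOF", even though data was received.
		fmt.Println("OnError:", err)
	})

	if err := c.Visit("https://regexr.com/39p0t"); err != nil {
		fmt.Println("Visit:", err)
	}
}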

This is not caused by a missing Content-Length. A missing Content-Length by itself is perfectly fine, since we have chunked transfer encoding and HTTP/2 framing to report EOF properly.

This is caused by a broken gzip-compressed response returned by the server:

curl -s https://regexr.com/39p0t --header 'Accept-Encoding: gzip' > /tmp/foo.gz; file /tmp/foo.gz
/tmp/foo.gz: gzip compressed data, from Unix, original size modulo 2^32 4294901760 gzip compressed data, reserved method, ASCII, has CRC, extra field, has comment, encrypted, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 4294901760

The server returns a gzip file with an original size of 0xffff0000 in its header, which is obviously wrong. Go's gzip decompressor keeps track of the advertised size and returns io.ErrUnexpectedEOF when the data ends too early.

I assume most HTTP clients ignore the gzip header, and that's why it works fine in browsers.

WGH- avatar Jun 15 '20 13:06 WGH-
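
To see the mechanism without colly or a network in the picture, here is a standalone sketch (not from the thread) that truncates a gzip stream and lets Go's compress/gzip report the error:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io/ioutil"
	"strings"
)

func main() {
	// Compress some data, then cut the stream short to simulate a server
	// that truncates the gzip-encoded body.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(strings.Repeat("hello gzip ", 1000)))
	zw.Close()
	truncated := buf.Bytes()[:buf.Len()/2]

	zr, err := gzip.NewReader(bytes.NewReader(truncated))
	if err != nil {
		fmt.Println("header:", err)
		return
	}
	b, err := ioutil.ReadAll(zr)
	// err here is io.ErrUnexpectedEOF, the same error surfaced through colly.
	fmt.Printf("read %d bytes, err: %v\n", len(b), err)
}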

Oh, huh. Well, I'm running a patched version with this PR applied, and it fixes it, though perhaps for the wrong reason?

I checked and it wasn't going into the gzip reader block to read the response, but perhaps that's something different.

majelbstoat avatar Jun 15 '20 13:06 majelbstoat

Well, it fixes it because it suppresses the error, no surprise here.

WGH- avatar Jun 15 '20 13:06 WGH-

OK, I've confirmed it's still broken with a vanilla client too, so fair enough.

client := &http.Client{}
request, _ := http.NewRequest("GET", "https://regexr.com/39p0t", nil)
response, err := client.Do(request)
if err != nil {
	log.Fatal(err)
}
defer response.Body.Close()
b, err := ioutil.ReadAll(response.Body)
if err != nil {
	fmt.Printf("%s\n", err.Error()) // prints "unexpected EOF"
}
fmt.Printf("read %d bytes anyway\n", len(b))

So, is there any kind of workaround? On balance, I'd rather have data when it's there, even if the server is doing the wrong thing 🤷🏼‍♂️

Adding

c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("Accept-Encoding", "gzip")
	// ... and/or
	r.Headers.Set("Accept", "gzip")
})

doesn't help. Setting ParseHTTPErrorResponse doesn't work, probably because no response is returned to parse.

I don't imagine I'm the first person to have encountered this, but I couldn't see any issues or documentation to help handle it.

majelbstoat avatar Jun 15 '20 14:06 majelbstoat
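
One possible workaround outside of colly, sketched with the standard library only (readTolerant is a made-up helper, not part of any API): read the body yourself and treat io.ErrUnexpectedEOF as "keep whatever arrived", which is roughly what curl turns out to do (see further down).

import (
	"errors"
	"io"
	"io/ioutil"
)

// readTolerant reads as much of the body as possible and keeps the partial
// data if the stream ends prematurely with io.ErrUnexpectedEOF.
func readTolerant(r io.Reader) ([]byte, error) {
	b, err := ioutil.ReadAll(r)
	if errors.Is(err, io.ErrUnexpectedEOF) {
		// Truncated (gzip) stream: use what was received.
		return b, nil
	}
	return b, err
}

This only helps when driving net/http directly; as described above, when the read happens inside colly's httpBackend.Do, the error comes back and no response is returned.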

I don't imagine I'm the first person to have encountered this, but I couldn't see any issues or documentation to help handle it.

I'd guess there aren't that many servers out there that return broken gzips, and people running mass crawls don't scrutinize their logs. I checked my own logs and, indeed, I've also come across such sites, but didn't know about it until now.

WGH- avatar Jun 15 '20 14:06 WGH-

Sorry, I was somewhat wrong: the problem is not in the gzip footer. The compressed deflate stream itself is truncated.

Here's the stack trace when the problem occurs (I patched a panic into compress/flate/inflate.go).

compress/flate.(*decompressor).moreBits(0xc0003bb300, 0xc000495d50, 0xc000495e20)
	/usr/lib/go/src/compress/flate/inflate.go:698 +0x93
compress/flate.(*decompressor).nextBlock(0xc0003bb300)
	/usr/lib/go/src/compress/flate/inflate.go:303 +0x36
compress/flate.(*decompressor).Read(0xc0003bb300, 0xc0001a6c00, 0x400, 0x400, 0x14, 0xbfb201a7047502fc, 0x39725978)
	/usr/lib/go/src/compress/flate/inflate.go:347 +0x77
compress/gzip.(*Reader).Read(0xc0000cc840, 0xc0001a6c00, 0x400, 0x400, 0x66b8a0, 0x1, 0xc0001a6c00)
	/usr/lib/go/src/compress/gzip/gunzip.go:251 +0x87
net/http.(*http2gzipReader).Read(0xc0004c6990, 0xc0001a6c00, 0x400, 0x400, 0x2, 0x0, 0x0)
	/usr/lib/go/src/net/http/h2_bundle.go:8917 +0x69

WGH- avatar Jun 15 '20 15:06 WGH-

If it were just a bad footer, gzip.ErrChecksum would be returned, which is easy to distinguish.

WGH- avatar Jun 15 '20 15:06 WGH-
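
A small fragment showing how a caller could tell the two failure modes apart (classify is a hypothetical helper):

import (
	"compress/gzip"
	"errors"
	"io"
)

// classify is a hypothetical helper distinguishing the two failure modes.
func classify(err error) string {
	switch {
	case errors.Is(err, gzip.ErrChecksum):
		return "bad gzip trailer (CRC32/length mismatch)"
	case errors.Is(err, io.ErrUnexpectedEOF):
		return "deflate stream cut short (this issue)"
	default:
		return "something else"
	}
}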

What still bothers me is that neither browsers nor curl (curl --compressed https://regexr.com/39p0t) have problems with the truncated stream (curl reports a successful exit code). Browsers might be the less interesting case, as IIRC they will always show the truncated content, but I need to understand why curl is OK with it before suggesting a concrete fix for colly.

WGH- avatar Jun 15 '20 15:06 WGH-

Yeah, I was wondering the same. Couldn't find anything interesting in curl's trace output, and I'm at the limits of my knowledge here, but I appreciate you taking the time to look at it.

majelbstoat avatar Jun 15 '20 15:06 majelbstoat

curl ignores the error here for some reason: https://github.com/curl/curl/blob/8df455479f8801bbebad8839fc96abbffa711603/lib/content_encoding.c#L222-L224

WGH- avatar Jun 15 '20 16:06 WGH-

Interesting issue. I think we should handle it similarly to how the major browsers/HTTP clients handle it.

asciimoo avatar Jun 15 '20 20:06 asciimoo

I've just encountered another interesting case:

$ curl <redacted> --header 'Accept-Encoding: gzip' -s | gunzip >/dev/null
gzip: stdin: decompression OK, trailing garbage ignored

$ curl <redacted> --compressed
...
        </body>
curl: (23) Failed writing received data to disk/application
</html>

Go returns gzip: invalid header on this site.

WGH- avatar Dec 23 '21 15:12 WGH-
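
For that last case, a standalone sketch (not from the thread): by default Go's gzip.Reader expects the body to be a sequence of gzip members, so trailing garbage is parsed as a bogus second member and decoding fails with gzip: invalid header; Reader.Multistream(false) stops after the first member instead, much like gunzip's "trailing garbage ignored".

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io/ioutil"
)

func main() {
	// One valid gzip member followed by trailing garbage, as in the case above.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte("<html>...</html>"))
	zw.Close()
	buf.WriteString("trailing garbage")

	// Default (multistream) mode: the garbage is treated as the header of a
	// second gzip member and decoding fails with "gzip: invalid header".
	zr, _ := gzip.NewReader(bytes.NewReader(buf.Bytes()))
	_, err := ioutil.ReadAll(zr)
	fmt.Println("multistream:", err)

	// Multistream(false): stop after the first member and ignore the rest.
	zr, _ = gzip.NewReader(bytes.NewReader(buf.Bytes()))
	zr.Multistream(false)
	b, err := ioutil.ReadAll(zr)
	fmt.Printf("single member: %q err=%v\n", b, err)
}

Note this would only apply when decoding the body yourself (e.g. with http.Transport's DisableCompression and an explicit Accept-Encoding: gzip header), since net/http's transparent decompression doesn't expose its gzip reader.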