proposal: compress/gzip: return different error for trailing garbage
Hello, Regarding #47809 , would you consider returning a different err ? "gzip: invalid header" can be quite misleading when the error is actually because of trailing garbage.
CC @dsnet
According to RFC 1952, section 2.2:
A gzip file consists of a series of "members" (compressed data sets).
Thus, when a gzip member terminates, the parser should correctly check whether there is a header for another gzip member. An "invalid header" error seems to be the right error given this grammar.
In addition to what @dsnet said, I think it's possible to use the current API to figure out whether there might be trailing data. If z is a gzip.Reader, you can call z.Multistream(false) to have the reader read a single member at a time. Then:
- If you know that your data always consists of a single data stream, you can read it to the end. If there's still more data available to read after that, you know that there is trailing junk.
- If your data may consist of multiple concatenated streams, then it isn't really possible to disambiguate corrupted (say, truncated) data from trailing junk. But you could still read as many data streams as possible and then, once you hit an error, decide whether you think you've got trailing junk or a corrupted chunk, perhaps based on what the decompressed data looks like or something.
(The general ambiguity described in the second bullet is just what @dsnet pointed out; it's why the compress/gzip package isn't in a position to know whether there is trailing junk or a different kind of corruption. But you may be able to tell based on what your data looks like.)
Thanks for the details, that is making sense and I don't want to complicate the code for nothing. I must say I would still love to have a distinguished error that would allow a developper to quickly be able to make the difference between an error regarding the first gzip header and the attempt at reading another gzip member from a multistream gzip.
I imagine the value of the error could be changed to reflect that just after we manage to read the first member.
https://github.com/golang/go/blob/24f83ed4e29495d5b8b6375aeaa2d34d14629c7d/src/compress/gzip/gunzip.go#L273
What are your thoughts?
Edit: Alternatively, if multistream is set to true, the error could be read as something such as "Invalid header in multi-stream gzip"
According to RFC 1952, section 2.2:
A gzip file consists of a series of "members" (compressed data sets).
Thus, when a gzip member terminates, the parser should correctly check whether there is a header for another gzip member. An "invalid header" error seems to be the right error given this grammar.
@dsnet While the error is technically correct, I agree with @nikaiw that a having a distinct error would provide tremendous clarity. Something like invalid header in multistream gzip would make the cause much clearer, and make the solution much easier to find.
To quote my rationale from #67986[^1]:
When using the
compress/gzippackage to decompress gzip files, receiving agzip: invalid headererror can indicate two distinct possibilities.
The file is not a valid gzip file.
Example test case (click to expand)
import ( "compress/gzip" "io" "os" "testing" ) func TestNotAValidGzipFile(t *testing.T) { f, err := os.Open("/tmp/python-3.12.3-amd64.exe") // Arbitrary example, any non-gzip file will suffice. if err != nil { t.Fatal(err) } defer f.Close() rc, err := gzip.NewReader(f) if err != nil { t.Fatal(err) } defer rc.Close() _, err = io.ReadAll(rc) if err != nil { t.Fatal(err) } } // === RUN TestNotAValidGzipFile // gzip_test.go:19: gzip: invalid headerThe file is a valid gzip file, but contains invalid trailing data.
Example test case (click to expand)
func TestValidGzipFileWithTrailingData(t *testing.T) { // Reproducer file. There are many examples of this. // https://github.com/udacity/self-driving-car-sim/blob/4b1f739ebda9ed4920fe895ee3677bd4ccb79218/Assets/Standard%20Assets/Environment/SpeedTree/Conifer/Conifer_Desktop.spm f, err := os.Open("/tmp/Conifer_Desktop.spm") if err != nil { t.Fatal(err) } defer f.Close() rc, err := gzip.NewReader(f) if err != nil { t.Fatal(err) } defer rc.Close() _, err = io.ReadAll(rc) if err != nil { t.Fatal(err) } } // === RUN TestValidGzipFileWithTrailingData // gzip_test.go:19: gzip: invalid headerScenario 2 can be especially confusing because Go's implementation of
compress/gziprejects invalid trailing data, while many popular applications and languages do not. Hence, the ambiguity of this error can lead people to believe that Go is rejecting a valid gzip file.
[^1]: GitHub's search failed me when looking for an existing issue. I'm guessing I should close that in favour of this?