go icon indicating copy to clipboard operation
go copied to clipboard

proposal: compress/gzip: return different error for trailing garbage

Open nikaiw opened this issue 2 years ago • 5 comments

Hello, Regarding #47809 , would you consider returning a different err ? "gzip: invalid header" can be quite misleading when the error is actually because of trailing garbage.

nikaiw avatar Aug 07 '23 03:08 nikaiw

CC @dsnet

ianlancetaylor avatar Aug 07 '23 05:08 ianlancetaylor

According to RFC 1952, section 2.2:

A gzip file consists of a series of "members" (compressed data sets).

Thus, when a gzip member terminates, the parser should correctly check whether there is a header for another gzip member. An "invalid header" error seems to be the right error given this grammar.

dsnet avatar Aug 07 '23 22:08 dsnet

In addition to what @dsnet said, I think it's possible to use the current API to figure out whether there might be trailing data. If z is a gzip.Reader, you can call z.Multistream(false) to have the reader read a single member at a time. Then:

  • If you know that your data always consists of a single data stream, you can read it to the end. If there's still more data available to read after that, you know that there is trailing junk.
  • If your data may consist of multiple concatenated streams, then it isn't really possible to disambiguate corrupted (say, truncated) data from trailing junk. But you could still read as many data streams as possible and then, once you hit an error, decide whether you think you've got trailing junk or a corrupted chunk, perhaps based on what the decompressed data looks like or something.

(The general ambiguity described in the second bullet is just what @dsnet pointed out; it's why the compress/gzip package isn't in a position to know whether there is trailing junk or a different kind of corruption. But you may be able to tell based on what your data looks like.)

cespare avatar Aug 07 '23 22:08 cespare

Thanks for the details, that is making sense and I don't want to complicate the code for nothing. I must say I would still love to have a distinguished error that would allow a developper to quickly be able to make the difference between an error regarding the first gzip header and the attempt at reading another gzip member from a multistream gzip.

I imagine the value of the error could be changed to reflect that just after we manage to read the first member.

https://github.com/golang/go/blob/24f83ed4e29495d5b8b6375aeaa2d34d14629c7d/src/compress/gzip/gunzip.go#L273

What are your thoughts?

Edit: Alternatively, if multistream is set to true, the error could be read as something such as "Invalid header in multi-stream gzip"

nikaiw avatar Aug 08 '23 12:08 nikaiw

According to RFC 1952, section 2.2:

A gzip file consists of a series of "members" (compressed data sets).

Thus, when a gzip member terminates, the parser should correctly check whether there is a header for another gzip member. An "invalid header" error seems to be the right error given this grammar.

@dsnet While the error is technically correct, I agree with @nikaiw that a having a distinct error would provide tremendous clarity. Something like invalid header in multistream gzip would make the cause much clearer, and make the solution much easier to find.

To quote my rationale from #67986[^1]:

When using the compress/gzip package to decompress gzip files, receiving a gzip: invalid header error can indicate two distinct possibilities.

  1. The file is not a valid gzip file.

    Example test case (click to expand)
    import (
      "compress/gzip"
      "io"
      "os"
      "testing"
    )
    
    func TestNotAValidGzipFile(t *testing.T) {
      f, err := os.Open("/tmp/python-3.12.3-amd64.exe") // Arbitrary example, any non-gzip file will suffice.
      if err != nil {
        t.Fatal(err)
      }
      defer f.Close()
    
      rc, err := gzip.NewReader(f)
      if err != nil {
        t.Fatal(err)
      }
      defer rc.Close()
    
      _, err = io.ReadAll(rc)
      if err != nil {
        t.Fatal(err)
      }
    }
    
    // === RUN   TestNotAValidGzipFile
    //  gzip_test.go:19: gzip: invalid header
    
  2. The file is a valid gzip file, but contains invalid trailing data.

    Example test case (click to expand)
    func TestValidGzipFileWithTrailingData(t *testing.T) {
      // Reproducer file. There are many examples of this.
      // https://github.com/udacity/self-driving-car-sim/blob/4b1f739ebda9ed4920fe895ee3677bd4ccb79218/Assets/Standard%20Assets/Environment/SpeedTree/Conifer/Conifer_Desktop.spm
      f, err := os.Open("/tmp/Conifer_Desktop.spm")
      if err != nil {
        t.Fatal(err)
      }
      defer f.Close()
    
      rc, err := gzip.NewReader(f)
      if err != nil {
        t.Fatal(err)
      }
      defer rc.Close()
    
      _, err = io.ReadAll(rc)
      if err != nil {
        t.Fatal(err)
      }
    }
    
    // === RUN   TestValidGzipFileWithTrailingData
    //  gzip_test.go:19: gzip: invalid header
    

Scenario 2 can be especially confusing because Go's implementation of compress/gzip rejects invalid trailing data, while many popular applications and languages do not. Hence, the ambiguity of this error can lead people to believe that Go is rejecting a valid gzip file.

[^1]: GitHub's search failed me when looking for an existing issue. I'm guessing I should close that in favour of this?

rgmz avatar Jun 14 '24 12:06 rgmz