jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8322256: Define and document GZIPInputStream concatenated stream semantics

Open archiecobbs opened this issue 11 months ago • 9 comments

GZIPInputStream supports reading data from multiple concatenated GZIP data streams since JDK-4691425. In order to do this, after a GZIP trailer frame is read, it attempts to read a GZIP header frame and, if successful, proceeds onward to decompress the new stream. If the attempt to decode a GZIP header frame fails, or happens to trigger an IOException, it just ignores the trailing garbage and/or the IOException and returns EOF.

There are several issues with this:

  1. The behaviors of (a) supporting concatenated streams and (b) ignoring trailing garbage are not documented, much less precisely specified.
  2. Ignoring trailing garbage is dubious because it could easily hide errors or other data corruption that an application would rather be notified about. Moreover, the API claims that a ZipException will be thrown when corrupt data is read, but obviously that doesn't happen in the trailing garbage scenario (so N concatenated streams where the last one has a corrupted header frame is indistinguishable from N-1 valid streams).
  3. There's no way to create a GZIPInputStream that does not support stream concatenation.

On the other hand, GZIPInputStream is an old class with lots of existing usage, so it's important to preserve the existing behavior, warts and all (note: my the definition of "existing behavior" here includes the bug fix in JDK-7036144).

So this patch adds a new constructor that takes two new parameters allowConcatenation and allowTrailingGarbage. The legacy behavior is enabled by setting both to true; otherwise, they do what they sound like. In particular, when allowTrailingGarbage is false, then the underlying input must contain exactly one (if allowConcatenation is false) or exactly N (if allowConcatenation is true) concatenated GZIP data streams, otherwise an exception is guaranteed.


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue
  • [ ] Change requires a CSR request matching fixVersion 23 to be approved (needs to be created)

Issue

  • JDK-8322256: Define and document GZIPInputStream concatenated stream semantics (Bug - P5)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18385/head:pull/18385
$ git checkout pull/18385

Update a local copy of the PR:
$ git checkout pull/18385
$ git pull https://git.openjdk.org/jdk.git pull/18385/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18385

View PR using the GUI difftool:
$ git pr show -t 18385

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18385.diff

Webrev

Link to Webrev Comment

archiecobbs avatar Mar 19 '24 21:03 archiecobbs

/csr

archiecobbs avatar Mar 19 '24 21:03 archiecobbs

:wave: Welcome back acobbs! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Mar 19 '24 21:03 bridgekeeper[bot]

❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.

openjdk[bot] avatar Mar 19 '24 21:03 openjdk[bot]

@archiecobbs has indicated that a compatibility and specification (CSR) request is needed for this pull request.

@archiecobbs please create a CSR request for issue JDK-8322256 with the correct fix version. This pull request cannot be integrated until the CSR request is approved.

openjdk[bot] avatar Mar 19 '24 21:03 openjdk[bot]

@archiecobbs The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Mar 19 '24 21:03 openjdk[bot]

@archiecobbs This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar May 13 '24 22:05 bridgekeeper[bot]

/pingbot

archiecobbs avatar May 14 '24 01:05 archiecobbs

@archiecobbs Unknown command pingbot - for a list of valid commands use /help.

openjdk[bot] avatar May 14 '24 01:05 openjdk[bot]

@archiecobbs this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8322256
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Jun 06 '24 16:06 openjdk[bot]

@archiecobbs This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Jul 04 '24 17:07 bridgekeeper[bot]

/backoffbot

archiecobbs avatar Jul 04 '24 18:07 archiecobbs

@archiecobbs Unknown command backoffbot - for a list of valid commands use /help.

openjdk[bot] avatar Jul 04 '24 18:07 openjdk[bot]

Hello Archie, sorry it has taken long to review this PR.

The proposal here is to have java.util.zip.GZIPInputStream specify (and improve) how it deals with concatenated GZIP stream(s) and also allow for applications instantiating GZIPInputStream to decide whether or not the underlying InputStream passed to the GZIPInputStream instance is expected to contain a concatenated GZIP stream(s).

I'll include here the text that I had sent in a private communication to Archie:

I think this comes down to introducing a new optional boolean parameter to the constructor of GZIPInputStream.

The boolean when "true" will attempt to read a potential additional GZIP stream after the previous GZIP stream's trailer has been read. In that case we need to handle 2 cases - one where we successfully find the header of the next (concatenated) GZIP stream and the other where we don't find a valid GZIP stream header. In the first case where we do find a header successfully, then we continue reading the stream as usual and return the uncompressed data from the "read()" call. In the case where we fail to find a valid header and yet there were bytes past the previous GZIP stream, then I think the GZIPInputStream.read() should throw an IOException, since that stream no longer represents a GZIP stream (remember, we are talking about this only when the GZIPInputStream was constructed with the new boolean parameter = true).

Coming to the case where the GZIPInputStream was constructed using boolean = false - In this case when we reach and read the trailer of a GZIP stream, if there is no more bytes, then we consider this a completed GZIP stream and return the uncompressed data. If there is any more bytes past the GZIP stream's trailer, then I think we should throw a IOException (since we aren't expecting any concatenated GZIP stream).

As for the default value of this optional boolean parameter, I think it should default to "true" implying it will read any concatenated GZIP streams. That would match the current implementation of GZIPInputStream.read() which has the ability to read (valid) GZIP concatenated input stream.

I think this would then allow us to keep the implementation simple as well as allow the calling application to have control over whether or not the passed should be considered as having a concatenated GZIP stream.

jaikiran avatar Jul 15 '24 10:07 jaikiran

I'm late to the table for this PR, but could we stop a moment to consider whether boolean constructor arguments is the best way to represent and compose optional parser configuration?

Boolean options by themselves can be tricky in that it requires knowing/understanding what the negative, positive and default is. Multiple constructor arguments of the same type makes it easy to mistake the order of the options while writing code. Reading the constructor with new GZIPInputStream(in, size, true, true), forces you to read the constructor documentation, figure out which of the boolean options is which, then what they represent. What if someone wants to introduce other options in the future, how will that scale?

I'm not sure what a better alternative would be, but I'm sure we must have prior art in parser configuration in the JDK?

Perhaps enums for the options, or composing constant int fields would be an improvement?

Again, not sure what the optimal API solution is here, just wanted to raise the concern :-)

eirbjo avatar Jul 19 '24 09:07 eirbjo

This PR is open, but notifications don't seem to reach the mailing lists. I experienced a similar situation a while back where I believe @jaikiran reached out to the Skara team to get that PR out of the silent state.

eirbjo avatar Jul 19 '24 10:07 eirbjo

First thank you for your efforts and patience with this PR Archie.

I am trying to better grasp the problem to be solved outside of better clarity surrounding how GZIPInputStream handles concatenated gzip files

  • I could see possibly providing an enum which indicates to stop decompressing after processing the first gzip file or to decompress until the end of input.
  • This would be similar to what the commons-compress GzipCompressorInputStream class does.
  • It is also what Eirik and Jai are suggesting I believe.
  • If support for processing multiple gzip files is enabled, which would be the default, then the existing behavior is not changed.
  • As I have mentioned previously, we should add more testing including taking a couple of concatenated examples created via the gzip tool (or equivalent) give how light our coverage is.

I don't believe we want to do more than is needed given the age of the API and there is been minimal requests for improvements at this time.

LanceAndersen avatar Jul 26 '24 13:07 LanceAndersen

Hi @LanceAndersen,

Thanks for taking a look. Let's achieve consensus on what the goal(s) should be here, and then I think a proper API for that will come naturally. Up until now the discussion has been a shifting blend of API design and API goals which makes things a little bit chaotic (probably my fault).

Proposed Goal №1:

This one seems relatively non-controversial: Allow creation of a GZIPInputStream that only decompresses one GZIP stream.

OK but what does that mean exactly?

  • Option A: The GZIPInputStream will never read past the end of the GZIP trailer frame
    • Sounds great, but how exactly would that work?
      • Option A.1: Caller must provide a BufferedInputStream, so we can back it up if any extra bytes have been read past the trailer frame
      • Option A.2: GZIPInputStream provides a way for caller to access the extra bytes read once EOF is detected
      • Option A.3: GZIPInputStream never buffers more than 8 bytes (size of trailer frame)
  • Option B: GZIPInputStream attempts to read the byte after the trailer frame and throws IOException if EOF is not returned

My opinion: All of the A options are unsatisfactory (although A.1 is the least bad), so it seems option B is the answer here.

Proposed Goal №2:

This one seems more controversial: Allow the caller to configure GZIPInputStream for "strict mode".

Just to be clear and restate what this is referring to. Currently, GZIPInputStream is hard-wired for "lenient mode". This means that it is indeterminate (a) how many additional bytes (if any) were read beyond the GZIP trailer frame, and (b) whether reading stopped due to EOF, invalid data, or an underlyingIOException.

My opinion: this is, at best, a random intermittent bug - and at worst a giant security hole - waiting to happen.

Reasoning: If the caller is OK with lenient mode, then the caller obviously doesn't (or shouldn't) care about the data that follows the input stream (if any), because the caller won't even know whether there was any, how much of it has been discarded (if any), or whether it was actually another valid GZIP stream that got corrupted or threw anIOException.

So why would any caller actually want to use this feature to concatenate anything useful after a GZIP trailer frame?

In particular: if an underlying IOException occurs at just the wrong time, an application that wrote GZIP-STREAM-1, GZIP-STREAM-2, GZIP-STREAM-3 could read back GZIP-STREAM-1, GZIP-STREAM-2 and never know that anything was wrong.

How useful is concatenated GZIP stream support if it's not even reliable??

This just seems completely wonky to me (and oh, did I mention it's a potential security hole waiting to happen? :)

It seems like if concatenation is going to be a supported feature, then it should be possible to access a reliable version of it. In other words, there should be a way for an application to say "I want to decode exactly ONE (or exactly N) whole GZIP streams or else guarantee me an exception".

Maybe the answer is simply to tell people "Don't use this class, it's not reliable". I'd be OK with that I guess but it seems a bit of a cop out.

archiecobbs avatar Jul 26 '24 15:07 archiecobbs

Hello Archie, I don't think we need any enum in the API. GZIPInputStream class should only allow parsing GZIP streams. The InputStream passed to the GZIPInputStream constructor must be a InputStream which has either 1 GZIP stream or N GZIP streams (concatenated). The user should be allowed to state whether the passed InputStream is expected contain more than 1 GZIP stream. This is the only addition that I think we should be introducing.

The way to allow users to state that the InputStream being passed is allowed to have more than 1 GZIP stream is through the new proposed boolean to the constructor. The boolean can be named "allowConcatenated".

When allowConcatenated is true, then during the read() operation on the GZIPInputStream, when we internally read a GZIP stream trailer, we should check if the underlying InputStream has a next GZIP stream header: - If there's no bytes present after the trailer, we just return whatever data we have inflated so far and that also would be the end-of-stream of the GZIPInputStream. - If there's bytes after the trailer, then those next bytes MUST be the next GZIP stream header. If those bytes aren't a GZIP stream header, then we throw an IOException. This implies whatever data we had deflated and any additional bytes that we had read after the trailer will not be returned to the user. That's fine because the underlying InputStream was "corrupt" - the GZIPInputStream shouldn't expect any other data other than GZIP stream(s) in the underlying InputStream.

When allowConcatenated is false, then during the read() operation on the GZIPInputStream, when we internally read a GZIP stream trailer, we should check if the underlying InputStream has any more bytes remaining: - If there are bytes remaining then we consider that an error and raise an IOException. We do this because we were instructed not to allow concatenated GZIP streams. So presence of any data after the first GZIP stream trailer should be an exception. - If there are no more bytes remaining then we have reached the end-of-stream and we return back whatever GZIP data we have inflated during the read operation and that also would be the end-of-stream of the GZIPStream instance.

The default value for allowConcatenated should be true to allow for supporting concatenated GZIP streams (like we do now). There however will be a change in behaviour if the stream content after the trailer isn't a GZIP stream. With this new proposed change we will throw an IOException (unlike previously). I think that's fine and in fact the correct thing to do. It of course will need a CSR and a release note to note this change.

jaikiran avatar Jul 26 '24 16:07 jaikiran

Hi @jaikiran,

There however will be a change in behaviour if the stream content after the trailer isn't a GZIP stream. With this new proposed change we will throw an IOException (unlike previously). I think that's fine and in fact the correct thing to do.

Ah! This simplifies everything. I was just assuming that preserving the existing behavior was a hard requirement - but if not, we can make this class "always precise", and the only thing left to configure is whether to allow 1 stream or N streams. I am in complete agreement with you now, especially because this eliminates the aforementioned security concern.

I'm assuming you and @LanceAndersen have communicated and he is OK with this as well. I will update the PR and CSR.

Thanks!

archiecobbs avatar Jul 26 '24 16:07 archiecobbs

With this new proposed change we will throw an IOException (unlike previously). I think that's fine and in fact the correct thing to do.

I've run into a problem.

With this change, GZIPInZip.java fails; the test is annotated like this:

/* @test
 * @bug 7021870 8023431 8026756
 * @summary Reading last gzip chain member must not close the input stream.
 *          Garbage following gzip entry must be ignored.
 */

Note the "Garbage following gzip entry must be ignored" line. This must be what bugs 8023431 8026756 refer to, but those bugs are not accessible ("You can't view this issue. It may have been deleted or you don't have permission to view it.") so I can't see what the problem was that required ignoring the extra garbage. Maybe you guys have access or know what this was about? Thanks. See also this stackoverflow question - it seems to be a thing with ZIP files.

archiecobbs avatar Jul 27 '24 02:07 archiecobbs

With this new proposed change we will throw an IOException (unlike previously). I think that's fine and in fact the correct thing to do.

I've run into a problem.

With this change, GZIPInZip.java fails; the test is annotated like this:

/* @test
 * @bug 7021870 8023431 8026756
 * @summary Reading last gzip chain member must not close the input stream.
 *          Garbage following gzip entry must be ignored.
 */

Note the "Garbage following gzip entry must be ignored" line. This must be what bugs 8023431 8026756 refer to, but those bugs are not accessible ("You can't view this issue. It may have been deleted or you don't have permission to view it.") so I can't see what the problem was that required ignoring the extra garbage. Maybe you guys have access or know what this was about? Thanks. See also this stackoverflow question - it seems to be a thing with ZIP files.

I don't know the full history, but JDK-4691425 added support for concatenated gzip files and also the code for ignoring extra trailing garbage bytes.

As part of the fix for JDK-8026756 a test was added for validating that trailing garbage bytes after the end header are ignored, but was not mentioned in the bug.

In gzip manual:

6 Using gzip on tapes

When writing compressed data to a tape, it is generally necessary to pad the output with zeroes up to a block > boundary. When the data is read and the whole block is passed to gunzip for decompression, gunzip detects that there is extra trailing garbage after the compressed data and emits a warning by default if the garbage contains nonzero bytes. You can use the --quiet option to suppress the warning.

Based on the above and the test that was added via JDK-8026756, the existing behavior for the handling of extra garbage bytes should be left as is.

So where does that leave us:

  • Keep the code as is and document the current behavior
  • Continue to add additional test coverage for the current API
  • We probably do not need a new constructor given it probably adds no new additional value the existing API

LanceAndersen avatar Jul 29 '24 13:07 LanceAndersen

So where does that leave us:

Keep the code as is and document the current behavior Continue to add additional test coverage for the current API We probably do not need a new constructor given it probably adds no new additional value the existing API

Just so I understand... are you saying that you are OK with this class being fundamentally unreliable and providing no way to avoid that unreliability? (Just to restate a previous example of this: if an underlying IOException occurs at just the wrong time, an application that wrote GZIP-STREAM-1, GZIP-STREAM-2, GZIP-STREAM-3 could read back GZIP-STREAM-1, GZIP-STREAM-2 and never know that anything was wrong.)

If that's not what you're saying, then I don't the understand "We probably do not need a new constructor" part.

archiecobbs avatar Jul 29 '24 13:07 archiecobbs

So where does that leave us: Keep the code as is and document the current behavior Continue to add additional test coverage for the current API We probably do not need a new constructor given it probably adds no new additional value the existing API

Just so I understand... are you saying that you are OK with this class being fundamentally unreliable and providing no way to avoid that unreliability? (Just to restate a previous example of this: if an underlying IOException occurs at just the wrong time, an application that wrote GZIP-STREAM-1, GZIP-STREAM-2, GZIP-STREAM-3 could read back GZIP-STREAM-1, GZIP-STREAM-2 and never know that anything was wrong.)

If that's not what you're saying, then I don't the understand "We probably do not need a new constructor" part.

What I am suggesting is before we move forward, we really need:

  • More test coverage
  • An understanding of what other gzip API and tools(commons-compress, gzip, python, etc...) might do (or do not do) with the problem you are trying to address

The Java ZIP APIs date back to the early days of Java (1.1/1.2) and changes(fixes) often can cause unexpected regressions so we have to thread lightly.

I am not opposed to making the API more robust, we really need more datapoint to better understand the risks/benefits in order to make an informed decision.

LanceAndersen avatar Jul 29 '24 14:07 LanceAndersen

... we really need more datapoint to better understand the risks/benefits in order to make an informed decision.

Agreed.

Here's some what I've come up with after a little bit of research:

First, we shouldn't confuse GZIP with ZIP files. They are only indirectly related:

  • The DEFLATE format is a single file compression format
  • The GZIP format is a wrapper around the DEFLATE format.
  • Individual entries in a ZIP archive is typically DEFLATE'd, but never GZIP'd

The "overlap" that is relevant here is simply when the file you are storing in a ZIP file happens to itself be a GZIP file e.g. foo.gz (so yes, you are likely compressing foo twice).

Specifically, the failing test GZIPInZip.java is doing this:

  1. Create a normal GZIP data file b.gz
  2. Create a normal ZIP file
  3. Write b.gz plus one extra byte into the ZIP file
  4. Close the ZIP file
  5. Open the ZIP file
  6. Find the InputStream corresponding to the entry for b.gz
  7. Read and decompress it with a GZIPInputStream
  8. Verify the extra byte was ignored

So this has nothing to do with the ZIP file format itself, rather it's related to the behavior of how other software might build ZIP files.

Nor does it directly relate to the GZIP format or how any GZIP decoder (taken in isolation) should behave.

So what I don't understand is what is the motivation for the behavior this test is verifying? It looks like, at best, a bug workaround for some unknown other broken software.

Here's the commit it was added:

commit 71f33254816a17ff741e0119e16db28181d7b43b
Author: Ivan Gerasimov <[email protected]>
Date:   Tue Oct 15 21:15:17 2013 +0400

    8023431: Test java/util/zip/GZIP/GZIPInZip.java failed
    
    Properly close PipedStreams. Additional testing for malformed input
    
    Reviewed-by: darcy, sherman

Did you double check the unviewable issues? What does JDK-8023431 say?

I also took a look at how Apache's commons-compress GzipCompressorInputStream behaves.

GzipCompressorInputStream is the commons-compress version of GZIPInputStream. From what I can infer, part of its original motivation was that pre-JDK7 GZIPInputStream didn't support concatenated streams. This class provides an optional decompressConcatenated boolean constructor parameter (default false).

So how does GzipCompressorInputStream behave...?

  • IOException's thrown by the underlying input stream are never suppressed; this includes IOException's thrown while reading "extra garbage" that may follow the first (if decompressConcatenated = false) or any (if decompressConcatenated = true) GZIP trailer frame.
  • When decompressConcatenated = false, after reading the first GZIP trailer frame, the underlying input is reset() if possible to the stream's boundary. If reset() is not supported, some indeterminate number of extra bytes will have been read (note, this exception is missing from the Javadoc).
  • When decompressConcatenated = true, the entire input stream is read, and it must contain only whole, valid GZIP streams with no trailing garbage.

Summary: GzipCompressorInputStream is never "lenient"; it either reads the whole stream (if decompressConcatenated = true) or else just the first GZIP stream and stops as soon as it can thereafter (exactly if markSupported(), othewise inexactly).

Side note: the reliance on mark()/reset() is a nice trick so to speak: if you can provide an input stream that supports it, you get precision in return, and this is accomplished without requiring any new API surface.

This brings up the question, how would commons-compress behave in the scenario being tested by the GZIPInZip.java test?

Answer: Let's assume you replace new GZIPInputStream(zis) with new GzipCompressorInputStream(zis) in that test. Then the test would succeed, because:

  • By default, GzipCompressorInputStream sets decompressConcatenated = false.
  • The extra byte might get read, but because decompressConcatenated = false decompression stops after the last byte of b.gz

Overall the API and behavior design of GzipCompressorInputStream seems like a good one. I would even recommend that we adopt it, except that it's incompatible with the current behavior (we assume decompressConcatenated = true by default).

So here's a stab at what we might do (opening debate): Basically, we replicate the GzipCompressorInputStream behavior with the exception that our allowConcatenated defaults to true to preserve existing behavior. In summary:

  • We utilize mark()/reset() to guarantee precise stopping behavior when allowConcatenated = false - but only if markSupported() returns true; otherwise, best effort/imprecise stop that may read past the GZIP end.
  • We eliminate all "lenient" behavior - e.g., underlying IOException's are never suppressed, and if concatenation is enabled, non-empty GZIP header frames must always be valid and the entire stream is consumed.
  • Update GZIPInZip.java to replace new GZIPInputStream(zis) with new GZIPInputStream(zis, false) to unbreak it

archiecobbs avatar Jul 29 '24 17:07 archiecobbs

Consider the following simple gzip test

gz % cat batman.txt 
I am batman
gz % cat hello.txt 
hello
 % cat robin.txt 
Robin Here
gz  % cat Joker.txt                          
I am the Joker

gz % gzip -c batman.txt hello.txt >> bats.gz
gz % gunzip -c bats                         
I am batman
hello

gz % cp bats.gz joker.gz                                           

gz % cat Joker.txt >> joker.gz                                           
gz % gunzip -c joker                                                     
I am batman
hello
gunzip: joker.gz: trailing garbage ignored

gz % cp joker.gz badRobin.gz                                             
gz % gzip -c robin.txt >> badRobin.gz                                             
gz % gunzip -c badRobin                                                           
I am batman
hello
gunzip: badRobin.gz: trailing garbage ignored

Here is the hexdump of the badRobin.gz

gz % hexdump -C badRobin.gz 
00000000  1f 8b 08 08 81 13 a9 66  00 03 62 61 74 6d 61 6e  |.......f..batman|
00000010  2e 74 78 74 00 f3 54 48  cc 55 48 4a 2c c9 4d cc  |.txt..TH.UHJ,.M.|
00000020  e3 02 00 f0 49 dd 72 0c  00 00 00 1f 8b 08 08 8f  |....I.r.........|
00000030  13 a9 66 00 03 68 65 6c  6c 6f 2e 74 78 74 00 cb  |..f..hello.txt..|
00000040  48 cd c9 c9 e7 02 00 20  30 3a 36 06 00 00 00 49  |H...... 0:6....I|
00000050  20 61 6d 20 74 68 65 20  4a 6f 6b 65 72 0a 1f 8b  | am the Joker...|
00000060  08 08 66 0f a9 66 00 03  72 6f 62 69 6e 2e 74 78  |..f..f..robin.tx|
00000070  74 00 0b ca 4f ca cc 53  f0 48 2d 4a e5 02 00 b3  |t...O..S.H-J....|
00000080  74 fc c1 0b 00 00 00                              |t......|
00000087

The following program will use GZIPInputStream and GzipCompressorinputStream to access badRobin.gz

package gzip;

import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.testng.annotations.Test;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

public class GZipInputStreamTest {
    @Test
    public void openWithGZipInputStream() throws IOException {
        String f = "gz/badRobin.gz";
        try (var is = new FileInputStream(f); var gis = new GZIPInputStream(is)) {
            var bytes = gis.readAllBytes();
            System.out.printf("contents: %s%n", new String(bytes));
        }
    }
    @Test
    public void openWithGzipCompressorInputStream() throws IOException {
        String f = "gz/badRobin.gz";;
        try (var is = new FileInputStream(f);
             var giz = new  GzipCompressorInputStream(is, true)) {
            var bytes = giz.readAllBytes();
            System.out.printf("contents: %s%n", new String(bytes));
        }
    }
}

The output from openWithGZipInputStream():

contents: I am batman
hello

The output from openWithGzipCompressorInputStream():

java.io.IOException: Garbage after a valid .gz stream

	at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.init(GzipCompressorInputStream.java:193)
	at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:356)
	at java.base/java.io.InputStream.readNBytes(InputStream.java:412)
	at java.base/java.io.InputStream.readAllBytes(InputStream.java:349)
	at gzip.GZipInputStreamTest.openWithGzipCompressorInputStream(GZipInputStreamTest.java:31)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
	at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:599)
	at org.testng.internal.TestInvoker.invokeTestMethod(TestInvoker.java:174)
	at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:46)
	at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:822)
	at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:147)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1597)
	at org.testng.TestRunner.privateRun(TestRunner.java:764)
	at org.testng.TestRunner.run(TestRunner.java:585)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
	at org.testng.SuiteRunner.run(SuiteRunner.java:286)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1218)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
	at org.testng.TestNG.runSuites(TestNG.java:1069)
	at org.testng.TestNG.run(TestNG.java:1037)
	at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:66)
	at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:105)


The behavior of GZIPInputStream is in-line with gzip and gunzip where it stops processing after encountering bad data after the 8 byte end trailer containing the CRC32 and ISIZE fields . GzipCompressorInputStream throws an IOException without returning any data.

WinZip on my Mac also opens badRobin.gz similar to gunzip, that is does not return an error.

Based on the above, I am reluctant to change the current behavior given it appears to have been modeled after gzip/gunzip as well as WinZip.

As far as providing a means to fail whenever there is bad data after the end trailer, I think we would need more input on the merits supporting this and how the behavior would be model if GZIPInputStream was enhanced.

Did you double check the unviewable issues? What does JDK-8023431 say?

It is just a fix for an internal intermittent test run failure. No additional details

LanceAndersen avatar Jul 30 '24 17:07 LanceAndersen

Based on the above, I am reluctant to change the current behavior given it appears to have been modeled after gzip/gunzip as well as WinZip.

That's a reasonable conclusion... and I have no problem with preserving the existing behavior if that's what we want. But in that case I would lobby that we should also provide some new way to configure a GZIPInputStream that guarantees reliable behavior.

The key question here is: "Exactly what current behavior of new GZIPInputStream(in) do we want to preserve?"

Let's start by assuming that we want your above test to pass. Putting that into words: "Given a single GZIP stream followed by trailing garbage, new GZIPInputStream(in) should successfully decode the GZIP stream and ignore the trailing garbage".

Note however that what new GZIPInputStream(in) currently provides is stronger than that:

  1. Trailing garbage is ignored
  2. Any IOException thrown while reading trailing garbage is ignored
  3. Concatenated streams are automatically decoded

So we know we want to preserve 1 - What about 2 and/or 3? Your thoughts?

My personal opinions:

  • I think 2 is inherently bad and it should not be implemented in any variant
  • I think 3 is not required by default, but one should be able to enable it somehow

If we were to accept those opinions (preserving only 1), then we would end up at the same place as GzipCompressorInputStream:

  • Underlying IOException's are never suppressed
  • new GZIPInputStream(in) decodes only one GIZP stream and ignores any trailing garbage
  • new GZIPInputStream(in, true) decodes concatenated streams; trailing garbage causes IOException

archiecobbs avatar Jul 30 '24 18:07 archiecobbs

Another (more conservative) possibility is to preserve both 1 ("Trailing garbage is ignored") and 3 ("Concatenated streams are automatically decoded") in the default configuration.

Then basically all we would be changing is no longer suppressing IOException's.

And then - as a separate remaining question - if we wanted to also provide more control there could be new constructor(s) to control concatenation and/or tolerance for trailing garbage.

(In all cases, I think using mark()/reset() (when available) to make trailing garbage detection precise is a good idea.)

archiecobbs avatar Jul 30 '24 19:07 archiecobbs

Another (more conservative) possibility is to preserve both 1 ("Trailing garbage is ignored") and 3 ("Concatenated streams are automatically decoded") in the default configuration.

Then basically all we would be changing is no longer suppressing IOException's.

And then - as a separate remaining question - if we wanted to also provide more control there could be new constructor(s) to control concatenation and/or tolerance for trailing garbage.

(In all cases, I think using mark()/reset() (when available) to make trailing garbage detection precise is a good idea.)

We don't want to change this long standing behavior as it has the potential of breaking existing applications and it is consistent with gzip and also winzip.

So through this PR, we should clarify the javadoc as to what current GZIPInputStream implementation does and add additional tests to expand the coverage

A separate discussion can take place to discuss the merits of whether there is perceived value in throwing an IOException when trailing garbage is encountered as well as any benefit of not supporting concatenated gzip files. It will also allow time for further review of other tools/apis that support gzip to see what they may or may not do.

LanceAndersen avatar Jul 31 '24 15:07 LanceAndersen

It seems like we're going around in circles a bit here... but maybe the circles are getting smaller? But I'm also running out of steam...

We don't want to change this long standing behavior as it has the potential of breaking existing applications and it is consistent with gzip and also winzip.

Sounds great. In fact my original plan was to not change the existing behavior at all (by putting all of the new/improved behavior behind a new constructor)...

So through this PR, we should clarify the javadoc as to what current GZIPInputStream implementation does and add additional tests to expand the coverage

Event just doing this has some subtlety to it... Note, none of the current behavior is actually specified (the entirety of the Javadoc says: "This class implements a stream filter for reading compressed data in the GZIP file format"), so any description we add creates a new actual specification for the future.

Are you sure we want to do that for all of the current behavior?

I would think maybe for some we do and for some we don't:

  • Swallowing IOException's - this we should not specify, so that we may someday eliminate it to make this class reliable
  • Concatenated GZIP's - this would be good to document
  • Reading extra bytes - the only thing to document would be that "there is a chance extra bytes will be read" which is not very useful, so what's the point?

What of that would you want to specify?

My original idea was to only add specification for the new/improved functionality, and leave the existing functionality. If we go this route, then the only thing to decide would be what new/improved functionality we want to add.

A separate discussion can take place to discuss the merits of whether there is perceived value in throwing an IOException when trailing garbage is encountered as well as any benefit of not supporting concatenated gzip files. It will also allow time for further review of other tools/apis that support gzip to see what they may or may not do.

Guess I thought we were sort-of having those discussions now...

What do we agree on?

  • Supporting concatenation is a good thing
  • Supporting ignoring trailing garbage is a good thing

What is there debate about?

  • Swallowing trailing garbage IOException's - I think it is a bad thing, but it's required for backward compatibility
  • Precise stops when markSupported() = true - I think it is a good thing (and it is backward compatible)
  • Support for reading only one GZIP stream - I think it is a good thing, especially useful when combined with previous

IMO doing nothing is a bad option, because as it stands today the class is fundamentally unreliable, and we should at minimum provide some new way to get reliable behavior (doing so doesn't require changing the existing behavior).

archiecobbs avatar Jul 31 '24 22:07 archiecobbs