ion-java icon indicating copy to clipboard operation
ion-java copied to clipboard

Remove auto-detection of gzipped streams and byte[] from readers

Open raganhan opened this issue 6 years ago • 2 comments
trafficstars

Decompressing streams and byte[] is an orthogonal concern over parsing and shouldn't be part of ion-java. Supporting this feature has caused memory leaks, see https://github.com/amzn/ion-java/issues/198, and makes its public API more confusing than necessary.

raganhan avatar Jan 08 '19 18:01 raganhan

I disagree with the premise of this feature. Saying that we only auto-detect GZIP from some input sources and not others is more confusing than doing it everywhere. Removing it everywhere is a major backwards-incompatibility, and it's forcing customers to do work that's hard to get right (#198 as proof-by-example). As a historical point, at the time that I implemented auto-detection, it was by far the most requested feature from users of the library.

The read-from-byte[] #198 issue is in fact easily solved: check the array to see if its compressed, and if so, uncompress it entirely into another buffer, then read the results.

toddjonker avatar Jan 10 '19 20:01 toddjonker

Sorry my issue description might not have been clear. We are planning on completely removing this feature from ion-java and yes it would be a backward incompatible change.

We discussed internally what's the best course of action for this feature before deciding on removing it and creating the issue. We now realize that having internal discussions without properly describing the though process is bad as it locks out external contributors from participating, we'll try to be more mindful of this in the future.

I'm going to try to sum up what we discussed and the reasoning behind removing this feature.

The following APIs support the gzip auto-detect feature, ones marked with :fire: are susceptible to the memory leak in the current release (1.2.0):

  1. IonLoader#load(byte[]) :fire:
  2. IonLoader#load(InputStream) :fire:
  3. IonReaderBuilder#build(byte[])
  4. IonReaderBuilder#build(byte[], int, int)
  5. IonReaderBuilder#build(InputStream)
  6. IonSystem#iterate(InputStream) :fire:
  7. IonSystem#iterate(byte[]) :fire:
  8. IonSystem#singleValue(byte[] ionData) :fire:
  9. IonSystem#newReader(byte[] ionData)
  10. IonSystem#newReader(byte[] ionData, int offset, int len)
  11. IonSystem#newReader(InputStream ionData)

We can bucket those methods into 4 buckets:

  1. Methods that return an IonReader, or any Closeable resource are not susceptible to the memory leak because the user can close the returned reader closing any of the internal gzip auto-detect InputStream that are created.
  2. Methods that take in byte[] as parameters and are self contained it's easy to fix as any internal gzip auto-detect InputStream that needs to be created is fully utilized inside that method scope so we can close it before exiting the method.
  3. Methods that return an Iterator are not self contained and return resources that must be closed by the user to avoid memory leaks. We introduced a CloseableIterator to give the user this option. As you pointed in https://github.com/amzn/ion-java/issues/210 this may break binary compatibility. This was intended as an intermediate step preparing to remove the feature. If it turns out to break binary compatibility might as well remove the feature without the intermediate step.
  4. Methods that take in an InputStream as input and are self contained to fix it we have to close the internal gzip auto-detect InputStream which causes the InputStream passed as a parameter to be closed as well. Closing resources passed in as parameter is normally not a good idea even when fully consuming the stream in the method as the user could be passing in a stream that allows him to seek back.

buckets 3 and 4 have non standard behavior. Specially 3 where common idioms may still leave the resource open, which you pointed out in a comment elsewhere and I added an example bellow for context.

// iterator is hidden from user and is not closed 
for(IonValue v : ion.iterate(myBytes[])) {
    // ...
} 

To mitigate that we made the CloseableIterator close itself when hasNext() == false but that's also not ideal.

After considering this we came to the conclusion that handling gzip auto-detection, or any other compression for that matter, is better handled on user side. Doing so we can streamline ion-java apis to work on top of InputStreams with the byte[] alternatives as syntax sugar.

We can provide a auto-detecting InputStream wrapper for users, would be something similar to GzipOrRawInputStream preferably on a separate package, so users have a standard way of dealing with this. This would give users a fairly straightforward migration path and removes a feature from ion-java that is not directly related to parsing and handling Ion.

raganhan avatar Feb 05 '19 21:02 raganhan