Turn off decompressive decoding of gzip?

Open goaaron opened this issue 3 years ago • 3 comments

#66 mentions the issue of data loss when the GCS Hadoop connector is used by Spark.

Decompressive transcoding is an obstacle to my data pipeline.

I'm currently using Flink to stream file data out of GCS. Unfortunately, my files are annotated with the metadata Content-Encoding:gzip and Cache-Control:no-transform, which seems to stop Flink from reading them.

The GCS library decodes the gzipped data, but the Flink reader expects gzip format because the files have a .gz extension. I currently have no way to modify the metadata or names of the files I'm dealing with.

I've recompiled the hadoop-connector and set relevant headers to turn off decompressive transcoding in createDataRequest of GoogleCloudStorageReadChannel.

This doesn't work though.

My pipeline works fine if I manually drop the Content-Encoding metadata on the object in GCS, but that doesn't fit my pipeline, which decides which new files to stream based on their modification time.
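
For reference, this is roughly what that manual workaround looks like with the google-cloud-storage Java client (an untested sketch; the bucket and object names are placeholders, and note that the metadata update itself bumps the object's update time):

```java
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class DropContentEncoding {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    // Placeholder bucket/object names.
    Blob blob = storage.get("my-bucket", "path/to/file.gz");
    // Clearing Content-Encoding stops GCS from decompressive transcoding on
    // download, so readers see the raw .gz bytes again.
    storage.update(blob.toBuilder().setContentEncoding(null).build());
  }
}
```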

I'm thinking my only option here is to write custom HadoopInputs to bypass the gzip decoding. Regardless, it would be good to make the transcoding behavior configurable in the GCS client APIs.

goaaron avatar Dec 24 '21 04:12 goaaron

Which GCS connector version do you use and compile manually?

You may try changing the fs.gs.inputstream.support.gzip.encoding.enable property to true/false and see what happens, but it probably will not solve your issue.
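
For example, programmatically (an illustrative sketch; the same key can also be set in core-site.xml or through the engine's Hadoop configuration):

```java
import org.apache.hadoop.conf.Configuration;

public class GzipFlagExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Toggle the connector's gzip-encoding support flag and see how reads behave.
    conf.setBoolean("fs.gs.inputstream.support.gzip.encoding.enable", true);
    System.out.println(conf.get("fs.gs.inputstream.support.gzip.encoding.enable"));
  }
}
```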

The root cause of the issue is that the Google HTTP Client library, which the GCS connector uses to execute GCS API requests, automatically decompresses GZIP-encoded files based on the encoding in the HTTP header.

You can disable this behavior by setting responseReturnRawInputStream to false on the HttpRequest object that the GCS connector constructs.
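
For illustration, the flag lives on the google-http-client request (a sketch, not connector code, assuming a google-http-client version that exposes setResponseReturnRawInputStream; an HttpRequestInitializer is just one place such a flag could be applied):

```java
import java.io.IOException;

import com.google.api.client.http.HttpRequest;
import com.google.api.client.http.HttpRequestInitializer;

public class RawStreamInitializer implements HttpRequestInitializer {
  private final boolean returnRawStream;

  public RawStreamInitializer(boolean returnRawStream) {
    this.returnRawStream = returnRawStream;
  }

  @Override
  public void initialize(HttpRequest request) throws IOException {
    // When this flag is true, the HTTP client hands back the response stream as-is
    // (no automatic gunzip of Content-Encoding: gzip responses); when false, it
    // decompresses before giving the stream to the caller.
    request.setResponseReturnRawInputStream(returnRawStream);
  }
}
```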

medb avatar Dec 24 '21 04:12 medb

I'm using the latest version of this repository.

I tried modifying the Hadoop property fs.gs.inputstream.support.gzip.encoding.enable, and it needs to be true or gzip reads aren't even attempted: if it's false and there's a .gz extension, the file read won't happen at all.

I compile this repository from the root with mvn clean package -P hadoop2, take the gcs-connector-hadoop2-2.2.4-shaded.jar, and use it as the connector.

I made your changes by calling getObject.setReturnRawInputStream(false) inside of com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel:1105.

It's the only place in the connector code I found where I could access the fields that get pushed into the HttpRequest. It looks like com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl should pick up the com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel changes and then get used in com.google.cloud.hadoop.fs.gcs.CoopLockFsckRunner, which actually runs the operations.

This still does not work. It looks like the API for the com.google.api.services.storage.Get object that is created inside the gcsio read channel automatically tries to initialize a data download and can't be told to return the rawResponseInputStream.

goaaron avatar Jan 04 '22 22:01 goaaron

I think you need to add a getObject.setReturnRawInputStream(true) call, not getObject.setReturnRawInputStream(false), in GoogleCloudStorageReadChannel.
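
In other words, roughly this (a sketch mirroring the change discussed for createDataRequest; the gcs/bucketName/objectName parameters are placeholders and how the request is actually built there is elided):

```java
import java.io.IOException;

import com.google.api.services.storage.Storage;

public class RawStreamGet {
  // Sketch: builds an objects.get request that asks for the raw (still
  // gzip-compressed) response stream instead of a transparently decompressed one,
  // so .gz files reach downstream readers intact.
  static Storage.Objects.Get rawGet(Storage gcs, String bucketName, String objectName)
      throws IOException {
    Storage.Objects.Get getObject = gcs.objects().get(bucketName, objectName);
    getObject.setReturnRawInputStream(true);
    return getObject;
  }
}
```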

medb avatar Jan 05 '22 02:01 medb

@medb Is there any other way to handle this issue besides making the code changes and then building the shaded fat jar? Also, the code changes mentioned above don't work.

Hiteshitaneja avatar Oct 03 '23 18:10 Hiteshitaneja