hadoop-connectors
Turn off decompressive decoding of gzip?
#66 mentions the issue of data loss with the GCS Hadoop connector when it is used by Spark.
Decompressive transcoding is an obstacle to my data pipeline.
I'm currently using Flink to stream file data out of GCS. Unfortunately, my files are annotated with the metadata Content-Encoding: gzip and Cache-Control: no-transform, and this seems to be stopping Flink from reading these files.
The GCS library decodes the gzipped data, but the Flink reader expects gzip format because the files have a .gz extension. There's currently no way for me to modify the metadata or names of the files I'm dealing with.
I've recompiled the hadoop-connector and set the relevant headers to turn off decompressive transcoding in createDataRequest of GoogleCloudStorageReadChannel. This doesn't work, though.
My pipeline works fine if I manually drop the Content-Encoding metadata on the object in GCS, but that doesn't fit my pipeline, which determines which new files to stream based on their modification time.
I'm thinking my only option here is to write custom HadoopInputs to bypass the gzip decoding. Regardless, it'd be good to make the transcoding configurable in the GCS client APIs.
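Roughly, the kind of bypass I have in mind is to sidestep the extension-based codec selection and read the (already decoded) bytes directly. Sketched here with the plain Hadoop FileSystem API and made-up bucket/path names, before wrapping anything into a Flink input format:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawGcsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical object path; the GCS connector resolves the gs:// scheme.
    Path file = new Path("gs://my-bucket/incoming/events-0001.gz");
    FileSystem fs = file.getFileSystem(conf);

    // Read the stream as-is; no CompressionCodecFactory lookup happens here,
    // so the .gz extension does not trigger a second gunzip attempt.
    try (FSDataInputStream in = fs.open(file);
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```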
Which GCS connector version are you using and compiling manually?
You may try changing the fs.gs.inputstream.support.gzip.encoding.enable property to true/false and check what happens, but it probably will not solve your issue.
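For reference, the property can be set programmatically on a Hadoop Configuration as well as in core-site.xml; a minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;

public class GzipEncodingFlag {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Property discussed above; toggle it and observe the read behavior.
    conf.setBoolean("fs.gs.inputstream.support.gzip.encoding.enable", true);
    System.out.println(conf.get("fs.gs.inputstream.support.gzip.encoding.enable"));
  }
}
```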
The root cause of the issue is that the Google HTTP Client library that the GCS connector uses to execute GCS API requests automatically decompresses gzip-encoded files based on the Content-Encoding HTTP header.
You can disable this behavior by setting responseReturnRawInputStream to false on the HttpRequest object that the GCS connector constructs.
I'm using the latest version of this repository.
I tried modifying the Hadoop property fs.gs.inputstream.support.gzip.encoding.enable, and it needs to be true; if it's false and there's a .gz extension, file reads aren't even attempted.
I compile this repository from the root with mvn clean package -P hadoop2, take the gcs-connector-hadoop2-2.2.4-shaded.jar, and use it as the connector.
I made your changes by calling getObject.setReturnRawInputStream(false) inside of com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel:1105.
It's the only place in the connector code I found where I could access the fields that get pushed into the HttpRequest. It looks like com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl should pick up the com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel changes and then get used by com.google.cloud.hadoop.fs.gcs.CoopLockFsckRunner, which actually runs the operations.
This still does not work. It looks like the com.google.api.services.storage.Get object created inside the gcsio read channel automatically tries to initialize a data download and can't be told to return the raw response InputStream.
I think that you need to call getObject.setReturnRawInputStream(true), not getObject.setReturnRawInputStream(false), in GoogleCloudStorageReadChannel.
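Roughly, in the place you patched it would look like this. This is a simplified sketch, not the connector's actual code; the class and field names (ReadChannelSketch, gcs, bucketName, objectName) are placeholders:

```java
import com.google.api.services.storage.Storage;

// Simplified stand-in for the request construction in
// GoogleCloudStorageReadChannel#createDataRequest; names are placeholders.
class ReadChannelSketch {
  private final Storage gcs;
  private final String bucketName;
  private final String objectName;

  ReadChannelSketch(Storage gcs, String bucketName, String objectName) {
    this.gcs = gcs;
    this.bucketName = bucketName;
    this.objectName = objectName;
  }

  Storage.Objects.Get createDataRequest() throws java.io.IOException {
    Storage.Objects.Get getObject = gcs.objects().get(bucketName, objectName);
    // true asks the client library for the raw (still gzip-encoded) stream,
    // i.e. it skips the automatic decompression described above.
    getObject.setReturnRawInputStream(true);
    return getObject;
  }
}
```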
@medb Is there any way to handle this issue other than making the code changes and then building the shaded fat jar? Also, the code changes mentioned above don't work.