gcsfs doesn't properly handle gzipped files, ignoring content-encoding

Open jimmywan opened this issue 2 years ago • 4 comments

I have a file "foo.txt.gz" that has been uploaded with the following metadata:

Content-Type: text/plain
Content-Encoding: gzip

I'm trying to copy its contents to a new, uncompressed file in cloud storage to work around a bug where my tooling (gcloud) can't properly handle gzip input.

If I try to pass the compression flag on read, it complains about the file not being a gzip file, implying that transcoding is occurring:

>>> with fs.open('gcs://jw-sandbox/uploads.txt.gz', 'rb', compression='gzip') as read_file:
...     with fs.open('gcs://jw-sandbox/uploads.txt', 'wb') as write_file:
...             shutil.copyfileobj(read_file, write_file)

gzip.BadGzipFile: Not a gzipped file (b'gs')
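
For context, the two bytes in that message are whatever the gzip reader found where it expected the magic number `\x1f\x8b`. A minimal stdlib-only sketch, assuming the payload begins with plain text such as `gs://` (the sample bytes and the `try_gunzip` helper are made up for illustration), shows the same error when already-decompressed data is handed to a gzip reader:

```python
import gzip
import io

# Hypothetical payload: plain text that was already decompressed in transit.
plain = b"gs://example-bucket/object-1\n"


def try_gunzip(data: bytes) -> str:
    """Return the decompressed text, or the gzip error message on failure."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as fh:
            return fh.read().decode()
    except gzip.BadGzipFile as exc:
        return f"BadGzipFile: {exc}"


# Plain bytes: the first two bytes are not the gzip magic, so this fails.
print(try_gunzip(plain))
# A real gzip stream of the same bytes decompresses normally.
print(try_gunzip(gzip.compress(plain)))
```

If the real file starts with text like this, seeing `b'gs'` in the error is consistent with the bytes having been decompressed before gcsfs applied its own gzip layer.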

If I try to read the file without the compression flag and just dump contents to stdout, I only get the first N bytes of the decompressed contents where N is the compressed size:

>>> with fs.open('gcs://jw-sandbox/uploads.txt.gz', 'rb') as read_file:
...     for f in read_file:
...             print(f)
>>> print(gcsfs.__version__)
2022.02.0
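
One possible explanation for the truncation, stated as an assumption rather than a confirmed diagnosis: the client sizes its read from the stored object's length (the compressed size), while the server sends decompressed bytes. The stdlib-only sketch below simulates that mismatch with made-up data; it does not call gcsfs.

```python
import gzip

# Made-up content standing in for the real file.
original = b"gs://example-bucket/object\n" * 200
stored = gzip.compress(original)  # what would sit in the bucket
n = len(stored)                   # size the object metadata would report

# Assumption: the server decompresses on download, but the client stops
# after n bytes because that is the only size it knows about.
received = original[:n]

print(f"compressed size: {n}, full size: {len(original)}, received: {len(received)}")
```

Under that assumption the reader sees valid plain text that simply stops early, which matches the symptom described above.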

jimmywan avatar Mar 25 '22 20:03 jimmywan

Is this a duplicate of #233?

mhfrantz avatar Jul 21 '22 16:07 mhfrantz

Right, the linked issue suggests that the best thing to do is not to set content-encoding, which describes the transfer rather than the state of the remote file. The transfer is compressed anyway, which for a gzip file will make little difference either way. I believe this isn't how GCS should handle this, but there's nothing to be done about that.

martindurant avatar Jul 21 '22 18:07 martindurant

I wonder, what happens if you .cat() your file with start=10, end=20?
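
A sketch of that experiment, assuming the filesystem object exposes `cat_file(path, start=..., end=...)` as fsspec's `AbstractFileSystem` does. `probe_range` and `FakeFS` are hypothetical names; `FakeFS` is only an in-memory stand-in so the snippet runs without network access, and with a real bucket you would pass the gcsfs filesystem instead.

```python
def probe_range(fs, path, start=10, end=20):
    """Fetch bytes [start, end) of `path` through the filesystem object."""
    return fs.cat_file(path, start=start, end=end)


class FakeFS:
    """Hypothetical in-memory stand-in that serves stored bytes unchanged."""

    def __init__(self, blobs):
        self.blobs = blobs

    def cat_file(self, path, start=None, end=None):
        return self.blobs[path][start:end]


# Made-up object contents, used only to exercise the helper.
data = b"gs://example-bucket/object-1\n"
fs = FakeFS({"bucket/uploads.txt.gz": data})
chunk = probe_range(fs, "bucket/uploads.txt.gz")
print(chunk)
```

Against the real object, comparing the returned chunk with bytes 10 to 20 of the decompressed text would show whether the range is applied to the compressed or the decompressed stream.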

martindurant avatar Jul 22 '22 13:07 martindurant