gcsfs
gcsfs doesn't properly handle gzipped files, ignoring content-encoding
I have a file "foo.txt.gz" that has been uploaded with the following metadata:
Content-Type: text/plain
Content-Encoding: gzip
I'm trying to copy its contents to a new, uncompressed file in Cloud Storage to work around a bug where my tooling (gcloud) can't properly handle gzip input.
If I try to pass the compression flag on read, it complains about the file not being a gzip file, implying that transcoding is occurring:
>>> with fs.open('gcs://jw-sandbox/uploads.txt.gz', 'rb', compression='gzip') as read_file:
...     with fs.open('gcs://jw-sandbox/uploads.txt', 'wb') as write_file:
...         shutil.copyfileobj(read_file, write_file)
gzip.BadGzipFile: Not a gzipped file (b'gs')
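The `b'gs'` in the traceback suggests the bytes handed to the gzip layer are already plain text (they start with the file's contents, not the `\x1f\x8b` gzip magic). A minimal local sketch of that failure mode, assuming GCS has already decompressed the stream on download because of `Content-Encoding: gzip` (the sample contents are hypothetical):

```python
import gzip
import io

# Sketch of the failure mode, not gcsfs itself: assume the server has
# already decompressed the object (decompressive transcoding), so the
# client-side gzip wrapper sees plain text instead of the \x1f\x8b
# gzip magic bytes and raises BadGzipFile.
already_decompressed = io.BytesIO(b"gs://bucket/some-listed-object\n")
try:
    gzip.GzipFile(fileobj=already_decompressed).read()
except gzip.BadGzipFile as err:
    print(err)  # Not a gzipped file (b'gs')
```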
If I try to read the file without the compression flag and just dump contents to stdout, I only get the first N bytes of the decompressed contents where N is the compressed size:
>>> with fs.open('gcs://jw-sandbox/uploads.txt.gz', 'rb') as read_file:
...     for f in read_file:
...         print(f)
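A local sketch of the truncation symptom, under the assumption that the listing reports the compressed size while the bytes actually served (after transcoding) are the decompressed stream, so a size-bounded read stops early:

```python
import gzip
import io

# Hypothetical mechanism for the truncation: the reader trusts the
# object's reported size (the *compressed* byte count) while the server
# streams the *decompressed* bytes after transcoding.
data = b"2022-01-01 example log line\n" * 100   # decompressed contents
reported_size = len(gzip.compress(data))        # what the listing reports
served = io.BytesIO(data)                       # what the server sends
truncated = served.read(reported_size)          # only the first N bytes arrive
print(len(data), reported_size, len(truncated))
```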
>>> print(gcsfs.__version__)
2022.02.0
Is this a duplicate of #233?
Right, the linked issue suggests that the best thing to do is not to set Content-Encoding, which describes the transfer rather than the state of the remote file. The transfer is compressed anyway, which for a gzip file will make little difference either way. I believe this isn't how GCS should handle this, but there's nothing to be done about that.
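To illustrate the "little difference" point with a quick stdlib check (nothing gcsfs-specific): gzip-compressing data that is already gzipped yields essentially no further savings.

```python
import gzip

# Compressing already-gzipped bytes gains nothing; the second pass mostly
# adds header overhead, so a compressed transfer of a .gz object is a wash.
text = b"the same log line repeated many times\n" * 200
once = gzip.compress(text)    # big win over the raw text
twice = gzip.compress(once)   # no further win
print(len(text), len(once), len(twice))
```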
I wonder, what happens if you .cat() your file with start=10, end=20?
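For reference, a hypothetical stdlib-only illustration (no GCS involved) of why that ranged read is diagnostic: the same [10, 20) byte range is readable text against the decompressed stream but opaque payload against the compressed object, so whichever .cat() returns shows where the range was applied. The sample contents are made up.

```python
import gzip

# Hypothetical contents standing in for uploads.txt.gz; the point is only
# how the same byte range differs between the two representations.
text = b"The quick brown fox jumps over the lazy dog.\n" * 4
blob = gzip.compress(text)
print(text[10:20])  # readable: b'brown fox '
print(blob[10:20])  # opaque bytes from inside the gzip stream
```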