gcsfs: Set content encoding (e.g. gzip) when writing a file

When writing files, it's helpful to be able to specify the Content-Encoding in addition to the Content-Type. I didn't see a way I could pass that option through. Otherwise, another call to setxattrs is needed.

Current:

import gzip
import gcsfs

fs = gcsfs.GCSFileSystem()
path = 'gs://some_bucket/some_blob.gz'
with fs.open(path, mode='wb', content_type='application/gzip') as o:
    with gzip.open(o, mode='wb') as o2:
        o2.write(b'Hello world')
# A second call is needed to set Content-Encoding after the upload
fs.setxattrs(path, content_encoding='gzip', content_type='text/plain')

Proposed:

path = 'gs://some_bucket/some_blob.gz'
with fs.open(path, mode='wb', content_type='text/plain', content_encoding='gzip') as o:
    with gzip.open(o, mode='wb') as o2:
        o2.write(b'Hello world')

Nov 11 '20 22:11 isaacbrodsky

Sure, this sounds like a reasonable request and ought to be easy enough to implement. Are you interested?

However, beware that the transport layer also has a concept of compression encoding ("transfer-encoding") that may cause automatic decompression of downloads without passing through gzip, and possibly mismatches in the actual file size and/or byte ranges. Indeed, if you don't do your own compression step as in your example, compression may have been happening anyway (but the uncompressed file would be stored). Please test this! The stored file size should not be the same as the number of data bytes, and reading the file should require explicitly passing it through gzip.
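A minimal sketch of the check described above, using only the standard library. The helper name `stored_as_gzip` is hypothetical, and the in-memory `stored` bytes are a stand-in for whatever the bucket actually returns (for a real test you would presumably fetch them with something like `fs.cat(path)` and compare against the size reported by `fs.info(path)`).

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def stored_as_gzip(stored: bytes, payload: bytes) -> bool:
    """Return True if `stored` is a gzip stream that decompresses to `payload`.

    `stored` is assumed to be the raw bytes as held in the bucket,
    with no transparent decompression applied on download.
    """
    if stored[:2] != GZIP_MAGIC:
        # The object came back already decompressed (or was never compressed).
        return False
    return gzip.decompress(stored) == payload


payload = b"Hello world" * 100
stored = gzip.compress(payload)  # stand-in for bytes read back from storage

print(len(payload), len(stored))        # the two sizes should differ
print(stored_as_gzip(stored, payload))  # compressed object: needs gzip to read
print(stored_as_gzip(payload, payload)) # uncompressed object: fails the check
```

If the bytes read back already equal the original payload, something decompressed them along the way, which is the situation the warning above is about.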

By the way, fsspec supports handling the compression on the fly when reading:

with fsspec.open("gcs://...", compression='gzip') as f:
    f.read(...)

Nov 17 '20 14:11 martindurant

@martindurant I hit the same issue and thought it was implemented as part of open(). Instead, one needs to patch the file once it is closed.

I understand transfer-encoding might get in the way, but honestly I would just follow Google's guidance at https://cloud.google.com/storage/docs/transcoding and assume the user knows the proper use of content_encoding.

Mar 24 '21 15:03 yan-hic

@yiga2, you probably understand the proper workflow better than I do.

There is generally a difference conceptually between storing compressed, but of type octet-stream, or storing compressed with type gzip, or storing the original uncompressed and using compression only for the transfer (which would be handled by the comm layer). I don't know how to unscramble these.

Mar 24 '21 16:03 martindurant

No need to unscramble. It is no different from someone uploading a gzip file but incorrectly marking it (through content_type) as JSON or text. Hence I agree with the OP and would just pass whatever value is given to the content_encoding metadata when writing.

gcsfs does not currently validate the content_type value, nor does it check whether the file is indeed compressed if the declared type is gzip. The same would apply to content_encoding.

For reading, that arg should be ignored so no automatic decompressing - gcsfs is not a browser and we should just return the content as stored.

Mar 24 '21 16:03 yan-hic

@isaacbrodsky I was able to achieve this through fixed_key_metadata. Here is an example:

with fs.open(path, 'wb', content_type='text/plain', fixed_key_metadata={'content_encoding': 'gzip'}) as o:

I think this issue may be closed, don't you?

Jan 31 '23 20:01 juliowerner
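
For reference, the fixed_key_metadata workaround can be combined with the gzip step from the original post roughly as follows. This is a sketch, not a tested recipe: `write_gzip_text` is a hypothetical helper name, `fs` is assumed to be a gcsfs filesystem object, and the `fixed_key_metadata` keyword is taken from the comment above rather than verified against a particular gcsfs version.

```python
import gzip


def write_gzip_text(fs, path, payload: bytes) -> None:
    """Write `payload` gzip-compressed and tag the object in a single open() call.

    `fs` is assumed to be a gcsfs.GCSFileSystem (or anything with a compatible
    open()); the metadata keywords are forwarded unchanged.
    """
    with fs.open(
        path,
        "wb",
        content_type="text/plain",
        fixed_key_metadata={"content_encoding": "gzip"},
    ) as raw:
        # gzip.open accepts an already-open binary file object and
        # leaves closing it to the outer `with` block.
        with gzip.open(raw, mode="wb") as gz:
            gz.write(payload)
```

Usage would then look like write_gzip_text(fs, 'gs://some_bucket/some_blob.gz', b'Hello world'), with no follow-up setxattrs call.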