django-storages icon indicating copy to clipboard operation
django-storages copied to clipboard

Uploading .gz files with Content-Type: application/gzip and no Content-Encoding

Open normangilmore opened this issue 4 years ago • 8 comments

This is regarding S3Boto3Storage.get_object_parameters which was added in 1.9, but since the next major version is breaking changes anyway, I have a use-case for revising it.

This is somewhat of a follow-up to the double compression issue #451 which was addressed in Django-Storages 1.9.

My use case is that I want to store files with the filenames like "hello.csv.gz" and I want them to be downloaded from S3 as archives, not decompressed by the user-agent. I am compressing them as I stream out the CSV, so they are already compressed when they reach Django-Storages.

To download as archives, one must set Content-Type: application/gzip and NOT set Content-Encoding.

What I have finally figured out is that mimetypes.guess_type recommends the wrong Content-Type and Content-Encoding for .gz, if you want to download it as an archive.

mimetypes.guess_type('hello.csv.gz') ('text/csv', 'gzip') mimetypes.guess_type('hello.gz') (None, 'gzip')

This is no good if guess_type is being used to set Content-Type and Content-Encoding headers, and you want the file to download as-is without the browser de-compressing it or modifying the extension.

To download as-is, as an archive, guess_type should work the same as it does for .zip

mimetypes.guess_type('hello.csv.zip') ('application/zip', None)

So for this use-case, .gz should return:

mimetypes.guess_type('hello.csv.gz') ('application/gzip', None)

It is actually easy to force mimetypes.guess_type to do the right thing in this case:

mimetypes.encodings_map {'.gz': 'gzip', '.Z': 'compress', '.bz2': 'bzip2', '.xz': 'xz'} del mimetypes.encodings_map['.gz'] mimetypes.guess_type('hello.csv.gz') ('application/gzip', None)

Clearly this would be a sketchy patch. I guess the Pythonic way would be to instantiate the MimeTypes class with your desired data. See also: https://www.iana.org/assignments/media-types/media-types.xhtml https://tools.ietf.org/html/rfc6713

n.b. The non-standard .gzip extension returns:

mimetypes.guess_type('hello.csv.gzip') (None, None)

DjangoStorages uses mimetypes.guess_type to set Content-Type and Content-Encoding headesr and we need to intercept this wrong guess_type if we want our .csv.gz files to download properly as archives.

Version 1.9.1 of DjangoStorages introduces a method get_object_parameters The method can be overridden to return parameters that are merged with the param dict, so you could replace Content-Type and Content-Encoding.

def get_object_parameters(self, name):
    return self.object_parameters.copy()

This is called by the private method _get_write_parameters like this: params.update(self.get_object_parameters(name))

Unfortunately, this method cannot REMOVE Content-Encoding from the param dict. So I have to override the private method _get_write_parameters.

class S3Boto3StorageFix(S3Boto3Storage):
    def _get_write_parameters(self, name, content=None):
        params = super(S3Boto3StorageFix, self)._get_write_parameters(name, content)
        if name.endswith('.gz'):
            params['ContentType'] = 'application/gzip'
            if 'ContentEncoding' in params:
                del params['ContentEncoding']
        return params

If it's not too late, I propose the API should be:

def get_object_parameters(self, name, params):
    return params.update(self.object_parameters.copy())

This would allow parameters to be removed as well as added and changed by overriding the public method.

normangilmore avatar Aug 18 '20 20:08 normangilmore

~So the current docstring is a lie so I think it's already a bug.~ Current docstring is fine.

I don't want to break this API and don't like passing in params. How about this instead:

    def _get_write_parameters(self, name, content=None):
        params = self.get_object_parameters(name)

        _type, encoding = mimetypes.guess_type(name)
        content_type = getattr(content, 'content_type', None)
        content_type = content_type or _type or self.default_content_type

        if 'ContentType' not in params:
            params['ContentType'] = content_type
        if 'ContentEncoding' not in params and encoding:
            params['ContentEncoding'] = encoding

        return params

More or less this is: if you set these things you are on your own.

jschneier avatar Aug 31 '20 03:08 jschneier

As a patch:

     def _get_write_parameters(self, name, content=None):
-        params = {}
+        params = self.get_object_parameters(name)
 
         _type, encoding = mimetypes.guess_type(name)
         content_type = getattr(content, 'content_type', None)
         content_type = content_type or _type or self.default_content_type
 
-        params['ContentType'] = content_type
-        if encoding:
+        if 'ContentType' not in params:
+            params['ContentType'] = content_type
+        if 'ContentEncoding' not in params and encoding:
             params['ContentEncoding'] = encoding
 
-        params.update(self.get_object_parameters(name))
         return params

jschneier avatar Aug 31 '20 03:08 jschneier

The idea here would be that you specify "ContentEncoding": None which definitely feels like a hack but also is not the end of the world.

jschneier avatar Aug 31 '20 04:08 jschneier

It does preserve the current public API for get_object_parameters.

My only concern is, as written, boto3 would receive a header ContentEncoding=None in its kwargs. Does boto3 do the right thing in that case and NOT set the Content-Encoding metadata to blank on S3? Seems like it would be undesirable if I went into the S3 console and it showed two metadata, a correct Content-Type and a blank Content-Encoding. And I could see boto3 perhaps deciding to throw an error on receiving None.

Maybe there should be an extra clause to clean up:

if 'ContentEncoding' in params:
    if params['ContentEncoding'] is None:
        del params['ContentEncoding']
elif encoding:
    params['ContentEncoding'] = encoding

I would suggest adding a comment in the get_object_parameters docstring that mentions you can remove Content-Encoding by setting it to None, because that does seem slightly obscure.

normangilmore avatar Sep 01 '20 23:09 normangilmore

@jschneier

From rfc7231:

... Content-Encoding is primarily used to allow a representation's data to be compressed without losing the identity of its underlying media type. ... If the media type includes an inherent encoding, such as a data format that is always compressed, then that encoding would not be restated in Content-Encoding ...

and, in slightly different words, from Mozilla:

... If the original media is encoded in some way (e.g. a zip file) then this information would not be included in the Content-Encoding header.

I think the question is: If a server does not adhere to the above, should it be considered broken, or should it just be considered irregular? If the former, then the present issue breaks our server.

dennisvang avatar Jul 20 '22 16:07 dennisvang

Another interesting part of rfc7231:

... Likewise, an origin server might choose to publish the same data as multiple representations that differ only in whether the coding is defined as part of Content-Type or Content-Encoding, since some user agents will behave differently in their handling of each response (e.g., open a "Save as ..." dialog instead of automatic decompression and rendering of content).

dennisvang avatar Jul 20 '22 16:07 dennisvang

The idea here would be that you specify "ContentEncoding": None which definitely feels like a hack but also is not the end of the world.

@jschneier setting "ContentEncoding": None would raise a validation error in the botocore.client:

botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid type for parameter ContentEncoding, value: None, type: <class 'NoneType'>, valid types: <class 'str'>

So that should probably be an empty string: "ContentEncoding": ""

This could be implemented as follows:

class MyS3Storage(S3Boto3Storage):
    def get_object_parameters(self, name):
        params = super().get_object_parameters(name=name)
        if str(name).endswith('.gz'):
            params['ContentType'] = 'application/gzip'
            params['ContentEncoding'] = ''
        return params

(as far as I know, this also works without the proposed patch)

However, the concerns raised by @normangilmore are justified:

  • This adds an empty Content-Encoding header to the S3 object metadata:

    Screenshot S3 bucket

  • The empty Content-Encoding is included in the response headers if we download the file from S3:

    Screenshot download response headers (Firefox)

dennisvang avatar Jul 21 '22 15:07 dennisvang

Is there an actual use-case where it is desirable that a file uploaded as e.g. .tar.gz ends up as .tar after download?

If not, I would simply override the guess from mimetypes. For example:

def _get_write_parameters(self, name, content=None):
    ...
    _type, encoding = mimetypes.guess_type(name)
    if str(name).endswith('.gz'):
        _type = 'application/gzip'
        encoding = None
    ...

dennisvang avatar Jul 21 '22 20:07 dennisvang

So, to be clear, as of django-storages 1.14, we can prevent the ContentEncoding header from being set, by do something like:

class MyS3Boto3Storage(storages.backends.s3boto3.S3Boto3Storage):
    ...

    def get_object_parameters(self, name):
        parameters = super().get_object_parameters(name=name)
        if name.endswith('.gz'):
            parameters['ContentType'] = 'application/gzip'
        return parameters

dennisvang avatar Apr 04 '24 12:04 dennisvang