django-storages
django-storages copied to clipboard
Uploading .gz files with Content-Type: application/gzip and no Content-Encoding
This is regarding S3Boto3Storage.get_object_parameters
which was added in 1.9, but since the next major version is breaking changes anyway, I have a use-case for revising it.
This is somewhat of a follow-up to the double compression issue #451 which was addressed in Django-Storages 1.9.
My use case is that I want to store files with the filenames like "hello.csv.gz" and I want them to be downloaded from S3 as archives, not decompressed by the user-agent. I am compressing them as I stream out the CSV, so they are already compressed when they reach Django-Storages.
To download as archives, one must set Content-Type: application/gzip and NOT set Content-Encoding.
What I have finally figured out is that mimetypes.guess_type recommends the wrong Content-Type and Content-Encoding for .gz, if you want to download it as an archive.
mimetypes.guess_type('hello.csv.gz') ('text/csv', 'gzip') mimetypes.guess_type('hello.gz') (None, 'gzip')
This is no good if guess_type is being used to set Content-Type and Content-Encoding headers, and you want the file to download as-is without the browser de-compressing it or modifying the extension.
To download as-is, as an archive, guess_type should work the same as it does for .zip
mimetypes.guess_type('hello.csv.zip') ('application/zip', None)
So for this use-case, .gz should return:
mimetypes.guess_type('hello.csv.gz') ('application/gzip', None)
It is actually easy to force mimetypes.guess_type to do the right thing in this case:
mimetypes.encodings_map {'.gz': 'gzip', '.Z': 'compress', '.bz2': 'bzip2', '.xz': 'xz'} del mimetypes.encodings_map['.gz'] mimetypes.guess_type('hello.csv.gz') ('application/gzip', None)
Clearly this would be a sketchy patch. I guess the Pythonic way would be to instantiate the MimeTypes class with your desired data. See also: https://www.iana.org/assignments/media-types/media-types.xhtml https://tools.ietf.org/html/rfc6713
n.b. The non-standard .gzip extension returns:
mimetypes.guess_type('hello.csv.gzip') (None, None)
DjangoStorages uses mimetypes.guess_type to set Content-Type and Content-Encoding headesr and we need to intercept this wrong guess_type if we want our .csv.gz files to download properly as archives.
Version 1.9.1 of DjangoStorages introduces a method get_object_parameters
The method can be overridden to return parameters that are merged with the
param dict, so you could replace Content-Type and Content-Encoding.
def get_object_parameters(self, name):
return self.object_parameters.copy()
This is called by the private method _get_write_parameters
like this:
params.update(self.get_object_parameters(name))
Unfortunately, this method cannot REMOVE Content-Encoding from the param dict. So I have to override the private method _get_write_parameters.
class S3Boto3StorageFix(S3Boto3Storage):
def _get_write_parameters(self, name, content=None):
params = super(S3Boto3StorageFix, self)._get_write_parameters(name, content)
if name.endswith('.gz'):
params['ContentType'] = 'application/gzip'
if 'ContentEncoding' in params:
del params['ContentEncoding']
return params
If it's not too late, I propose the API should be:
def get_object_parameters(self, name, params):
return params.update(self.object_parameters.copy())
This would allow parameters to be removed as well as added and changed by overriding the public method.
~So the current docstring is a lie so I think it's already a bug.~ Current docstring is fine.
I don't want to break this API and don't like passing in params
. How about this instead:
def _get_write_parameters(self, name, content=None):
params = self.get_object_parameters(name)
_type, encoding = mimetypes.guess_type(name)
content_type = getattr(content, 'content_type', None)
content_type = content_type or _type or self.default_content_type
if 'ContentType' not in params:
params['ContentType'] = content_type
if 'ContentEncoding' not in params and encoding:
params['ContentEncoding'] = encoding
return params
More or less this is: if you set these things you are on your own.
As a patch:
def _get_write_parameters(self, name, content=None):
- params = {}
+ params = self.get_object_parameters(name)
_type, encoding = mimetypes.guess_type(name)
content_type = getattr(content, 'content_type', None)
content_type = content_type or _type or self.default_content_type
- params['ContentType'] = content_type
- if encoding:
+ if 'ContentType' not in params:
+ params['ContentType'] = content_type
+ if 'ContentEncoding' not in params and encoding:
params['ContentEncoding'] = encoding
- params.update(self.get_object_parameters(name))
return params
The idea here would be that you specify "ContentEncoding": None
which definitely feels like a hack but also is not the end of the world.
It does preserve the current public API for get_object_parameters
.
My only concern is, as written, boto3 would receive a header ContentEncoding=None in its kwargs. Does boto3 do the right thing in that case and NOT set the Content-Encoding metadata to blank on S3? Seems like it would be undesirable if I went into the S3 console and it showed two metadata, a correct Content-Type and a blank Content-Encoding. And I could see boto3 perhaps deciding to throw an error on receiving None.
Maybe there should be an extra clause to clean up:
if 'ContentEncoding' in params:
if params['ContentEncoding'] is None:
del params['ContentEncoding']
elif encoding:
params['ContentEncoding'] = encoding
I would suggest adding a comment in the get_object_parameters
docstring that mentions you can remove Content-Encoding by setting it to None, because that does seem slightly obscure.
@jschneier
From rfc7231:
... Content-Encoding is primarily used to allow a representation's data to be compressed without losing the identity of its underlying media type. ... If the media type includes an inherent encoding, such as a data format that is always compressed, then that encoding would not be restated in Content-Encoding ...
and, in slightly different words, from Mozilla:
... If the original media is encoded in some way (e.g. a zip file) then this information would not be included in the
Content-Encoding
header.
I think the question is: If a server does not adhere to the above, should it be considered broken, or should it just be considered irregular? If the former, then the present issue breaks our server.
Another interesting part of rfc7231:
... Likewise, an origin server might choose to publish the same data as multiple representations that differ only in whether the coding is defined as part of
Content-Type
orContent-Encoding
, since some user agents will behave differently in their handling of each response (e.g., open a "Save as ..." dialog instead of automatic decompression and rendering of content).
The idea here would be that you specify
"ContentEncoding": None
which definitely feels like a hack but also is not the end of the world.
@jschneier setting "ContentEncoding": None
would raise a validation error in the botocore.client:
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid type for parameter ContentEncoding, value: None, type: <class 'NoneType'>, valid types: <class 'str'>
So that should probably be an empty string: "ContentEncoding": ""
This could be implemented as follows:
class MyS3Storage(S3Boto3Storage):
def get_object_parameters(self, name):
params = super().get_object_parameters(name=name)
if str(name).endswith('.gz'):
params['ContentType'] = 'application/gzip'
params['ContentEncoding'] = ''
return params
(as far as I know, this also works without the proposed patch)
However, the concerns raised by @normangilmore are justified:
-
This adds an empty
Content-Encoding
header to the S3 object metadata: -
The empty
Content-Encoding
is included in the response headers if we download the file from S3:
Is there an actual use-case where it is desirable that a file uploaded as e.g. .tar.gz
ends up as .tar
after download?
If not, I would simply override the guess from mimetypes
. For example:
def _get_write_parameters(self, name, content=None):
...
_type, encoding = mimetypes.guess_type(name)
if str(name).endswith('.gz'):
_type = 'application/gzip'
encoding = None
...
So, to be clear, as of django-storages 1.14
, we can prevent the ContentEncoding
header from being set, by do something like:
class MyS3Boto3Storage(storages.backends.s3boto3.S3Boto3Storage):
...
def get_object_parameters(self, name):
parameters = super().get_object_parameters(name=name)
if name.endswith('.gz'):
parameters['ContentType'] = 'application/gzip'
return parameters