Zappa icon indicating copy to clipboard operation
Zappa copied to clipboard

(#908) Update BINARY_SUPPORT to use Content-Encoding to identify if data is binary

Open monkut opened this issue 2 years ago • 4 comments

migrate https://github.com/zappa/Zappa/pull/971 to lastest master

Contribution by @Quidge

Description

This PR re-implements the same functional change that was shown in a PR from the pre-fork Miserlou/zappa repository.

The change in that PR didn't make it into the Zappa 52 release from what I can tell. My company has been using that PR's code (in combination with the flask_compress library) for a couple weeks without issue so I'm opening this PR that copies that code.

Also, when using whitenoise for caching, which provides compression, binary types may include mimetypes, "text/", "application/json":

response.mimetype.startswith("text/") response.mimetype == "application/json" Assuming that Content-Encoding will be set (as whitenoise apparently does) this allows compression to be applied to assets which have mimetypes "text/" and "application/json", while allowing for uncompressed "text/" and "application/json" still to be served.

About Content-Encoding: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding

This PR adds a new settings, additional_text_mimetypes, to allow the user to define types they want to be considered as text when BINARY_SUPPORT is True.

GitHub Issues

#908

monkut avatar Jul 28 '22 01:07 monkut

Migrating conversions from previous PR

zeevt commented on 9 Dec 2021

CSV (and plain text) can be in any character encoding, it is not guaranteed that it's ASCII or UTF-8. If just .decode() works, great. If there is a header like Content-Type: text/csv; charset=windows-1252 (there usually isn't) then it can be decoded to str without relying on default charset. But if neither of those is true, then you have bytes that should be base64 encoded.

javulticat commented on 9 Dec 2021

.zeevt that's my understanding as well, so I believe it would be fair to say that this is a potential concern for any text file, correct? I was confused because .Quidge's message implied this was a concern specific to CSVs (which I definitely may have just misinterpreted - sorry!) If so, would a solution be to sniff the encoding of plain text and use that to decode it using the proper encoding (and, if that fails, fall back on using base64-encoded bytes)? Something like:

import chardet

# If plain text
try:
    encoding = chardet.detect(response.data)["encoding"]
    decoded = response.data.decode(encoding)
except Exception:
    # base64 encode the bytes

zeevt commented on 9 Dec 2021

javulticat You understood me correctly, yes. I have two concerns with this, now that I thought about it a tiny bit more: Maybe someone is trying to return an http response with "content-type: text/foo" and with a body encoded with whatever encoding on purpose, and the proposed code would decode it and re-encode it in UTF-8 if chardet guesses correctly. This would be a bug. They should be able to ensure the exact bytes they intended are received by the HTTP client, without forcing application/octet-stream content type. How fast is chardet? I would hesitate running it on every response without measuring first. I now think it's only safe to decode bytes to str if it actually decodes as UTF-8.

monkut avatar Jul 28 '22 01:07 monkut

Coverage Status

Coverage increased (+0.2%) to 74.303% when pulling 15532f1e5a1f285c93ad901c01602727b045684e on feature/issue-908-update-binarysupport-handling into 34e3065db10019e8646b5134af5defb1ef483d5c on master.

coveralls avatar Jul 28 '22 01:07 coveralls

It appears that the CSV related discussion is not directly related to BINARY_SUPPORT = True, but relates to the line below where text is decoded using the werkzug Response object's get_data(as_text=True) method.

zappa_returndict["body"] = response.get_data(as_text=True)

If that's correct, I think text decode is a separate issue from what the BINARY_SUPPORT feature is trying to do in this PR.

monkut avatar Jul 28 '22 06:07 monkut

@javulticat If you have some time, can you take a look at this, I tried to update it and and cleanup the implementation.

monkut avatar Sep 08 '22 01:09 monkut

(currently running this branch in a personal project of mine to confirm stability)

monkut avatar Oct 25 '22 22:10 monkut

Thank you maintainers for running with this for as long as you have. 🙏

Quidge avatar Nov 11 '22 14:11 Quidge