Zappa
Zappa copied to clipboard
(#908) Update BINARY_SUPPORT to use Content-Encoding to identify if data is binary
migrate https://github.com/zappa/Zappa/pull/971 to lastest master
Contribution by @Quidge
Description
This PR re-implements the same functional change that was shown in a PR from the pre-fork Miserlou/zappa repository.
The change in that PR didn't make it into the Zappa 52 release from what I can tell. My company has been using that PR's code (in combination with the flask_compress
library) for a couple weeks without issue so I'm opening this PR that copies that code.
Also, when using whitenoise for caching, which provides compression, binary types may include mimetypes, "text/", "application/json":
response.mimetype.startswith("text/") response.mimetype == "application/json" Assuming that Content-Encoding will be set (as whitenoise apparently does) this allows compression to be applied to assets which have mimetypes "text/" and "application/json", while allowing for uncompressed "text/" and "application/json" still to be served.
About Content-Encoding: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding
This PR adds a new settings, additional_text_mimetypes
, to allow the user to define types they want to be considered as text when BINARY_SUPPORT is True
.
GitHub Issues
#908
Migrating conversions from previous PR
zeevt commented on 9 Dec 2021
CSV (and plain text) can be in any character encoding, it is not guaranteed that it's ASCII or UTF-8. If just .decode() works, great. If there is a header like Content-Type: text/csv; charset=windows-1252 (there usually isn't) then it can be decoded to str without relying on default charset. But if neither of those is true, then you have bytes that should be base64 encoded.
javulticat commented on 9 Dec 2021
.zeevt that's my understanding as well, so I believe it would be fair to say that this is a potential concern for any text file, correct? I was confused because .Quidge's message implied this was a concern specific to CSVs (which I definitely may have just misinterpreted - sorry!) If so, would a solution be to sniff the encoding of plain text and use that to decode it using the proper encoding (and, if that fails, fall back on using base64-encoded bytes)? Something like:
import chardet
# If plain text
try:
encoding = chardet.detect(response.data)["encoding"]
decoded = response.data.decode(encoding)
except Exception:
# base64 encode the bytes
zeevt commented on 9 Dec 2021
javulticat You understood me correctly, yes. I have two concerns with this, now that I thought about it a tiny bit more: Maybe someone is trying to return an http response with "content-type: text/foo" and with a body encoded with whatever encoding on purpose, and the proposed code would decode it and re-encode it in UTF-8 if chardet guesses correctly. This would be a bug. They should be able to ensure the exact bytes they intended are received by the HTTP client, without forcing application/octet-stream content type. How fast is chardet? I would hesitate running it on every response without measuring first. I now think it's only safe to decode bytes to str if it actually decodes as UTF-8.
Coverage increased (+0.2%) to 74.303% when pulling 15532f1e5a1f285c93ad901c01602727b045684e on feature/issue-908-update-binarysupport-handling into 34e3065db10019e8646b5134af5defb1ef483d5c on master.
It appears that the CSV related discussion is not directly related to BINARY_SUPPORT = True
, but relates to the line below where text is decoded using the werkzug Response object's get_data(as_text=True)
method.
zappa_returndict["body"] = response.get_data(as_text=True)
If that's correct, I think text decode is a separate issue from what the BINARY_SUPPORT feature is trying to do in this PR.
@javulticat If you have some time, can you take a look at this, I tried to update it and and cleanup the implementation.
(currently running this branch in a personal project of mine to confirm stability)
Thank you maintainers for running with this for as long as you have. 🙏