[python] Binary string response encoding with Python 3
When using binary string responses under Python 3, the encoding defaults to UTF8 which is incompatible with binary data. This looks to stem from Python 3 having str default to UTF8 and a new type called bytes which represents binary data. This is a significant departure from functionality in Python 2 and you can read more about it.
This issue does not exist in Python 2.
Swagger Revision: 18262c1b6132a28438e2c3d9426458c413f011f8 (master as of right now)
Response schema:
schema:
type: string
format: binary
Example Error:
Traceback (most recent call last):
File "./dr.py", line 13, in <module>
"document_type": "pdf", # pdf or xls or xlsx
File "/usr/local/lib/python3.5/site-packages/docraptor/apis/doc_api.py", line 203, in create_doc
callback=params.get('callback'))
File "/usr/local/lib/python3.5/site-packages/docraptor/api_client.py", line 322, in call_api
response_type, auth_settings, callback)
File "/usr/local/lib/python3.5/site-packages/docraptor/api_client.py", line 149, in __call_api
post_params=post_params, body=body)
File "/usr/local/lib/python3.5/site-packages/docraptor/api_client.py", line 358, in request
body=body)
File "/usr/local/lib/python3.5/site-packages/docraptor/rest.py", line 208, in POST
body=body)
File "/usr/local/lib/python3.5/site-packages/docraptor/rest.py", line 171, in request
r.data = r.data.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
I'm not sure what a good general fix would look like but I have a work-around for our use by delaying byte decoding until right before JSON deserialization and short-circuiting str. You can see my fix directly on generated code.
Related Issues and PRs:
- https://github.com/DocRaptor/docraptor-python/issues/2
- https://github.com/DocRaptor/docraptor-python/pull/3
@jqr thanks for the details. May I know if you've cycle to contribute the fix with a test case using this endpoint (the method name is automatically generated by codegen since it's empty)?
I'm not entirely sure the fix is generic enough to be used everywhere, as I only tested our use cases.
If someone else wants to run with this, they should as I won't have the time to address this in a generic fashion for a few weeks at least.
@jqr please take your time. We've not heard anyone else requesting this feature in Python yet.
Binary file responses are also being parsed as utf-8, causing errors. Does anyone know a temporary workaround for this?
schema:
type: file
format: binary
@JHibbard What about simply model the response as just type: file without format: binary?
The server sends bytes and client wants to decode it with utf8.
I've resolved it with removing the line: r.data = r.data.decode('utf8') in the rest.py file.
But then in the api_client.py file, in function __deserialize_file(self, response) it wants to write the response.data in an text file.
So I've edited this also, so the function opens the file as "wb":
with open(path, "wb") as f:
f.write(response.data)
But I don't know that what will now not work, but at least the file download works for me.
@wing328 I see the same happening in python 2.7, and I think is a more generic problem that binary files are not written to disk correctly because they may be corrupted by the "w" mode instead of "wb" as pointed out by @microo8 .
I think it would be formally more correct to have wb instead since both text and binary file are expected to be deserialized in here.
@g-bon I wonder if you can submit a PR with the suggested fix so that we can review more easily (remember to update the Python Petstore sample so that CIs can run the integration tests)
@wing328, switching to python 3 I realized that there are actually two separate issues. The first one is what I fixed with the PR 6936 and it affected both 2 and 3
The second one to my understanding is that in rest.py if PY3 is used and _preload_content is True, the response contents is always decoded as utf8. Which doesn't make sense for non-text files.
if six.PY3:
r.data = r.data.decode('utf8')
https://github.com/swagger-api/swagger-codegen/blob/62444a7aaf480d0fc12c048ce7c86f8978a16a89/modules/swagger-codegen/src/main/resources/python/rest.mustache#L213
- A simple approach could be to catch the UnicodeDecodeError and manage / swallow it. Not great because it would also affect eventual real UnicodeDecodeErrors.
- Another way could be to add a check on the content_type and make a whitelist/blacklist of content types that should be decoded to utf8.
- The third and hardest approach would be to infer the content type from the actual data but feels a fragile and overkill approach to me.
Happy to work on this when I have some confirmation the problem is here / feedback on the right approach.
Up cc @kenjones-cisco , @wing328
For my case, I'm returning a geojson file. This means that it is able to be decoded to utf8, but this will fail with the __deserialize_file method since that opens the file for binary writing. My solution was to remove the decode from "rest.moustache" (as was suggested), and move it without modification to after the file write conditionally happens in "api_client.moustache" (which was reasonable for my use case -- it's either a file or json/text.
def deserialize(self, response, response_type):
"""Deserializes response into an object.
:param response: RESTResponse object to be deserialized.
:param response_type: class literal for
deserialized object, or string of class name.
:return: deserialized object.
"""
# handle file downloading
# save response body into a tmp file and return the instance
if response_type == "file":
return self.__deserialize_file(response)
########### Moved from rest.mustache
if six.PY3:
response.data = response.data.decode('utf8')
logger.debug("response body: %s", response.data)
###########
# fetch data from response object
try:
data = json.loads(response.data)
except ValueError:
data = response.data
return self.__deserialize(data, response_type)
@rddesmond thanks for the great workaround!
From what I see the issue is not in skipping some particular Content-Type like in #9980 (which will fail for example for application/pdf, images etc), or having some hardcoded 'file' type which will trigger the different response processing logic.
The real issue is that swagger-codegen drops the format: binary from the model for type: string. If you'll check form parameters processing with debugOperations, you'll see the x-is-binary in vendorExtensions, which can be used in the template. But the response parameter doesn't have it, making it impossible to differentiate between strings and binary data.
Just want to say I've just hit on this issue as well.
Also hit the issue today
Still blocking in 2023
This worked for me:
Location: RESTClientObject.request (line 221)
# Use try/except to handle issue:
# https://github.com/swagger-api/swagger-codegen/issues/2305
try:
# In the python 3, the response.data is bytes.
# we need to decode it to string.
if six.PY3:
r.data = r.data.decode('utf8')
except UnicodeDecodeError:
content_type = r.getheader("Content-Type", None)
if 'application/zip' in content_type:
pass # allow zip content to be passed as bytes
else:
raise
This could be modified/expanded to match your content type.
You can also skip the serialization entirely by passing _preload_content=False in the method's kwargs: you will receive the HTTPResponse object with the bytes instead of the serialized data.
folder_path = "/path/to/output/folder"
res = api.my_api_endpoint(my_param, [...], _preload_content=False)
zip_bytes = io.BytesIO(res.data)
with zipfile.ZipFile(zip_bytes, "r") as zip_file:
zip_file.extractall(folder_path) # works!
Still, it's not an ideal solution and hopefully the issue is patched soon.
I have just hit the issue with latest swager version. May we know when can we have this fix merged or any help needed to get this thing in?
The issue has not been fixed since 2016. The most recent release (I'm using Docker swaggerapi/swagger-codegen-cli-v3 version 3.0.61) still has the bug.
I have a patch similar to some above, before the return of RESTClientObject.request in the rest.py file :
### START OF MANUAL PATCH
# see https://github.com/swagger-api/swagger-codegen/issues/2305
# In Python3, the `response.data` is bytes, so we need to decode it into str.
if six.PY3:
try:
r.data = r.data.decode('utf8')
except:
# some requests are not strings so we can't decode them
pass
### END OF MANUAL PATCH
I have put up a Minimal Reproducible Example :
- a demo server (to start separately)
- an OpenAPI file to generate the client
- a python file that do a request on the server
- an unit test file that checks the result of
ApiClient.deserialize - a patch.diff to apply the fix
- and a script to run all of this in a reproducible manner
➡️ swagger_codegen_issue_2305.zip
I think the problem stems from the way ApiClient.deserialize works (cf my unit test).
Click here for an explanation of the problem with doing `str(some_bytes)` in Python3
Sadly, the `str` type applied on `bytes` returns a string representing the "binary string" of the content.>>> str(b"example") # bytes
"b'example'"
>>> str("example") # Unicode string
'example'
Notice the leading b, and the inner single quotes (') in the first case.
Instead, the std.decode("some-encoding") method should be used.
>>> b"example".decode("utf-8")
'example'
This seems contrived, only because these are all ASCII characters. I can provide you real examples with non-ASCII characters.
But we can't deserialize bytes to str without knowing the encoding. And it is not always known, or true.
Per this StackOverflow answer, we should assume UTF-8 for application/json, and RFCs for HTTP give some defaults for MIME types of plain/*, but this is very limited.
So I don't know how to cleanly solve this issue, but I know that returning "b'something'" is incorrect.
I have opened other issues for Swagger Codegen, but there seem to be a lack of maintainers (at least for Python), so I don't know if there can be a way forward.