openapi-json-schema-generator
[REQ] Allow Content Type JSON Lines
Is your feature request related to a problem? Please describe.
One of our endpoints provides a stream with the content type application/json-lines, based on this format. One example of the returned data would be:
b'{"data": {"test_attribute_1": "example", "test_attribute_2": "example 2"}}\n{"data": {"test_attribute_1": "example 3", "test_attribute_2": "example 4"}}\n{"end": true}'
Currently, the regex used to determine whether a content type is JSON-based also matches the type above. Consequently, the client calls json.loads(response.data), which raises an error because the bytes contain multiple JSON documents.
In general, what is your approach for supporting different content types?
Describe the solution you'd like
It would be nice if the generator could support this content type. The deserialization could then look something like this:
import json

all_data = []
for line in data.split(b'\n'):
    if line:  # skip empty segments, e.g. from a trailing newline
        all_data.append(json.loads(line))
However, I am not sure how such content should be validated.
Describe alternatives you've considered
If we just use application/octet-stream as the content type, I get an error in the next validate_base step:
uai_annotation_store_client.exceptions.ApiTypeError: Invalid type. Required value type is str and passed type was FileIO at ['args[0]']
Additional context
Why are you sending JSON Lines data as binary when plain text would work? It says that it is UTF-8 encoded, so it could be a string. Where is the spec definition of that content type and payload?
The approach to supporting different content types can be seen in the response body deserializer. They are explicitly handled on a case-by-case basis for types like
- plain text
- json
- octet stream
- multipart form data
Why are you sending JSON Lines data as binary when plain text would work? It says that it is UTF-8 encoded, so it could be a string.
I am not sure, as I joined the project after this content type was decided on. Since it is not really a standardized content type, we are currently discussing whether to replace it with something else.
If we decide to stick with this content type, is it possible to support it in this library, or does it need to be one of the more standardized types like application/json? There is one other type that could be useful in our case.
I also had a quick look at the response deserializer, but I only found cases for the last three content types, not for plain text. Did I miss something?
My mistake, plain text is not on the list in python.
So my preference is not to support undefined content types unless there is significant prior work showing how the content type is sent and significant user need (lots of people want it).
Both of these look to be streamed JSON responses. Why not just get the raw response back and deserialize it manually in a helper that you define? It is not clear how to handle streams in OpenAPI. Should a function consume the response until it ends? What if it never ends? How should one terminate consumption of the response data early?
One way to return the data would be to return an io.IOBase context manager; that way the calling code could iterate on it and be responsible for closing it.
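That suggestion could be sketched as follows. This is only an illustration, assuming a line-delimited JSON body: `JsonLinesReader` is a hypothetical name, and per-record schema validation is only indicated in a comment since the real validation hooks depend on the generated client.

```python
import io
import json

class JsonLinesReader:
    """Hypothetical wrapper: iterate parsed JSON Lines records from a stream.

    The caller controls consumption (and can stop early) and is
    responsible for closing the stream, here via a context manager.
    """

    def __init__(self, stream: io.IOBase):
        self._stream = stream

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._stream.close()
        return False

    def __iter__(self):
        for line in self._stream:
            line = line.strip()
            if line:
                # A real implementation would validate each record against
                # the document-defined schema here before yielding it.
                yield json.loads(line)

# Usage: the calling code iterates and decides when to stop.
stream = io.BytesIO(b'{"data": 1}\n{"data": 2}\n{"end": true}\n')
with JsonLinesReader(stream) as reader:
    records = list(reader)
```

Because iteration is lazy, a consumer can break out of the loop to terminate consumption early, and the context manager still closes the underlying stream.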
There is apparently some traction on officially supporting streaming responses in the OpenAPI specification. They have a meeting tomorrow where, among other things, they will discuss how to support it. For more information see this issue and PR.
One way to return the data would be to return an io.IOBase context manager, that way the calling code could iterate on it and be responsible for closing it.
This would also mean that validation is not automatically performed by the library, and the user needs to do it manually while iterating? I think for our use case it is sufficient to have a way to just get the response from the server without validation.
When iterating, the validation would be run.
Should a function consume the response until it ends? What if it never ends? How should one terminate consumption of the response data early? One way to return the data would be to return an io.IOBase context manager; that way the calling code could iterate on it and be responsible for closing it.
Do you have a more detailed plan on how you would implement the functionality?
In our use case, we could work with either getting the raw response or supporting a different content type like JSON sequence.
As we need the streaming endpoint to work, is there a way I can help?
My responses described that a context manager would be returned and methods could be called on it to yield validated results. JSON sequence is an acceptable feature to add to the code base because it has an RFC.
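For reference, JSON sequence (application/json-seq, RFC 7464) frames records differently from JSON Lines: each record is preceded by an ASCII record separator (RS, 0x1E) and followed by a line feed. A minimal parser sketch, assuming a well-formed sequence (the function name is mine, not from any library):

```python
import json

RS = b"\x1e"  # record separator, per RFC 7464

def parse_json_seq(data: bytes):
    """Parse an application/json-seq payload into a list of records.

    Minimal sketch: assumes every record starts with RS and ends with
    a line feed, as RFC 7464 prescribes for well-formed sequences.
    """
    records = []
    for chunk in data.split(RS):
        chunk = chunk.strip()
        if chunk:
            records.append(json.loads(chunk))
    return records
```

The RFC-defined framing is what makes this type a reasonable candidate for built-in support, whereas json-lines has no equivalent standard to implement against.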
Paths forward here are:
- you calling the existing raw-response-returning methods and deserializing the bytes as you describe. You can validate payloads using document-defined schemas.
- you submitting a PR with a proposed feature
- me submitting a PR with the feature. I am applying to jobs at this time; if this is something you want, paying me for the work would be motivating. Otherwise my suggestion is option 1 or 2.
What were the results of the openapi meeting?
@Marcelo00 never heard back from you here. How would you like to move forward with this?
Sorry, I forgot to inform you about our decision. For our use case it was sufficient to just get the raw response back.
I also watched part of the recent OpenAPI meeting, but it seems it will take more time until the different streaming content types (such as JSON Lines) are officially supported by OpenAPI. However, versions 3.0.4, 3.1.1, and 3.2.0 support two format options on the string type that can be used to indicate either byte or binary content, depending on the actual data (see this link for version 3.0.4). The PR I previously posted has also been merged.
Closing this issue because the end user can use existing functionality (receive the raw response and iterate through the body, deserializing each line of content using OpenAPI-document-defined schemas) to meet their needs.