openapi-json-schema-generator

[REQ] Allow Content Type JSON Lines

Open Marcelo00 opened this issue 1 year ago • 9 comments

Is your feature request related to a problem? Please describe.

One of our endpoints provides a stream with content type application/json-lines based on this format. One example of the returned data would be

b'{"data": {"test_attribute_1": "example", "test_attribute_2": "example 2"}}\n{"data": {"test_attribute_1": "example 3", "test_attribute_2": "example 4"}}\n{"end": true}'

Currently, the regex used to determine whether a content type is JSON-based also matches application/json-lines. Consequently, the client calls json.loads(response.data), which raises an error because the bytes contain multiple JSON documents.
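The failure can be reproduced with the standard library alone. This is a minimal sketch, assuming the body arrives as bytes shaped like the example above; the `body` value is made up for illustration and is not taken from the generated client.

```python
import json

# Two JSON documents separated by a newline, as in a JSON Lines body.
body = b'{"data": {"test_attribute_1": "example"}}\n{"end": true}'

try:
    json.loads(body)
except json.JSONDecodeError as exc:
    # json.loads accepts exactly one document, so the second line is rejected.
    error_message = exc.msg

# Parsing line by line succeeds.
records = [json.loads(line) for line in body.splitlines() if line.strip()]
```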

In general, what is your approach for supporting different content types?

Describe the solution you'd like

It would be nice if the generator could support this content type. The deserialization could then look something like this:

import json

all_data = []
for line in data.split(b'\n'):
    if line.strip():  # skip blank lines, e.g. after a trailing newline
        all_data.append(json.loads(line))

However, I am not sure how such content should be validated.

Describe alternatives you've considered

If we just use application/octet-stream as the content type, I will get an error in the next validate_base step: uai_annotation_store_client.exceptions.ApiTypeError: Invalid type. Required value type is str and passed type was FileIO at ['args[0]']

Additional context

Marcelo00 • Apr 22 '24 17:04

Why are you sending JSON Lines data as binary when plain text will work? The format description says it is UTF-8 encoded, so it could be a string. Where is the spec definition of that content type and payload?

spacether • Apr 22 '24 18:04

The approach to supporting different content types can be seen in the response body deserializer. Content types are handled explicitly, on a case-by-case basis, for types like

  • plain text
  • json
  • octet stream
  • multipart form data

spacether • Apr 22 '24 18:04

Why are you sending json lines data as binary when plain text will work? It says that it is utf8 encoded so it could be string.

I am not sure, as I joined the project after they decided on this content type. As it is not really a standardized content type, we are currently discussing whether we should replace it with something else.

If we decide to stick with this content type, is it possible to support it in this library, or does it have to be one of the more standardized types like application/json? There is one other type that could be useful in our case.

I also had a quick look at the response deserializer, but I only found cases for the last three content types and not for plain text. Did I miss something?

Marcelo00 • Apr 23 '24 14:04

My mistake, plain text is not on the list for the Python generator.

spacether • Apr 23 '24 17:04

So my preference is not to support undefined content types unless there is significant prior work showing how the content type is sent and significant user need (lots of people want it).

Both of these look to be streamed JSON responses. Why not just get back the raw response and deserialize it manually in a helper that you define? It is not clear how to handle streams in OpenAPI. Should a function consume the response until it ends? What if it never ends? How should one terminate consumption of the response data early?

One way to return the data would be to return an io.IOBase context manager; that way the calling code could iterate on it and be responsible for closing it.
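The context-manager idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `JsonLinesStream` and its `validate` hook are hypothetical names, not part of the generated client, and `io.BytesIO` stands in for the raw HTTP response body.

```python
import io
import json
from typing import Any, BinaryIO, Callable, Iterator, Optional


class JsonLinesStream:
    """Hypothetical wrapper: iterates over a binary stream one JSON line at a time."""

    def __init__(
        self,
        raw: BinaryIO,
        validate: Optional[Callable[[Any], Any]] = None,
    ) -> None:
        self._raw = raw
        self._validate = validate  # optional per-record validation hook

    def __enter__(self) -> "JsonLinesStream":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Closing here lets the caller stop early without leaking the stream.
        self._raw.close()

    def __iter__(self) -> Iterator[Any]:
        for line in self._raw:  # binary file objects yield one line per iteration
            if not line.strip():
                continue
            record = json.loads(line)
            if self._validate is not None:
                record = self._validate(record)
            yield record


# io.BytesIO stands in for the raw HTTP response body (an assumption).
raw = io.BytesIO(b'{"data": 1}\n{"data": 2}\n{"end": true}\n')
seen = []
with JsonLinesStream(raw) as stream:
    for record in stream:
        if record.get("end"):
            break  # the caller decides when to stop consuming
        seen.append(record)
```

With this shape the caller controls termination: breaking out of the loop stops consumption, and leaving the `with` block closes the underlying stream even if the server never ends the response.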

spacether • Apr 23 '24 17:04

There is apparently some traction on officially supporting streaming responses in the OpenAPI specification. They will have a meeting tomorrow where they will, among other things, discuss how to support it. For more information see this issue and PR.

One way to return the data would be to return an io.IOBase context manager, that way the calling code could iterate on it and be responsible for closing it.

Would this also mean that the library does not run validation automatically, and that the user needs to do it manually after iterating? I think for our use case it is sufficient if we have a way to just get the response from the server without validation.

Marcelo00 • Apr 24 '24 09:04

Validation would be run while iterating.

spacether • Apr 24 '24 14:04

Should a function consume the response until it ends? What if it never ends? How should one terminate consumption of the response data early? One way to return the data would be to return an io.IOBase context manager, that way the calling code could iterate on it and be responsible for closing it.

Do you have a more detailed plan for how you would implement the functionality?

In our use case, we could work with either getting the raw response or supporting a different content type like JSON sequence.

As we need the streaming endpoint to work, is there a way I can help?

Marcelo00 • Apr 29 '24 15:04

My responses described that a context manager would be returned and that methods could be called on it to yield validated results. JSON sequence is an acceptable feature to add to the code base because it has an RFC.
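For reference, JSON text sequences are defined in RFC 7464 (media type application/json-seq): each JSON text is preceded by an RS byte (0x1E) and followed by a line feed. A minimal decoding sketch, assuming the whole body is already in memory; `parse_json_seq` is a hypothetical helper name, not an existing function in this library.

```python
import json
from typing import Any, List

RECORD_SEPARATOR = b"\x1e"  # RS, which RFC 7464 places before each JSON text


def parse_json_seq(body: bytes) -> List[Any]:
    """Hypothetical helper: decode a complete application/json-seq body."""
    records = []
    for chunk in body.split(RECORD_SEPARATOR):
        if chunk.strip():  # skip the empty chunk before the first RS
            records.append(json.loads(chunk))
    return records


body = b'\x1e{"data": 1}\n\x1e{"data": 2}\n\x1e{"end": true}\n'
parsed = parse_json_seq(body)
```

This sketch raises on a malformed or truncated text; a fuller parser would probably need to decide whether to skip such entries and continue with the next record.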

Paths forward here are:

  1. You call the existing methods that return the raw response and deserialize the bytes as you describe. You can validate payloads using the schemas defined in the document.
  2. You submit a PR with a proposed feature.
  3. I submit a PR with the feature. I am applying to jobs at this time. If this is something that you want, you paying me for the work would be motivating. Otherwise my suggestion is option 1 or 2.

What were the results of the openapi meeting?

spacether • Apr 29 '24 15:04

@Marcelo00 I never heard back from you here. How would you like to move forward with this?

spacether • May 13 '24 16:05

Sorry, I forgot to inform you about our decision. For our use case it was sufficient to just get the raw response back.

I also watched part of the recent OpenAPI meeting, but it seems that it will take more time before the different streaming content types (such as jsonlines) are officially supported by OpenAPI. However, versions 3.0.4, 3.1.1 and 3.2.0 support two format options for the type string that can be used to define either bytes or binary, depending on the actual content (see this link for version 3.0.4). The PR I previously posted has also been merged.

Marcelo00 • May 14 '24 14:05

Closing this issue because the end user can meet their needs with existing functionality: receive the raw response and iterate through the body, deserializing each line of content using the schemas defined in the OpenAPI document.
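A rough sketch of that workaround, under stated assumptions: `validate_record` is a hand-written, hypothetical stand-in for validation against a schema from the OpenAPI document (the generated client's own schema classes are not shown here), and the record shape is taken from the example payload in the original request.

```python
import json
from typing import Any, Dict, List


def validate_record(record: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for validation against a document-defined schema."""
    if not isinstance(record, dict):
        raise TypeError(f"expected an object, got {type(record).__name__}")
    if record.get("end") is True:
        return record  # end-of-stream marker from the example payload
    data = record.get("data")
    if not isinstance(data, dict) or not all(isinstance(v, str) for v in data.values()):
        raise ValueError("'data' must be an object with string values")
    return record


def deserialize_json_lines(body: bytes) -> List[Dict[str, Any]]:
    """Parse and validate each non-empty line of a raw response body."""
    return [
        validate_record(json.loads(line))
        for line in body.splitlines()
        if line.strip()
    ]


raw_body = (
    b'{"data": {"test_attribute_1": "example", "test_attribute_2": "example 2"}}\n'
    b'{"end": true}'
)
validated = deserialize_json_lines(raw_body)
```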

spacether • May 14 '24 16:05