Replace JSON parsing with pydantic from_json
Use from_json from pydantic to replace the existing streamed JSON parsing:
https://github.com/jackmpcollins/magentic/blob/ee082fc02e0edb9914b76d191ed38d6e2d42f027/src/magentic/streaming.py#L125-L209
https://docs.pydantic.dev/latest/concepts/json/#partial-json-parsing
from pydantic_core import from_json
partial_json_data = '["aa", "bb", "c'
result = from_json(partial_json_data, allow_partial=True)
print(result)
#> ['aa', 'bb']
Something like
current_object = 0
accumulated_chunks = ""
for chunk in stream:
    accumulated_chunks += chunk
    objects = from_json(accumulated_chunks, allow_partial=True)
    # The current object is complete once the next object starts being generated
    if len(objects) > current_object + 1:
        yield objects[current_object]
        current_object += 1
yield objects[-1]
Performance is probably not a concern as the LLM generation will most likely be the bottleneck.
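Wrapped in a generator function, the idea above might look like the following sketch (the function name iter_json_array_items is illustrative, and a while loop is used in case multiple items complete within a single chunk):

```python
from collections.abc import Iterable, Iterator

from pydantic_core import from_json


def iter_json_array_items(chunks: Iterable[str]) -> Iterator[object]:
    """Yield items of a streamed JSON array as each item completes."""
    current_object = 0
    accumulated_chunks = ""
    for chunk in chunks:
        accumulated_chunks += chunk
        objects = from_json(accumulated_chunks, allow_partial=True)
        # An item is complete once the next item has started being generated
        while len(objects) > current_object + 1:
            yield objects[current_object]
            current_object += 1
    # At end of stream, yield any remaining items from the (now complete) JSON
    yield from from_json(accumulated_chunks)[current_object:]
```

For example, iter_json_array_items(['["aa", ', '"bb", "c', 'c"]']) yields "aa", "bb", "cc" in order, with each item emitted as soon as the stream confirms it is complete.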
Hello @jackmpcollins
I’ve been using this project and would love to contribute and help improve it in return! :) I have some experience in model processing with Python, and I’d be happy to start with this issue. However, if there are more urgent tasks, feel free to let me know.
@jo-s-eph Thanks! This would be a good one to start with. Please also keep track of (or fix) any tricky parts of getting set up to contribute. Hopefully it's as easy as running make help and then following the commands there to install etc.
Thinking more about this issue now, in order to generate an array/list response from the LLM magentic gives a schema for a pydantic model like
class Output(BaseModel):
    value: list[str]
So the response that comes back is not an array but actually a JSON object like {"value": ["one", "two", ...]}. So the solution to this is probably something like
- a function that takes the iterable of string chunks and yields the concatenated string after receiving each chunk
- a function that takes a pydantic model and an iterable of string chunks and (using the above function) yields an instance of this (partial) pydantic model after receiving each chunk
- where iter_streamed_json_array is currently used, have something similar to the pseudocode above that loops over the Output instance created after receiving each chunk, and yields the items out of value as they complete
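The two helper functions could be sketched as follows (the names iter_accumulated and iter_partial_models are illustrative, and model_construct is used here as an assumption to skip validation on still-incomplete data):

```python
from collections.abc import Iterable, Iterator

from pydantic import BaseModel
from pydantic_core import from_json


class Output(BaseModel):
    value: list[str]


def iter_accumulated(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the concatenated string after receiving each chunk."""
    accumulated = ""
    for chunk in chunks:
        accumulated += chunk
        yield accumulated


def iter_partial_models(
    model: type[BaseModel], chunks: Iterable[str]
) -> Iterator[BaseModel]:
    """Yield a (partial) model instance after receiving each chunk."""
    for accumulated in iter_accumulated(chunks):
        data = from_json(accumulated, allow_partial=True)
        # model_construct skips validation, so incomplete data does not raise
        yield model.model_construct(**data)
```

Then the items of value can be yielded out of each successive Output instance as they complete, as in the pseudocode above.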
@jackmpcollins
Sorry for the delay. I've investigated replacing the JSON parsing with Pydantic's from_json and discovered some limitations that might suggest we should keep the current JsonArrayParserState implementation:
Numeric Keys Limitation
I created test cases demonstrating that from_json cannot properly handle numeric keys in JSON objects, which are valid in our use case. For example:
https://github.com/jackmpcollins/magentic/blob/ee082fc02e0edb9914b76d191ed38d6e2d42f027/tests/test_streaming.py#L129
# Fails with: key must be a string at line 1 column 13
from_json('{"value": [{2: "test"}]}')
# Fails with: key must be a string at line 1 column 20
from_json('{"value": [{"a": 1, 2: "b"}]}')
Streaming Issues
When attempting to parse partial JSON with numeric keys:
# Fails immediately on numeric key
from_json('{"value": [{2: "te', allow_partial=True)
# Cannot handle split numeric keys
from_json('{"value": [{', allow_partial=True)
from_json('2: "test"}]}', allow_partial=True)
I've created a simple test file demonstrating these and other limitations: test_pydantic_json_keys.py
Pydantic's from_json with allow_partial=True is a great feature, but it doesn't meet our specific needs. We require the ability to handle non-standard JSON with numeric keys, preserve exact key types, and process streaming data with split chunks. Therefore, I recommend keeping the current JsonArrayParserState implementation.
Let me know if you agree with this assessment!
@jo-s-eph From what I can see, JSON keys must be strings, so this seems fine to me! https://restfulapi.net/valid-json-key-names/
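A quick check with the standard library supports this: Python's built-in json module rejects numeric keys too, since the JSON specification requires object keys to be strings, so from_json raising here matches the spec rather than being a limitation.

```python
import json

# Numeric keys are not valid JSON, so the stdlib parser rejects them as well
try:
    json.loads('{"value": [{2: "test"}]}')
except json.JSONDecodeError as exc:
    print(f"stdlib json also rejects numeric keys: {exc}")
```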
Is there a separate streaming issue, or is this only in the case of numeric keys too? Thanks