dspy
JSON Support
Is it possible to use json objects for context and answers? Is the best solution to do something like:
answer = dspy.OutputField(desc="""
Use the following format:
{"summary": string, "commands": [{"name": string}]}
""")
Or some other JSON Schema.
Hey @thomasahle. In typical usage, the Signature itself serves as the schema. That is, you can request multiple output fields (e.g., summary and commands) for your structure.
Here's one example:
class ExampleSignature(dspy.Signature):
    question = dspy.InputField()
    summary = dspy.OutputField()
    json = dspy.OutputField(desc='key-value pairs')

generate = dspy.Predict(ExampleSignature)
response = generate(question="What are the acronyms of west-coast US states?")
Here, response is:
Prediction(
    summary='The acronyms of west-coast US states are CA (California), OR (Oregon), and WA (Washington).',
    json='{\n "CA": "California",\n "OR": "Oregon",\n "WA": "Washington"\n}'
)
This is zero-shot usage, which is not the most reliable. Generally, you'd then do one of two things:
- Compile on a few examples to optimize the prompt so it does whatever you need more reliably. Usually, your goal isn't just valid JSON but also useful values in the JSON outputs (i.e., correct semantically and syntactically).
- Use dspy.Suggest with criteria for valid JSON fitting your schema, if any. This will retry if there are failures and can optimize the prompt selection during compiling. Read more here about DSPy assertions.
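The retry idea behind dspy.Suggest can be sketched framework-free: a JSON validator plus a bounded retry loop that feeds the failure back into the prompt. Note that `generate` here is a hypothetical stand-in for an LM call, not a DSPy API:

```python
import json

def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False

def generate_with_retries(generate, prompt, max_retries=3):
    """Call `generate` (a stand-in for an LM call) until it returns valid JSON."""
    feedback = ""
    for _ in range(max_retries):
        output = generate(prompt + feedback)
        if is_valid_json(output):
            return output
        # Feed the failure back into the next attempt, like an assertion message.
        feedback = "\nYour previous answer was not valid JSON. Please try again."
    raise ValueError("no valid JSON after %d attempts" % max_retries)
```

dspy.Suggest wraps this pattern: you supply the boolean check and a feedback message, and the framework handles the retries.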
That said, if your goal is explicitly to generate valid JSON, we're lacking solid guidance and tooling for that. I think Outlines has the right answers there (though I suspect we can optimize cost-wise) and could serve as a great backend for structured outputs in DSPy.
Thanks Omar. I guess it would be cool if I could define a schema, like
states_schema = """{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "The name of the state"
},
"abbreviation": {
"type": "string",
"description": "The two-letter postal abbreviation for the state"
},
"capital": {
"type": "string",
"description": "The capital city of the state"
}
},
"required": ["name", "abbreviation", "capital"]
}
}"""
And then write
class ExampleSignature(dspy.Signature):
    question = dspy.InputField()
    summary = dspy.OutputField()
    states = dspy.OutputField(desc='list of states', schema=states_schema)
An alternative, which I don't know whether it would be easier or harder, would be Pydantic:
from typing import List
from pydantic import BaseModel, Field

class State(BaseModel):
    name: str = Field(..., description="The name of the state")
    abbreviation: str = Field(..., description="The two-letter postal abbreviation for the state")
    capital: str = Field(..., description="The capital city of the state")

class ExampleSignature(dspy.Signature):
    question = dspy.InputField()
    summary = dspy.OutputField()
    states = dspy.OutputField(desc='list of states', type=List[State])
But Pydantic support would follow directly from json schema support, since you can do
class States(BaseModel):
    states: List[State] = Field(..., description="List of states")

class ExampleSignature(dspy.Signature):
    question = dspy.InputField()
    summary = dspy.OutputField()
    states = dspy.OutputField(desc='list of states', schema=States.schema())
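For reference, deriving the JSON schema from the Pydantic models works today; this sketch handles both Pydantic v1 (`.schema()`) and v2 (`.model_json_schema()`), which renamed the method:

```python
from typing import List
from pydantic import BaseModel, Field

class State(BaseModel):
    name: str = Field(..., description="The name of the state")
    abbreviation: str = Field(..., description="The two-letter postal abbreviation for the state")
    capital: str = Field(..., description="The capital city of the state")

class States(BaseModel):
    states: List[State] = Field(..., description="List of states")

# Pydantic v2 renamed .schema() to .model_json_schema(); support both.
schema_fn = getattr(States, "model_json_schema", None) or States.schema
schema = schema_fn()
```

The resulting dict is a standard JSON schema (nested models end up under `definitions` in v1 or `$defs` in v2), so it could be passed wherever a raw schema string is expected.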
This would be very helpful
Here's an example of a DSPy program just putting the schema into the desc: https://gist.github.com/thomasahle/4f650afe7305601fc3e417dda7aecb3c
It works on this simple example; I'm not sure how far you can scale it up. The nice thing is that Pydantic provides useful error messages when the output is wrong, which can be fed back into DSPy assertions.
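That feedback loop can be sketched like this; `getattr` handles the Pydantic v1/v2 API rename, and the returned message is the kind of string you would hand to an assertion:

```python
from pydantic import BaseModel, ValidationError

class State(BaseModel):
    name: str
    abbreviation: str
    capital: str

def validation_feedback(raw: str):
    """Return None if `raw` matches the schema, else a readable error message."""
    # Pydantic v2 uses model_validate_json(); v1 used parse_raw().
    validate = getattr(State, "model_validate_json", None) or State.parse_raw
    try:
        validate(raw)
        return None
    except ValidationError as e:
        # Pydantic's message names the missing/invalid fields, which is
        # exactly the feedback an LM needs to repair its output.
        return str(e)

feedback = validation_feedback('{"name": "California", "abbreviation": "CA"}')
```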
I think this is such a common use case that an example of using DSPy together with Outlines or Instructor would be very useful.
I tried Outlines with the OpenAI API, but it fails. I also had the same issue with Guidance and LMQL.
@younes-io This is not a feature that's in DSPy yet.
If you want something that works right now, you could try the code I linked above, or this https://github.com/thomasahle/dspy/blob/main/examples/types.py
@thomasahle I know it's not a DSPy feature; I was pointing out that Outlines does not provide JSON output due to OpenAI API limitations.
Ah right. Outlines mostly works with local models, since it does very tightly controlled generation. But ideally, of course, all libraries should keep the model layer flexible enough to take advantage of whichever API or model features are available, to make DSPy most efficient.
I was able to get the Instructor library to partially work with DSPy (for a basic program at least). inspect_history doesn't work, but this at least gets a structured response from OpenAI using Instructor and DSPy. It took creating a custom LM and then passing the Pydantic response_model as an argument when instantiating the predict module.
Colab notebook is here.
Thanks Craig. For me this issue is actually solved by typed DSPy. @CyrusOfEden is working on adding deeper integration with OpenAI. He'd probably be interested in discussing the options!
OK thanks @thomasahle. Could you comment on what you mean by typed DSPy? I can't seem to find any reference to this. Thanks
Sure, I mean using pydantic types like this: https://github.com/stanfordnlp/dspy?tab=readme-ov-file#5-pydantic-types
@thomasahle the current solution is not as strong as what @stantonius proposed.
- The current DSPy implementation gently asks the LLM to generate JSON, and retries multiple times if it fails. This generally only works for toy examples.
- The implementation proposed by @stantonius relies on function calling or constrained generation via the Instructor package. This has stronger guarantees of generating valid JSON.
How do you think we could add @stantonius's solution to DSPy?
A workaround that works for the OpenAI and Llama families of models: append the format instructions to the list of items sent as context.
Code snippet:
self.parser = PydanticOutputParser(pydantic_object=AnswerFormat)  # Langchain way
self.instructions = self.parser.get_format_instructions()
self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
.......
prediction = self.generate_answer(context=context + [self.instructions], question=question)
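The same workaround can be sketched without Langchain: serialize the schema into an instruction string and append it to the context list. The schema dict and passage strings here are illustrative:

```python
import json

def format_instructions(schema: dict) -> str:
    """Render a JSON schema as an instruction passage for the context."""
    return ("Answer with JSON conforming to this schema:\n"
            + json.dumps(schema, indent=2))

# Illustrative schema; in practice you might derive it from a Pydantic model.
answer_schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

context = ["Passage 1 ...", "Passage 2 ..."]
# The instructions ride along as one more context item.
augmented_context = context + [format_instructions(answer_schema)]
```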
@oulianov Instructor also uses prompting and retrying. In general, validating a Pydantic schema is Turing-complete, since validators can run arbitrary code. Stuff like constrained generation only works on simple types / toy examples.
@thomasahle I don't understand your point.
In my experience, DSPy did NOT work on toy examples (invalid JSON, too many retries), while Instructor DID work on complex examples.
This is because Instructor's approach has overall MORE guarantees of respecting the schema.
How do you think we can combine both approaches to maximize DSPy's performance across all use cases?