
JSON Support

Open thomasahle opened this issue 1 year ago • 9 comments

Is it possible to use json objects for context and answers? Is the best solution to do something like:

answer = dspy.OutputField(desc="""
    Use the following format:
    {"summary": string, "commands": [{"name": string}]}
""")

Or is there some other way, e.g., using a JSON Schema?

thomasahle avatar Jan 04 '24 15:01 thomasahle

Hey @thomasahle. In typical usage, the Signature itself serves as the schema. That is, you can request multiple output fields (e.g., summary and commands) for your structure.

Here's one example:

import dspy

class ExampleSignature(dspy.Signature):
    question = dspy.InputField()
    summary = dspy.OutputField()
    json = dspy.OutputField(desc='key-value pairs')

generate = dspy.Predict(ExampleSignature)
response = generate(question="What are the acronyms of west-coast US states?")

Here, response is:

Prediction(
    summary='The acronyms of west-coast US states are CA (California), OR (Oregon), and WA (Washington).',
    json='{\n  "CA": "California",\n  "OR": "Oregon",\n  "WA": "Washington"\n}'
)

This is zero-shot usage, which is not the most reliable. Generally, you'd then do one of two things:

  • Compile on a few examples to optimize the prompt so it does whatever you need more reliably. Usually, your goal isn't just valid JSON but also useful values in the JSON outputs (i.e., correct semantically and syntactically).
  • Use dspy.Suggest with criteria for valid JSON fitting your schema, if any. This will retry if there are failures and has the capacity to optimize the prompt selection during compiling (see the sketch after this list). Read more here about DSPy assertions.
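
For the dspy.Suggest route, here's a minimal sketch of the idea. It is not a definitive recipe: the signature, module, and JSON check below are illustrative; only Predict, Suggest, and activate_assertions are DSPy API.

import json
import dspy

class SummaryWithJSON(dspy.Signature):
    question = dspy.InputField()
    summary = dspy.OutputField()
    json_output = dspy.OutputField(desc='key-value pairs as a JSON object')

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

class Summarize(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.Predict(SummaryWithJSON)

    def forward(self, question):
        response = self.generate(question=question)
        # If this suggestion fails, DSPy retries the call with the message as feedback.
        dspy.Suggest(is_valid_json(response.json_output),
                     "json_output must be valid JSON.")
        return response

program = Summarize().activate_assertions()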

That said, if your goal is explicitly to generate valid JSON, we are lacking solid guidance and tooling for that. I think Outlines has the right answers there (though I suspect we can optimize cost-wise) and could serve as a great backend for structured outputs in DSPy.

okhat avatar Jan 04 '24 16:01 okhat

Thanks Omar. I guess what would be cool is if I could define a schema, like

states_schema = """{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": {
        "type": "string",
        "description": "The name of the state"
      },
      "abbreviation": {
        "type": "string",
        "description": "The two-letter postal abbreviation for the state"
      },
      "capital": {
        "type": "string",
        "description": "The capital city of the state"
      }
    },
    "required": ["name", "abbreviation", "capital"]
  }
}"""

And then write

class ExampleSignature(dspy.Signature):
    question = dspy.InputField()
    summary = dspy.OutputField()
    states = dspy.OutputField(desc='list of states', schema=states_schema)

An alternative, which I don't know whether it would be easier or harder, would be Pydantic:

from typing import List
from pydantic import BaseModel, Field

class State(BaseModel):
    name: str = Field(..., description="The name of the state")
    abbreviation: str = Field(..., description="The two-letter postal abbreviation for the state")
    capital: str = Field(..., description="The capital city of the state")

class ExampleSignature(dspy.Signature):
    question = dspy.InputField()
    summary = dspy.OutputField()
    states = dspy.OutputField(desc='list of states', type=List[State])

But Pydantic support would follow directly from json schema support, since you can do

class States(BaseModel):
    states: List[State] = Field(..., description="List of states")

class ExampleSignature(dspy.Signature):
    question = dspy.InputField()
    summary = dspy.OutputField()
    states = dspy.OutputField(desc='list of states', schema=States.schema())

thomasahle avatar Feb 06 '24 18:02 thomasahle

This would be very helpful

AriMKatz avatar Feb 06 '24 18:02 AriMKatz

Here's an example of a DSPy program just putting the schema into the desc: https://gist.github.com/thomasahle/4f650afe7305601fc3e417dda7aecb3c

It works on this simple example; I'm not sure how far you can scale it up. The nice thing is that Pydantic provides useful error messages when the output is wrong, which can be fed back into DSPy assertions (see the sketch below).
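
Concretely, the pattern is roughly the following (a sketch of the idea, not the gist's exact code; the State model and string signature are illustrative, and this assumes Pydantic v2):

import dspy
from pydantic import BaseModel, ValidationError

class State(BaseModel):
    name: str
    abbreviation: str
    capital: str

class ExtractState(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.Predict("question -> state_json")

    def forward(self, question):
        result = self.predict(question=question)
        error = ""
        try:
            State.model_validate_json(result.state_json)
        except ValidationError as e:
            error = str(e)
        # Pydantic's error message becomes the feedback used on retry.
        dspy.Suggest(error == "", f"Fix the JSON so it validates against the schema: {error}")
        return result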

thomasahle avatar Feb 06 '24 18:02 thomasahle

I think this is such a common use case that an example of using DSPy together with Outlines or Instructor would be very useful.

j4k0bk avatar Feb 07 '24 05:02 j4k0bk

I tried Outlines with the OpenAI API, but it fails. I also had the same issue with Guidance and LMQL.

younes-io avatar Feb 11 '24 14:02 younes-io

@younes-io This is not a feature that's in DSPy yet.

If you want something that works right now, you could try the code I linked above, or this https://github.com/thomasahle/dspy/blob/main/examples/types.py

thomasahle avatar Feb 11 '24 15:02 thomasahle

@thomasahle I know it's not a DSPy feature; I was pointing out that Outlines does not provide JSON output due to OpenAI API limitations.

younes-io avatar Feb 11 '24 16:02 younes-io

Ah, right. Outlines mostly works with local models, since it is trying to do very tightly controlled generation. But ideally, of course, all libraries should allow the model layer to be flexible enough to take advantage of whichever API or model features are available to make DSPy as efficient as possible.

thomasahle avatar Feb 11 '24 20:02 thomasahle

I was able to get the Instructor library to partially work with DSPy (for a basic program, at least). The inspect history doesn't work, but this at least gets a structured response from OpenAI using Instructor and DSPy. It took creating a custom LM and then passing the Pydantic response_model as an arg when instantiating the predict module.

Colab notebook is here.
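
For reference, the Instructor half of that (independent of DSPy) looks roughly like this; the model name and response model are illustrative:

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Answer(BaseModel):
    summary: str
    keywords: list[str]

# instructor.patch wraps the OpenAI client so chat completions accept a response_model.
client = instructor.patch(OpenAI())

answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=Answer,  # Instructor validates the output and retries until it parses.
    messages=[{"role": "user", "content": "Summarize why the sky is blue."}],
)
print(answer.summary, answer.keywords)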

stantonius avatar Mar 04 '24 03:03 stantonius

Thanks Craig. For me this issue is actually solved by typed DSPy. @CyrusOfEden is working on adding deeper integration with OpenAI. He'd probably be interested in discussing the options!

thomasahle avatar Mar 04 '24 05:03 thomasahle

OK thanks @thomasahle. Could you comment on what you mean by typed dspy? I can't seem to find any reference to this. Thanks

stantonius avatar Mar 06 '24 12:03 stantonius

Sure, I mean using pydantic types like this: https://github.com/stanfordnlp/dspy?tab=readme-ov-file#5-pydantic-types
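
For anyone landing here later, a minimal sketch of that "Pydantic types" usage via dspy.TypedPredictor (the signature and fields are illustrative):

from typing import List

import dspy
from pydantic import BaseModel

class State(BaseModel):
    name: str
    abbreviation: str
    capital: str

class QuestionToStates(dspy.Signature):
    question: str = dspy.InputField()
    states: List[State] = dspy.OutputField(desc="list of states")

predictor = dspy.TypedPredictor(QuestionToStates)
result = predictor(question="What are the west-coast US states?")
# result.states is a List[State], parsed and validated by Pydantic (with retries on failure).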

thomasahle avatar Mar 06 '24 15:03 thomasahle

@thomasahle the current solution is not as strong as what @stantonius proposed.

  1. The current DSPy implementation gently asks the LLM to generate JSON, and retries multiple times if it fails. This generally only works for toy examples.
  2. The implementation proposed by stantonius relies on function calling or constrained generation via the Instructor package. This has stronger guarantees of generating good JSON.

How do you think we could add stantonius's solution to DSPy?

oulianov avatar May 14 '24 11:05 oulianov

A workaround that works for the OA and Llama families of models: append the format instructions to the list of items sent as context.

Code snippet:

self.parser = PydanticOutputParser(pydantic_object=AnswerFormat)  # Langchain way
self.instructions = self.parser.get_format_instructions()
self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
.......
prediction = self.generate_answer(context=context + [self.instructions], question=question)
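
The snippet leaves AnswerFormat and GenerateAnswer undefined; hypothetical definitions (purely illustrative, including the assumed Langchain import path) would be something like:

import dspy
from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser  # assumed import path

class AnswerFormat(BaseModel):
    answer: str = Field(..., description="The final answer")
    sources: list[str] = Field(default_factory=list, description="Supporting passages")

class GenerateAnswer(dspy.Signature):
    """Answer the question using the context, following any format instructions it contains."""
    context = dspy.InputField(desc="relevant passages plus JSON format instructions")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="JSON matching the requested format")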



milonbhattacharya avatar Jul 05 '24 04:07 milonbhattacharya

@oulianov Instructor also uses prompting and retrying. In general, validating a Pydantic schema is Turing-complete. Stuff like constrained generation only works on simple types / toy examples.

thomasahle avatar Jul 10 '24 23:07 thomasahle

@thomasahle I don't understand your point.

From my experience, DSPy DID NOT work on toy examples (invalid JSON, too many retries), while Instructor DID work on complex examples.

This is because Instructor's approach overall gives MORE guarantees of respecting the schema.

How do you think we can combine both approaches to maximize DSPy performance across all use cases?

oulianov avatar Jul 10 '24 23:07 oulianov