
Support Chat Mode

Open thomasahle opened this issue 3 months ago • 44 comments

Hopefully the new LM backend will allow us to make better use of models that are trained for "Chat". Below is a good example of how even good models like GPT-3.5 currently have trouble understanding the basic DSPy format: [Screenshot 2024-03-15 at 6 46 18 PM]

Right now we use chat mode as if it was completion mode. We send:

messages: [
{"from": "user", "message": "guidance, input0, output0, input1, output1, input2"}
]

And expect the agent to reply with

{"from": "agent", "message": "output2"}

A better use of the Chat APIs would be to send

messages: [
{"from": "system", "message": guidance},
{"from": "user", "message": input0},
{"from": "agent", "message": output0},
{"from": "user", "message": input1},
{"from": "agent", "message": output1},
{"from": "user", "message": input2},
]

That is, we simulate a previous chat, where the agent always replied with the output in the format we expect. This teaches the agent not to start its message with "OK! Let me get to it!" or to repeat the template as in the gpt-3.5 screenshot above.

Also, using the system message for the guidance should help avoid prompt injection attacks.
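For concreteness, here is a rough sketch of how the demos could be laid out as chat turns (plain Python, not existing DSPy API; demos here are just (input_text, output_text) pairs):

def demos_to_messages(guidance, demos, query):
    # Simulate a prior chat in which the assistant always replied in
    # exactly the format we expect.
    messages = [{"role": "system", "content": guidance}]
    for input_text, output_text in demos:
        messages.append({"role": "user", "content": input_text})
        messages.append({"role": "assistant", "content": output_text})
    # The real query goes last; the model's next turn should be just the output.
    messages.append({"role": "user", "content": query})
    return messages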

thomasahle avatar Mar 16 '24 01:03 thomasahle

Totally agreed but I'd love for this to be more data-driven.

Either: (a) meta prompt engineering for popular models + easy addition of new adapters for new LMs if needed, or (b) automatic exploration of a new LM on standard tasks to automatically establish the patterns that "work" for that LM.

Do you have a way that fixes the sql_query example you had?

okhat avatar Mar 16 '24 01:03 okhat

Also I wonder to what extent this behavior you saw is because "Follow the following format." does not explicitly say "Complete the unfilled fields in accordance with the following format." Basically the instruction is slightly misleading for chat models.

okhat avatar Mar 16 '24 01:03 okhat

My sense is that interleaving inputs/outputs as a default would be a footgun because I would assume all outputs depend on all inputs, and the LLM doesn’t have access to this.

Right now our focus is using LiteLLM for broad support + moving over all dsp code into DSPy.

I’d love to tackle something like this when we look at the current Template usage and how that’s currently responsible for going from Example => prompt, and offering users some more flexibility with how an LLM gets called with an example.

CyrusOfEden avatar Mar 16 '24 03:03 CyrusOfEden

@CyrusOfEden I'm not sure what you mean by "interleaving inputs/outputs". This is already how DSPy works, no?

I think you misunderstood what I mean by (input_i, output_i). I'm talking about an entire example/demo. Not two fields from the same demo.

thomasahle avatar Mar 17 '24 00:03 thomasahle

I feel the problem with this is, most of the time, the positioning of the user end-of-turn token. Say you have a prompt template that wraps the user message in [INST] and [/INST], like in mixtral. You would have:

[INST]
...

input1: ...
input2: ...
output:[/INST]

???

It seems to me that the model is led to believe that the user turn is done (and the user has written a complete template, albeit with an empty output). It would be more correct to say:

[INST]
...

input1: ...
input2: ...[/INST]

output: ???

explicitly leaving the output field to the assistant (it suffices to create a {"role": "assistant", "content":"output:"} message).
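In chat-payload terms, that amounts to something like (just sketching the message layout described above):

messages = [
    # The user turn ends right after the last input field...
    {"role": "user", "content": "...\n\ninput1: ...\ninput2: ..."},
    # ...and the assistant turn is prefilled with the output prefix,
    # so completing "output:" is clearly the model's job.
    {"role": "assistant", "content": "output:"},
]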

Also I wonder to what extent this behavior you saw is because "Follow the following format." does not explicitly say "Complete the unfilled fields in accordance with the following format." Basically the instruction is slightly misleading for chat models.

@okhat I have done this experiment by writing a mini-version of Predict myself, with the prompt you are suggesting. I still have the same problem @thomasahle demonstrated in his initial post. This reinforces for me the belief that the token position is to blame. The version I proposed works instead.

Totally agreed but I'd love for this to be more data-driven.

I don't know precisely what you have in mind, but it seems to me that fixing the semantics of the multi-turn user-assistant conversation is orthogonal to the concern of wording the prompt differently.

meditans avatar Mar 17 '24 06:03 meditans

@meditans I suppose mixtral is not a "chat model" but an "instruction model".

What Omar says about having the framework automatically find the best prompting would of course be great. But if Mixtral can be shown to work better in 90% of cases with @meditans' token placement, then I'd be more than happy to just have that built into the Mixtral LM class.

We may also note that others have thought about how best to do few-shot prompting with chat models, such as:

  • Langchain: https://python.langchain.com/docs/modules/model_io/prompts/few_shot_examples_chat
  • Stack overflow: https://stackoverflow.com/questions/77285102/how-to-format-a-few-shot-prompt-for-gpt4-chat-completion-api

thomasahle avatar Mar 17 '24 07:03 thomasahle

@thomasahle you are right, I am using a (local, quantized) chat finetune of mixtral, not baseline mixtral.

In fact, the langchain page you proposed is quite close to what I'm saying here (essentially the same thing).

meditans avatar Mar 17 '24 07:03 meditans

@CyrusOfEden I'm not sure what you mean by "interleaving inputs/outputs". This is already how DSPy works, no?

I think you misunderstood what I mean by (input_i, output_i).

I'm talking about an entire example/demo. Not two fields from the same demo.

I see now and this makes sense to me! I thought it was inputs/outputs not examples/demos :)

CyrusOfEden avatar Mar 17 '24 18:03 CyrusOfEden

+1 to the problem @thomasahle is describing. I am also seeing it on gemini-1.0-pro.

And +1 to @meditans , the root of the problem is that special tokens for conversational formatting are being added to the prompt without anyone really thinking about it.

I like @thomasahle's proposed solution of just formatting the few-shot examples in chat mode. The only downside I see is that it will no longer be possible to force the model to follow a specific prefix for the rationale. But this can probably be solved with some prompt engineering. Something like:

Follow the following format.

Question: ${question}
Rationale: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

Repeat the user's message verbatim, and then finish the example.

mitchellgordon95 avatar Mar 19 '24 01:03 mitchellgordon95

Regardless of whether we do meta prompting or not, we will need to update the LM interface and template class to support chat formatting as a special case, since most LLM providers do not expose which special tokens they use to do chat formatting and only allow it through the API. This could probably be done during the LiteLLM integration.

And since we're going to do that, I think it would be good to just put a default chat-style format that works ok for most models, while structuring the code in such a way that meta optimization can be added easily later. My intuition is that default prompts just need to be "good enough" to bootstrap a few good traces, and as long as that works people won't really care about how good the default prompt format is or care to optimize it for their particular model.

mitchellgordon95 avatar Mar 19 '24 01:03 mitchellgordon95

I think it would be good to just put a default chat-style format that works ok for most models

When you say "default chat-style format", what do you have in mind? I can't tell whether you're referring to the wording or to the structure of the payload that most API providers and local servers use.

meditans avatar Mar 19 '24 01:03 meditans

Also, regardless, could we leave an escape hatch for the user to provide a function that builds the arguments to send to the LLM? Then one could just use the default one or provide tweaks.

meditans avatar Mar 19 '24 02:03 meditans

Adding few-shot examples in chat turns will probably not fix the fact that most programs will need to bootstrap starting from zero-shot prompts. But major +1 to any exploration of how to get most chat models to reliably understand that we want them to complete the missing fields.

okhat avatar Mar 19 '24 13:03 okhat

Btw I suspect this is easy. It's not happening right now just because no one ever tried :D. We've been using the same template since 2022 before RLHF and chat models (i.e., since text-davinci-002). The DSPy optimizers help make this less urgent than it would be otherwise because most models learn to do things properly with compiling, but ideally zero-shot (unoptimized) usage works reliably too. That will lead to better optimization.

okhat avatar Mar 19 '24 13:03 okhat

@isaacbmiller This is a great self-contained exploration. We can do this for 3-4 diverse chat models?

okhat avatar Mar 19 '24 13:03 okhat

Just catching up on this. It may be helpful for folks to take a look at the new Template class; it should contain all the TemplateV2/TemplateV3 functionality.

Additionally, all functionality for generating a prompt and passing it to the LM is contained within the new Backends themselves. We've already got a TemplateBackend which should match current functionality, along with a JSON backend which returns the content as JSON directly.

We could always create a separate version of the Template which returns the Signature + Examples as a series of ChatML messages, which we then pass to the LiteLLM model instead of a prompt directly.

Currently to call the LMs we do this:

# Generate Example
example = Example(demos=demos, **kwargs)

# Initialize and call template
# prompt is generated as a string
template = Template(signature)
prompt = template(example)

# Pass through language model provided
result = self.lm(prompt=prompt, **config)

It would be pretty straightforward to do something like this instead. For the BaseLM abstraction we could make both messages and prompt optional, and ensure that one or the other, but not both, is provided.

# Generate Example
example = Example(demos=demos, **kwargs)

# Initialize and call template
# messages is generated as a [{"role": "...", "content": "..."}]
template = ChatTemplate(signature)
messages = template(example)

# Pass through language model provided
result = self.lm(messages=messages, **config)
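The exclusivity check itself could be as simple as this (a rough sketch of the shape, not the final implementation):

class BaseLM:
    def __call__(self, prompt=None, messages=None, **config):
        # Exactly one of `prompt` (a string) or `messages` (a list of
        # {"role": ..., "content": ...} dicts) must be provided.
        if (prompt is None) == (messages is None):
            raise ValueError("Provide exactly one of `prompt` or `messages`.")
        ...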

@CyrusOfEden and I have chatted about this in the past; not sure how we should separate out Templates vs Backends. Each Backend will need a Template of some kind to format prompts, but each Backend can leverage a variety of Templates, so it's not quite one-to-one.

We should be fairly close to landing the new Backend framework in main, and then I think this is a great next step.

KCaverly avatar Mar 19 '24 19:03 KCaverly

I think it's an interesting idea to support multiple different Templates. I assume the code you wrote would all be inside Predict, so the user never actually has to call self.lm(...). Maybe a Predict could have a template, similar to how it has a signature. Then we could even have a TemplateOptimizer that optimizes the template the same way SignatureOptimizer optimizes the signature.

Then your code would look like this:

template = self.template_type(self.signature)
messages = template(example)
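To make that concrete, a toy sketch of a Predict that owns a template type alongside its signature (the shapes here are made up for illustration, not the real classes):

class Predict:
    def __init__(self, signature, template_type):
        self.signature = signature
        self.template_type = template_type  # e.g. Template or ChatTemplate
        self.demos = []

    def __call__(self, lm, **inputs):
        # The template decides whether to render a single prompt string
        # or a list of chat messages for this example.
        template = self.template_type(self.signature)
        rendered = template({"demos": self.demos, **inputs})
        return lm(rendered)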

@CyrusOfEden and @KCaverly would this fit into the refactor?

thomasahle avatar Mar 19 '24 20:03 thomasahle

Some more examples of LMs being unable to understand the basic format:

  • claude-3-opus-20240229: [Screenshot 2024-03-20 at 4 27 03 PM]
  • gpt-4: [Screenshot 2024-03-20 at 4 27 12 PM]
  • gpt-3.5-turbo: [Screenshot 2024-03-20 at 4 27 08 PM]
  • gpt-3.5-turbo-instruct: [Screenshot 2024-03-20 at 4 30 10 PM]

thomasahle avatar Mar 20 '24 23:03 thomasahle

Relevant discussion: https://github.com/stanfordnlp/dspy/discussions/420

thomasnormal avatar Mar 21 '24 21:03 thomasnormal

I've been running into this problem as well using typed predictors.

@okhat I think your suggestion of substituting "Follow the following format." with "Complete the unfilled fields in accordance with the following format." if a chat model is used would go a long way in the short run, but in the long run an approach that adapts the prompting technique to each model would be ideal.

I'll note that the problem is especially bad with the TypedChainOfThought predictor, because of the way this one mixes structured output with unstructured 'Think it through step by step'. This leads the model to produce bits of unstructured text where DSPy expects a structured output.

conradlee avatar Mar 27 '24 06:03 conradlee

FWIW - the new backend system would allow you to provide your own templates and supports chat mode. If you have a fuller example you can share, I would be keen to test it out and see if there are any improvements.

KCaverly avatar Mar 27 '24 11:03 KCaverly

@KCaverly Is there a better way to pass a template than through a config option?

isaacbmiller avatar Mar 27 '24 13:03 isaacbmiller

I've been working on it here: #717.

So far, I'm passing it during generation. The backend has a default template argument that can be overridden in modules when the backend is called. This would allow us to either pass templates dynamically as the Module evolves, or set one at the Module level and pass it through, etc.

KCaverly avatar Mar 27 '24 13:03 KCaverly

That looks great. I will take an in-depth look later today.

Should I switch to building off of that branch?

isaacbmiller avatar Mar 27 '24 14:03 isaacbmiller

TIL you can "prefill" the responses from agents in both Claude and GPT: https://docs.anthropic.com/claude/docs/prefill-claudes-response

import anthropic

client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    api_key="my_api_key",
)
message = client.messages.create(
    model="claude-2.1",
    max_tokens=1000,
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": "Please extract the name, size, price, and color from this product description and output it within a JSON object.\n\n<description>The SmartHome Mini is a compact smart home assistant available in black or white for only $49.99. At just 5 inches wide, it lets you control lights, thermostats, and other connected devices via voice or app—no matter where you place it in your home. This affordable little hub brings convenient hands-free control to your smart devices.\n</description>"
        },
        {
            "role": "assistant",
            "content": "{"
        }
    ]
)
print(message.content)

This means we can still use "prefixes" when using the chat API.

Also, moving the first output-variable to the "agent side" is probably better than what we do now - putting it at the end of the user side. Similar to @meditans' comment about Mixtral.
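Concretely, the tail of the message list could look something like this (just a sketch; the rationale prefix is the usual DSPy-style one from ChainOfThought, not an existing template):

messages = [
    # ... system guidance and few-shot turns ...
    {"role": "user", "content": "Question: ..."},
    # Prefill the assistant turn with the first output prefix, so the model
    # continues from "Rationale:" instead of restating the template.
    {"role": "assistant", "content": "Rationale: Let's think step by step in order to"},
]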

Does this fit into your new template system @KCaverly?

thomasahle avatar Mar 28 '24 19:03 thomasahle

If you take a look at the JSONBackend, we do something very similar. For json mode models, we prompt the model to complete the json as an incomplete object, as opposed to rewriting it from scratch. Additionally all of the demo objects are also shown in the completed JSON format which hopefully helps enforce the appropriate schema as well.

KCaverly avatar Mar 28 '24 19:03 KCaverly

I think it's an interesting idea to support multiple different Templates.

I also think so, because I use a lot of models and every single one has a different template; the problem is compounded by DSPy. But the good news is that we can probably organise custom templates in a special .contrib folder, so that templates that naturally have to be written for any new model (or task!) can also be pushed upstream.

Josephrp avatar Mar 28 '24 19:03 Josephrp

If you take a look at the JSONBackend, we do something very similar. For json mode models, we prompt the model to complete the json as an incomplete object, as opposed to rewriting it from scratch. Additionally all of the demo objects are also shown in the completed JSON format which hopefully helps enforce the appropriate schema as well.

Where should I look? In https://github.com/KCaverly/dspy/blob/f9c1adf837f1384fca60ed71dd2f32db47969746/dspy/modeling/templates/json.py#L29 it seems like everything gets stuffed into the user message.

But this prefill trick should actually be used by the text template/backend too, I believe, if the backend API is chat.

thomasahle avatar Mar 28 '24 20:03 thomasahle

Everything is currently stuffed into one message, but instead of providing the question and asking for a JSON response, we send an incomplete JSON object and ask the model to complete it. Not quite prefilling, but kinda similar.
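Roughly, the single user message contains the completed demos plus a trailing, incomplete object for the model to finish. Something like this (a simplified illustration, not the exact JSONBackend output):

demo = '{"question": "What is 2 + 2?", "answer": "4"}'
incomplete = '{"question": "What is the capital of France?", '  # model fills in "answer"

messages = [{
    "role": "user",
    "content": (
        "Complete the JSON object, following the schema of the example.\n\n"
        + demo + "\n\n" + incomplete
    ),
}]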

KCaverly avatar Mar 28 '24 20:03 KCaverly

Sending an incomplete object can work well if the model understands it's supposed to complete it. Prefilling makes this easier for chat API models.

I'm not saying you always have to use prefilling, just asking if I'll be able to make a template that works this way?

I should probably pull your code and try it out 😁

thomasahle avatar Mar 29 '24 00:03 thomasahle