
[Feature] Add support for Llama 3.1 tool use

Open maxdebayser opened this issue 1 year ago • 10 comments

This PR adds support for tool use in Llama 3.1 as documented here: https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling

Although the logic of the ToolParser for Llama is very similar to the existing ones, there are some significant differences in model behavior. For example, the Mistral models don't require a specific system prompt to elicit JSON-based tool calling, whereas Llama does. The Llama model also expects the tool output to be a JSON string. I therefore had to generalize the existing unit tests so that the test fixtures can be customized for Llama.
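
For context, the JSON-based format described in the linked Meta docs has the model emit a single JSON object naming the function and its arguments. A minimal illustrative sketch of detecting and extracting such a call (not the PR's actual ToolParser; the helper name is hypothetical):

```python
import json

# Minimal sketch: detect a JSON-based tool call of the form documented by
# Meta, {"name": ..., "parameters": {...}}, and extract its parts.
def parse_json_tool_call(model_output: str):
    try:
        call = json.loads(model_output.strip())
    except json.JSONDecodeError:
        return None  # not JSON; treat as a plain-text answer
    if isinstance(call, dict) and "name" in call and "parameters" in call:
        return call["name"], call["parameters"]
    return None

print(parse_json_tool_call('{"name": "get_weather", "parameters": {"city": "Paris"}}'))
# -> ('get_weather', {'city': 'Paris'})
```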

maxdebayser avatar Sep 10 '24 20:09 maxdebayser

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

github-actions[bot] avatar Sep 10 '24 20:09 github-actions[bot]

cc: @wseaton @njhill

maxdebayser avatar Sep 12 '24 17:09 maxdebayser

Actually @K-Mistele I just remembered this... sorry if you already started work on it!

DarkLight1337 avatar Sep 12 '24 17:09 DarkLight1337

Thanks for opening this @maxdebayser! I left comments with some initial thoughts. Maybe it would be good for us to connect and discuss the chat template and tool streaming. I found that tool call quality, and the model's willingness to respond to tool call results with a natural-language answer, are highly sensitive to the system prompt, part of which is included in the chat template.

It's fairly well-known that the default chat template provided by Meta in the docs doesn't really work at all for multi-turn conversations that involve tool calls; modifications are needed for best results. They really designed it for one-shot input -> tool call -> action, without interpretation of tool call results. Maybe I can open a draft so you can take a look at what I have so far too. I tried to design tool streaming to be more resilient at higher temperatures, e.g. to the model not emitting a <|python_tag|> at the beginning of the tool call, and I used a simpler, more robust chat template.

Then we can take the best parts from our respective implementations and end up with something that's even more robust.
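
For illustration, resilience to a missing <|python_tag|> could look something like this (a hypothetical helper, not the draft's actual streaming parser):

```python
import json

# Hypothetical sketch: accept tool calls with or without the <|python_tag|>
# prefix before attempting to parse the JSON body.
PYTHON_TAG = "<|python_tag|>"

def strip_python_tag(output: str) -> str:
    output = output.strip()
    if output.startswith(PYTHON_TAG):
        output = output[len(PYTHON_TAG):].lstrip()
    return output

print(json.loads(strip_python_tag('<|python_tag|>{"name": "f", "parameters": {}}')))
# -> {'name': 'f', 'parameters': {}}
```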

K-Mistele avatar Sep 12 '24 18:09 K-Mistele

Thanks for the review @K-Mistele, I agree that we probably need more tests. For example, with the Llama 3.1 70B model I've seen it do things like return an array in quotes, in other words an array serialized as a string instead of a proper array. The question is what the right place is to handle these model idiosyncrasies, given that they are influenced by both the prompt and the chat template, which can be set by the user.
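
For illustration, a hypothetical normalization for that quoted-array case might look like this (sketch only, not part of this PR):

```python
import json

# Hypothetical sketch: if the schema expects an array but the model returned
# it serialized as a string, try to decode the string in place.
def normalize_argument(value, expected_type: str):
    if expected_type == "array" and isinstance(value, str):
        try:
            decoded = json.loads(value)
            if isinstance(decoded, list):
                return decoded
        except json.JSONDecodeError:
            pass  # not valid JSON; leave the value as-is
    return value

print(normalize_argument('["a", "b"]', "array"))  # -> ['a', 'b']
```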

Perhaps we could say that if the user is using the default chat template provided by Meta and using prompts in the format that the Meta documentation recommends, then anything that the model generates should be supported. But the problem is that the official documentation is very short and superficial. Do you know where we could find more instruction examples for JSON tool calling?

maxdebayser avatar Sep 12 '24 19:09 maxdebayser

For example, with the Llama 3.1 70B model I've seen it do things like return an array in quotes, in other words an array serialized as a string instead of a proper array.

This probably indicates a chat template issue; I have found that in almost every case this is the root cause of malformed tool calls, whether it's Llama 3.1, Hermes 2, Hermes 3, or Mistral. Let me dig up the chat template I built.

Perhaps we could say that if the user is using the default chat template provided by Meta and using prompts in the format that the Meta documentation recommends, then anything that the model generates should be supported. But the problem is that the official documentation is very short and superficial. Do you know where we could find more instruction examples for JSON tool calling?

Generally I have tended to think about this in terms of supporting how the model is expected to work, and fixing the issue at the chat template level and with system prompts to make it work that way, instead of just trying to bake in support for dozens of idiosyncrasies and failure modes caused by chat template issues. Especially knowing that the "vanilla" chat template provided by Meta is (a) extraordinarily complicated, (b) well-known to simply not work for multi-turn conversations with tool usage due to the model's prompting in the template, and (c) full of typos.

K-Mistele avatar Sep 12 '24 19:09 K-Mistele

btw @maxdebayser this is the chat template I have built. It's no exaggeration to say I have spent 4+ hours simplifying it from the original, tuning the system prompt, and evaluating it for multi-turn tool calling.

I think you might find that (a) it's a lot simpler and easier to debug, since I removed all the cases that don't matter (e.g. built-in tools, and the option to provide the tool list in the first user message); (b) I added a system prompt that fixes the one-shot "fire and forget" tool-calling approach that Meta's system prompt encourages, where the model always wants to generate a tool call even after it has called a tool and been given results; and (c) it provides the tool list in the system prompt instead of the first user message, which seems to work better for multi-turn, multi-tool-call conversations.

https://gist.github.com/K-Mistele/820d142b4dab50bd8ef0c7bbcad4515c
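
For anyone who wants to try a template like this outside of vLLM, here's a sketch of applying a custom tool-calling template with Hugging Face transformers (the model id and template path are placeholders; the gist above is the actual template):

```python
from transformers import AutoTokenizer

# Sketch: render a conversation with a custom chat template, passing the
# tool list so the template can place it in the system prompt.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
custom_template = open("tool_chat_template_llama3.1.jinja").read()  # placeholder path

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
messages = [{"role": "user", "content": "What's the weather in Paris?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,                    # the template decides where these render
    chat_template=custom_template,  # overrides the model's bundled template
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```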

K-Mistele avatar Sep 12 '24 23:09 K-Mistele

Generally I have tended to think about this in terms of supporting how the model is expected to work, and fixing the issue at the chat template level and with system prompts to make it work that way, instead of just trying to bake in support for dozens of idiosyncrasies and failure modes caused by chat template issues. Especially knowing that the "vanilla" chat template provided by Meta is (a) extraordinarily complicated, (b) well-known to simply not work for multi-turn conversations with tool usage due to the model's prompting in the template, and (c) full of typos.

Do you think, then, that when the user passes the --tool-parser option we should automatically load the chat template that is found to work best, instead of the model's default template? Maybe we can open an issue for that and do it for all models.

But by "expected to work" do you mean work as documented by the model provider or by some other specification?

maxdebayser avatar Sep 13 '24 12:09 maxdebayser

By "expected to work" I mean the tool-calling syntax specified by the model provider, e.g. valid JSON or XML function calls. This is separate from the behavior actually produced by the prompts/templates that they provide, if those are broken (which is more common than most people think).

I don't think we should automatically load the chat template, since that might lead to unintended behavior if the user explicitly specifies their own chat template. But I did document the recommended configuration for each model in the docs page at docs/source/serving/openai_compatible_server.md; you can see it here: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#id1
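
For reference, a sketch of what that recommended configuration looks like in practice; the flag names and template path reflect later vLLM releases, so check the docs for your version:

```python
# Server side (shell), per the recommended per-model configuration:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#       --enable-auto-tool-choice \
#       --tool-call-parser llama3_json \
#       --chat-template examples/tool_chat_template_llama3.1_json.jinja

# Client side: a standard OpenAI-compatible tool call against that server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
print(response.choices[0].message.tool_calls)
```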

By the way, if you haven't updated that docs page for Llama 3.1 models, that might also be a good thing to do for this PR.

K-Mistele avatar Sep 13 '24 15:09 K-Mistele

I don't think we should automatically load the chat template since that might lead to unintended behavior if the user explicitly specifies their own chat template.

Right, what I meant is that we could load that template only if the user doesn't provide the --chat-template argument.
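
In other words, something like this hypothetical fallback (names are illustrative, not vLLM's actual code):

```python
# Hypothetical sketch of the proposed fallback: an explicit --chat-template
# always wins; otherwise pick the template known to work best for the
# selected tool parser, and fall back to the model's bundled template.
RECOMMENDED_TOOL_TEMPLATES = {  # illustrative mapping
    "llama3_json": "examples/tool_chat_template_llama3.1_json.jinja",
}

def resolve_chat_template(cli_chat_template, tool_parser):
    if cli_chat_template is not None:
        return cli_chat_template
    return RECOMMENDED_TOOL_TEMPLATES.get(tool_parser)  # None -> model default

print(resolve_chat_template(None, "llama3_json"))
# -> examples/tool_chat_template_llama3.1_json.jinja
```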

By the way, if you haven't updated that docs page for Llama 3.1 models, that might also be a good thing to do for this PR

Thanks, I've added it in the latest commit.

Update: @K-Mistele, I realized that GitHub formatted this comment in a strange way that showed only the quoted text, so you might not have seen my comments.

maxdebayser avatar Sep 13 '24 21:09 maxdebayser

btw @maxdebayser this is the chat template I have built. It's no exaggeration to say I have spent 4+ hours simplifying it from the original, tuning the system prompt, and evaluating it for multi-turn tool calling.

I think you might find that (a) it's a lot simpler and easier to debug, since I removed all the cases that don't matter (e.g. built-in tools, and the option to provide the tool list in the first user message); (b) I added a system prompt that fixes the one-shot "fire and forget" tool-calling approach that Meta's system prompt encourages, where the model always wants to generate a tool call even after it has called a tool and been given results; and (c) it provides the tool list in the system prompt instead of the first user message, which seems to work better for multi-turn, multi-tool-call conversations.

https://gist.github.com/K-Mistele/820d142b4dab50bd8ef0c7bbcad4515c

This template looks great! Have you found a template that gets the model to consistently generate <|python_tag|> at all?

aw632 avatar Sep 14 '24 23:09 aw632

@K-Mistele, I've moved most of the adapter logic from the test code into the chat template.

maxdebayser avatar Sep 25 '24 01:09 maxdebayser

@K-Mistele, I've moved most of the adapter logic from the test code into the chat template.

taking a look :)

K-Mistele avatar Sep 25 '24 15:09 K-Mistele

Thanks @maxdebayser @DarkLight1337!

njhill avatar Sep 25 '24 16:09 njhill

Nice catch re: 3.1 vs. 3.2 chat templates @maxdebayser

K-Mistele avatar Sep 26 '24 21:09 K-Mistele

When is the next release with Llama 3.1 and 3.2 tool use support scheduled?

deepakdeore2004 avatar Oct 14 '24 11:10 deepakdeore2004

@deepakdeore2004 There should be a release this week

mgoin avatar Oct 14 '24 18:10 mgoin