[Frontend] Implement Tool Calling with `tool_choice='required'`
[Draft]
This (Draft) PR will implement tool_choice='required' for the Tool Calling API, a requested feature for the OpenAI-compatible API server (#10526 and #13002).
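For context, here is a minimal client-side sketch of what this enables, assuming a vLLM OpenAI-compatible server running at `http://localhost:8000/v1`; the model name and `get_weather` tool are illustrative, not part of this PR:

```python
# Minimal sketch of the client side of tool_choice="required".
# Assumes a vLLM OpenAI-compatible server at http://localhost:8000/v1;
# the model and tool names here are purely illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# With tool_choice="required", the server must return at least one tool call
# rather than a plain-text answer.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Vienna?"}],
    tools=tools,
    tool_choice="required",
)
print(response.choices[0].message.tool_calls)
```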
Roadmap
1. Validation and Guided Decoding
- [x] Extend validation to support tool_choice='required'.
- [x] Implement structured decoding for tool_choice='required'.
2. Chat Completions
- [x] Implement/extend non-streaming chat completions.
- [x] Extend to streaming chat completions (see the client-side streaming sketch after this list).
3. Documentation
- [ ] Update documentation.
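To make the streaming item above concrete, here is a minimal sketch of consuming streamed tool calls on the client side, reusing the `client` and `tools` from the earlier sketch (model name again illustrative):

```python
# Sketch of the streaming case: tool-call arguments arrive as incremental
# deltas that the client concatenates per tool-call index.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model name
    messages=[{"role": "user", "content": "What's the weather in Vienna?"}],
    tools=tools,
    tool_choice="required",
    stream=True,
)

calls = {}  # tool-call index -> accumulated {"name", "arguments"}
for chunk in stream:
    if not chunk.choices:
        continue
    for tc in chunk.choices[0].delta.tool_calls or []:
        entry = calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            entry["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            entry["arguments"] += tc.function.arguments

# e.g. {0: {"name": "get_weather", "arguments": '{"city": "Vienna"}'}}
print(calls)
```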
I'm looking for feedback on the implementation roadmap, and collaboration is welcome!
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
Excited for this to get merged!
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @meffmadd.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Hi @mgoin, I finished the basic implementation last week and I am pinging you to have a look yourself or assign reviewers. Thanks! 🚀
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @meffmadd.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
The latest merge did not include the changes from #14511 for the required case. Should that probably be done in a new PR?
@simon-mo @DarkLight1337 @robertgshaw2-redhat pinging for review/feedback.
Hey guys, let's get this in?
Second this, it’s blocking downstream development
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @meffmadd.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Waiting for the merge, please 🙏
@NickLucche I trust your judgement, can you approve the PR if you think it's ready?
Taking another pass
Thanks for the feedback @NickLucche! I will add e2e tests ASAP!
The example still works for me locally. However, I am working on an M1 MacBook Pro at the moment, which means I am on the CPU backend, which in turn disables the V1 engine (`device type=cpu is not supported by the V1 Engine. Falling back to V0.`). Then I get the log message: `xgrammar does not support advanced JSON schema features like enums, patterns or numeric ranges. Falling back to use outlines instead.`
Can you try adding `--guided-decoding-backend outlines` to the serve command and/or setting `VLLM_USE_V1=0`? Does that fix it for you? We should definitely add this to the documentation as well!
So I can't use `--guided-decoding-backend` with V1, meaning this would only work with V0. This also needs to be specified in the e2e tests I mentioned, as they ought to just fail on V1.
This is a bit unfortunate though, as most efforts are focused on V1 right now; I was hoping we could land this on V1.
I am not too familiar with xgrammar; would you know how much work it would take to get this working with V1? cc @mgoin
Based on the documentation, alternative decoding backends are planned in V1: https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html#feature-model
For this implementation to work, the guided decoding backend needs to accept a JSON schema that uses anyOf to select the sub-schema for each of the functions in the input, so that the correct parameters are generated for it.
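To illustrate what I mean, here is a rough sketch of such a schema for two hypothetical functions (`get_weather`, `get_time`); the actual schema generated by this PR may differ:

```python
# Rough sketch of the kind of schema described above: the model must emit
# an array of tool calls, where each call matches one of the per-function
# sub-schemas via anyOf. Function names/parameters are illustrative only.
required_tools_schema = {
    "type": "array",
    "minItems": 1,
    "items": {
        "anyOf": [
            {  # sub-schema for a hypothetical get_weather function
                "type": "object",
                "properties": {
                    "name": {"const": "get_weather"},
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
                "required": ["name", "parameters"],
            },
            {  # sub-schema for a hypothetical get_time function
                "type": "object",
                "properties": {
                    "name": {"const": "get_time"},
                    "parameters": {
                        "type": "object",
                        "properties": {"timezone": {"type": "string"}},
                        "required": ["timezone"],
                    },
                },
                "required": ["name", "parameters"],
            },
        ]
    },
}
```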
Stop being nit picky about tests or your planned v0/v1 migration and merge this already!!!!!!!
Every day that this is not merged, vllm continues to be a laughing stock for any serious agent AI applications. AG2 continues to be a laughing stock for any serious agent AI applications (they only just merged support for this on their end 4 days ago).
It's insane that this wasn't literally priority 1 of an issue for your whole team for months.
Unblocked the tool calling tests; we can merge this if they all pass.
The new e2e test might fail, as I was just pushing my local changes before switching to my dev VM, where I can run/debug tests on GPU. Let's see...
Massive thank you to everyone involved!!! This is huge, and I will make a very positive post on LinkedIn soon about the new developments coming to vLLM (I didn't expect this to get merged within hours of me complaining!)
AI lives and dies off of the often unpaid or unrewarded labor of open source devs like yourselves. Please excuse my prior frustration; I am a researcher who relies on both vLLM and AG2, with stakeholders who have been waiting for local models to become useful for agents.
Giant thank you all!!!
I also got the error `The provided JSON schema contains features not supported by xgrammar` with Qwen2.5. Hopefully we can fix this soon.