[Feature]: support tool and reasoning together
🚀 The feature, motivation and pitch
Currently, --enable-auto-tool-choice and --enable-reasoning can't be enabled together; trying to do so fails with the following error:
# vllm serve /Qwen/QwQ-32B/ --served-model-name QwQ-32B --gpu-memory-utilization 0.97 --tensor-parallel-size 8 --max-model-len 32768 --enable-reasoning --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes
INFO 03-07 18:14:44 [__init__.py:207] Automatically detected platform cuda.
Traceback (most recent call last):
File "/usr/local/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 70, in main
cmds[args.subparser].validate(args)
File "/usr/local/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 36, in validate
validate_parsed_serve_args(args)
File "/usr/local/lib/python3.12/site-packages/vllm/entrypoints/openai/cli_args.py", line 285, in validate_parsed_serve_args
raise TypeError(
TypeError: Error: --enable-auto-tool-choice and --enable-reasoning cannot be enabled at the same time
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
+1
QwQ-32B is a reasoning model that supports tool use
This would be a killer feature.
I have implemented a solution for the non-streaming case and will open a draft PR over the weekend after some more testing. Changes for the streaming version are still needed.
#14511 aims to fix the issue.
+1
Completed via #14511
QwQ-32B is a reasoning model that supports tool use
Tool calls are not always returned in the response, even though QwQ-32B is trained on Qwen2.5, which has tools in its chat template.
https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/tokenizer_config.json
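For reference, a quick way to confirm the chat template advertises tool support is to inspect it with transformers (a minimal sketch; it assumes transformers is installed and uses the base Qwen/QwQ-32B repo, so swap in the checkpoint you actually serve):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
template = tok.chat_template or ""
# Check whether the Qwen2.5-style template branches on a `tools` variable
# and whether it emits <tool_call> markers around tool calls.
print("references tools:", "tools" in template)
print("mentions <tool_call>:", "<tool_call>" in template)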
Completed via #14490
It didn't work for me with tool_choice="required". @WangErXiao, could you please take a look?
Completed via #14490
Sorry, it's https://github.com/vllm-project/vllm/pull/14511. Which error are you getting?
I am using the latest version, 0.8.4.
Sometimes I get a pydantic error, sometimes empty reasoning output, and sometimes endless generation (for a simple question like the one above).
Below I added steps to reproduce (changing only auto to required):
Does it work with auto?
Sometimes I get a pydantic error
Could you include it?
Does it work with auto?
Yes
Sometimes I get a pydantic error
For now, I couldn't reproduce that issue (it appears randomly; I guess the pydantic tool parser fails to parse the JSON when the model starts with reasoning output). But the other issues are reproducible:
Script to run the server:
vllm serve Qwen/QwQ-32B-AWQ \
--gpu-memory-utilization 0.95 \
--quantization awq_marlin \
--tool-call-parser hermes \
--port 8000 \
--enable-reasoning \
--reasoning-parser deepseek_r1
client:
from openai import OpenAI
client = OpenAI(base_url="https://077e-89-250-84-146.ngrok-free.app/v1", api_key="dummy")
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location", "unit"]
}
}
}]
response = client.chat.completions.create(
model=client.models.list().data[0].id,
messages=[{"role": "user", "content": "Which one is greater 3.11 or 3.9? And what's the weather like in San Francisco?"}],
tools=tools,
tool_choice="required"
)
print(response)
tool_call = response.choices[0].message.tool_calls[0].function
print(f"reasoning_content: {response.choices[0].message.reasoning_content}")
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")
Issue: either you get empty reasoning output (which leads to a bad-quality response), or the request does not complete in a reasonable time (it might take 10 minutes or more).
I was able to reproduce the pydantic parsing issue by adding --guided-decoding-backend outlines to the server command:
vllm serve Qwen/QwQ-32B-AWQ \
--gpu-memory-utilization 0.95 \
--quantization awq_marlin \
--tool-call-parser hermes \
--port 8000 \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--guided-decoding-backend outlines
{
"object": "error",
"message": "1 validation error for list[function-wrap[__log_extra_fields__()]]\n Invalid JSON: expected value at line 1 column 1 [type=json_invalid, input_value='Okay, let\\'s see. The us..., \"unit\": \"celsius\"}} ]', input_type=str]\n For further information visit https://errors.pydantic.dev/2.11/v/json_invalid",
"type": "BadRequestError",
"param": null,
"code": 400
}
Okay, let's see.
<think> tag is missing.
<think> tag is missing.
AFAIK the <think> tag is included in the HF tokenizer config (so we don't need to append it), and the code works fine with auto tool choice; only tool_choice='required' is causing problems.
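A quick way to check what the template actually does with the tag (a sketch, assuming transformers is installed; model ID taken from the repro above):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-AWQ")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
# If this prints True, the opening <think> is injected by the template itself,
# so the model's own output won't contain it and the reasoning parser has to
# account for that.
print(prompt.rstrip().endswith("<think>"))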
Which PR resolved https://github.com/vllm-project/vllm/issues/13375#issuecomment-2662713065?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Not stale. I fixed this issue in llama.cpp, but vLLM has the same issue:
https://github.com/vllm-project/vllm/blob/4fbd8bb597cf392b94def04a6955f22580356d76/vllm/entrypoints/openai/protocol.py#L712C9-L712C35
It's generating a JSON schema without allowing for the thinking tags (a minimal illustration is sketched below).
llama.cpp issue: https://github.com/ggml-org/llama.cpp/issues/15247
llama.cpp PR: https://github.com/ggml-org/llama.cpp/pull/15248
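To illustrate the failure mode in plain Python (a sketch of the symptom, not vLLM's actual code path): when tool_choice="required" constrains the output to a bare JSON array of tool calls, any reasoning text the model emits before that array makes the whole output invalid JSON, which is what surfaces as the json_invalid pydantic error above.
import json

# Hypothetical model output: reasoning prose followed by the tool-call array.
model_output = (
    "Okay, let's see. The user wants the weather... "
    '[{"name": "get_weather", "parameters": '
    '{"location": "San Francisco, CA", "unit": "celsius"}}]'
)

try:
    json.loads(model_output)
except json.JSONDecodeError as err:
    # Mirrors the "Invalid JSON: expected value at line 1 column 1" error above.
    print(f"Invalid JSON at line {err.lineno} column {err.colno}: {err.msg}")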