[Feature]: support tool and reasoning together
🚀 The feature, motivation and pitch
Currently, --enable-auto-tool-choice and --enable-reasoning can't be enabled together; trying to do so fails with the following error:
# vllm serve /Qwen/QwQ-32B/ --served-model-name QwQ-32B --gpu-memory-utilization 0.97 --tensor-parallel-size 8 --max-model-len 32768 --enable-reasoning --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes
INFO 03-07 18:14:44 [__init__.py:207] Automatically detected platform cuda.
Traceback (most recent call last):
File "/usr/local/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 70, in main
cmds[args.subparser].validate(args)
File "/usr/local/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 36, in validate
validate_parsed_serve_args(args)
File "/usr/local/lib/python3.12/site-packages/vllm/entrypoints/openai/cli_args.py", line 285, in validate_parsed_serve_args
raise TypeError(
TypeError: Error: --enable-auto-tool-choice and --enable-reasoning cannot be enabled at the same time
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
+1
QwQ-32B is a reasoning model that supports tool use
This would be a killer feature.
I have implemented a solution for the non-streaming case and will open a draft PR over the weekend after some more testing. Changes for the streaming version are still needed.
#14511 aims to fix the issue.
+1
Completed via #14511
QwQ-32B is a reasoning model that supports tool use
Tool calls are not always returned in the response, even though QwQ-32B is trained on Qwen2.5, which has tools in its chat template.
https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/tokenizer_config.json
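For reference, a quick way to confirm the chat template advertises tool support is to inspect it with transformers (a minimal sketch; it assumes transformers is installed and uses the base Qwen/QwQ-32B repo, so swap in the checkpoint you actually serve):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
template = tok.chat_template or ""
# Check whether the Qwen2.5-style template branches on a `tools` variable
# and whether it emits <tool_call> markers around tool calls.
print("references tools:", "tools" in template)
print("mentions <tool_call>:", "<tool_call>" in template)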
Completed via #14490
It didn't work for me with tool_choice="required". @WangErXiao, could you please take a look?
Completed via #14490
Sorry, it's https://github.com/vllm-project/vllm/pull/14511. Which error are you getting?
I am using the latest version, 0.8.4.
Sometimes I get a pydantic error, sometimes empty reasoning output, and sometimes endless generation (for a simple question like the one above).
Below I added steps to reproduce (changing only auto to required):
Does it work with auto?
Sometimes I get a pydantic error
Could you include it?
Does it work with auto?
Yes
Sometimes I get a pydantic error
For now, I couldn't reproduce that issue (it appears randomly; I guess the pydantic tool parser fails to parse the JSON when the model starts with reasoning output). But the other issues are reproducible:
Script to run the server:
vllm serve Qwen/QwQ-32B-AWQ \
--gpu-memory-utilization 0.95 \
--quantization awq_marlin \
--tool-call-parser hermes \
--port 8000 \
--enable-reasoning \
--reasoning-parser deepseek_r1
client:
from openai import OpenAI
client = OpenAI(base_url="https://077e-89-250-84-146.ngrok-free.app/v1", api_key="dummy")
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location", "unit"]
}
}
}]
response = client.chat.completions.create(
model=client.models.list().data[0].id,
messages=[{"role": "user", "content": "Which one is greater 3.11 or 3.9? And what's the weather like in San Francisco?"}],
tools=tools,
tool_choice="required"
)
print(response)
tool_call = response.choices[0].message.tool_calls[0].function
print(f"reasoning_content: {response.choices[0].message.reasoning_content}")
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")
Issue: either you get empty reasoning output (which leads to a bad-quality response), or the request does not complete in a reasonable time (it might take 10 minutes or more).
I was able to reproduce the pydantic parsing issue by adding --guided-decoding-backend outlines to the server command:
vllm serve Qwen/QwQ-32B-AWQ \
--gpu-memory-utilization 0.95 \
--quantization awq_marlin \
--tool-call-parser hermes \
--port 8000 \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--guided-decoding-backend outlines
{
"object": "error",
"message": "1 validation error for list[function-wrap[__log_extra_fields__()]]\n Invalid JSON: expected value at line 1 column 1 [type=json_invalid, input_value='Okay, let\\'s see. The us..., \"unit\": \"celsius\"}} ]', input_type=str]\n For further information visit https://errors.pydantic.dev/2.11/v/json_invalid",
"type": "BadRequestError",
"param": null,
"code": 400
}
Okay, let's see.
<think> tag is missing.
<think> tag is missing.
AFAIK the <think> tag is included in the HF tokenizer config (so we don't need to append it), and the code works fine with auto tool choice; only tool_choice='required' is causing problems.
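A quick way to check what the template actually does with the tag (a sketch, assuming transformers is installed; model ID taken from the repro above):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-AWQ")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
# If this prints True, the opening <think> is injected by the template itself,
# so the model's own output won't contain it and the reasoning parser has to
# account for that.
print(prompt.rstrip().endswith("<think>"))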
Which PR resolved https://github.com/vllm-project/vllm/issues/13375#issuecomment-2662713065?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Not stale. I fixed this issue in llama.cpp, but vLLM has the same issue:
https://github.com/vllm-project/vllm/blob/4fbd8bb597cf392b94def04a6955f22580356d76/vllm/entrypoints/openai/protocol.py#L712C9-L712C35
It's generating a JSON schema without allowing for the thinking tags (a minimal illustration is sketched below).
llama.cpp issue: https://github.com/ggml-org/llama.cpp/issues/15247
llama.cpp PR: https://github.com/ggml-org/llama.cpp/pull/15248
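To illustrate the failure mode in plain Python (a sketch of the symptom, not vLLM's actual code path): when tool_choice="required" constrains the output to a bare JSON array of tool calls, any reasoning text the model emits before that array makes the whole output invalid JSON, which is what surfaces as the json_invalid pydantic error above.
import json

# Hypothetical model output: reasoning prose followed by the tool-call array.
model_output = (
    "Okay, let's see. The user wants the weather... "
    '[{"name": "get_weather", "parameters": '
    '{"location": "San Francisco, CA", "unit": "celsius"}}]'
)

try:
    json.loads(model_output)
except json.JSONDecodeError as err:
    # Mirrors the "Invalid JSON: expected value at line 1 column 1" error above.
    print(f"Invalid JSON at line {err.lineno} column {err.colno}: {err.msg}")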