[Bug]: Different behavior with tool-use response parsing with streaming vs non-streaming when using max_tokens
Your current environment
The output of `python collect_env.py` was not provided.
Model Input Dumps
No response
🐛 Describe the bug
When I run the server with Mistral and auto tool support:

```
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --enable-auto-tool-choice --tool-call-parser mistral --chat-template tool_chat_template_mistral.jinja
```

and send the following streaming request to /v1/chat/completions with auto tool choice and a limit on max_tokens:

```json
{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"tool_choice": "auto",
"max_tokens": 16,
"stream": true,
"messages": [
{
"role": "user",
"content": "What is the weather like in California?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"description": "The city, e.g. San Francisco, CA",
"type": "string"
},
"unit": {
"enum": ["celsius", "fahrenheit"],
"type": "string"
}
},
"required": ["location"]
}
}
}
]
}
```
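For easier reproduction, here is a minimal sketch of the same streaming request using the `openai` Python client. It assumes the server started above is listening on http://localhost:8000 (vLLM's default port) and that any placeholder API key is accepted:

```python
# Reproduction sketch: streams the chat completion above and prints each
# chunk's delta together with its finish_reason.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "description": "The city, e.g. San Francisco, CA",
                    "type": "string",
                },
                "unit": {"enum": ["celsius", "fahrenheit"], "type": "string"},
            },
            "required": ["location"],
        },
    },
}]

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "What is the weather like in California?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=16,
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    choice = chunk.choices[0]
    # The tool-call delta arrives, but no chunk ever carries a finish_reason.
    print(choice.delta, "finish_reason:", choice.finish_reason)
```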
I get the following chunks in the response, which build a partial entry in tool_calls but never carry a finish_reason:

```
data: {"id":"chat-4f2c1c18d3ec41f68f7574b54467a4ed","object":"chat.completion.chunk","created":1727995345,"model":"mistralai/Mistral-7B-Instruct-v0.3","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-4f2c1c18d3ec41f68f7574b54467a4ed","object":"chat.completion.chunk","created":1727995345,"model":"mistralai/Mistral-7B-Instruct-v0.3","choices":[{"index":0,"delta":{"tool_calls":[{"id":"chatcmpl-tool-ff07dfb8bd304874b1ad9d1556282709","type":"function","index":0,"function":{"name":"get_current_weather"}}]},"logprobs":null,"finish_reason":null}]}
data: [DONE]
```
But if I send the same request without streaming, the response has a finish_reason, but the content is not parsed into tool calls:

```json
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "[TOOL_CALLS] [{\"name\": \"get_current_weather\", \"arguments\":",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "length",
"stop_reason": null
}
],
```
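For completeness, the non-streaming request that produced the excerpt above can be sent like this (a sketch reusing the `client` and `tools` objects from the streaming reproduction earlier):

```python
# Non-streaming variant of the same request (reuses `client` and `tools`
# from the streaming sketch above).
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "What is the weather like in California?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=16,
)

choice = response.choices[0]
print(choice.finish_reason)       # "length" -- reported, unlike the streaming case
print(choice.message.content)     # raw '[TOOL_CALLS] ...' text, left unparsed
print(choice.message.tool_calls)  # empty -- no tool call was extracted
```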
I understand that using max_tokens with tool use is an edge case, but I wonder whether the behavior and responses should be made more consistent here.
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.