[Feature]: Consider parallel_tool_calls parameter at the API level
🚀 The feature, motivation and pitch
Currently, there is a parallel_tool_calls field in the ChatCompletionRequest pydantic class. However, this field exists only for compatibility with OpenAI's API.
In other words, according to the documentation and the code, it is not used at all:
# NOTE this will be ignored by VLLM -- the model determines the behavior
parallel_tool_calls: Optional[bool] = False
Would it be possible to consider implementing the logic behind this field for different model families? For instance, tool calling works with llama3.1-8b-instruct, but the model ends up returning all three tool calls at once instead of one at a time. This breaks compatibility with frameworks like LangGraph.
Here's an example request and response:
Request
{
"messages": [
{
"content": "You are a helpful assistant tasked with performing arithmetic on a set of inputs.",
"role": "system"
},
{
"content": "Add 3 and 4. Multiply the output by 2. Divide the output by 5",
"role": "user"
}
],
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"stream": false,
"n": 1,
"temperature": 0.0,
"max_tokens": 256,
"tools": [
{
"type": "function",
"function": {
"name": "add",
"description": "Adds a and b.",
"parameters": {
"properties": {
"a": {
"description": "first int",
"type": "integer"
},
"b": {
"description": "second int",
"type": "integer"
}
},
"required": ["a", "b"],
"type": "object"
}
}
},
{
"type": "function",
"function": {
"name": "multiply",
"description": "Multiply a and b.",
"parameters": {
"properties": {
"a": {
"description": "first int",
"type": "integer"
},
"b": {
"description": "second int",
"type": "integer"
}
},
"required": ["a", "b"],
"type": "object"
}
}
},
{
"type": "function",
"function": {
"name": "divide",
"description": "Divide a and b.",
"parameters": {
"properties": {
"a": {
"description": "first int",
"type": "integer"
},
"b": {
"description": "second int",
"type": "integer"
}
},
"required": ["a", "b"],
"type": "object"
}
}
}
],
"parallel_tool_calls": false
}
Response
{
"ChatCompletion": {
"id": "chat-32cb47446c5b471eba5c91be1755811e",
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"logprobs": null,
"message": {
"content": null,
"refusal": null,
"role": "assistant",
"function_call": null,
"tool_calls": [
{
"id": "chatcmpl-tool-f8c832f4a42445f899a229063004cae9",
"function": {
"arguments": '{"a": 3, "b": 4}',
"name": "add"
},
"type": "function"
},
{
"id": "chatcmpl-tool-4b44f70f7dde47d0820f8a3b9018b897",
"function": {
"arguments": '{"a": 7, "b": 2}',
"name": "multiply"
},
"type": "function"
},
{
"id": "chatcmpl-tool-d897bd7ecb4b44e59eb718aff21cbfa8",
"function": {
"arguments": '{"a": 14, "b": 5}',
"name": "divide"
},
"type": "function"
}
]
},
"stop_reason": 128008
}
],
"created": 1729149431,
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": null,
"usage": {
"completion_tokens": 67,
"prompt_tokens": 466,
"total_tokens": 533,
"completion_tokens_details": null,
"prompt_tokens_details": null
},
"prompt_logprobs": null
}
}
Even if I wanted to make a follow-up call to the model with the three tool calls at the same time, it complains with this error:
BadRequestError: Error code: 400 - {'object': 'error', 'message': 'This model only supports single tool-calls at once!', 'type': 'BadRequestError', 'param': None, 'code': 400}
This comes from the llama3_json chat template.
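To illustrate, a follow-up request shaped roughly like the sketch below is what triggers that 400: the assistant turn echoes all three tool calls before the tool results. The IDs, result values, and previous_messages variable here are abbreviations for illustration, not taken from the actual trace:

# Illustrative follow-up payload; IDs and tool results are abbreviated.
followup_messages = previous_messages + [
    {
        "role": "assistant",
        "tool_calls": [
            {"id": "call_add", "type": "function",
             "function": {"name": "add", "arguments": '{"a": 3, "b": 4}'}},
            {"id": "call_multiply", "type": "function",
             "function": {"name": "multiply", "arguments": '{"a": 7, "b": 2}'}},
            {"id": "call_divide", "type": "function",
             "function": {"name": "divide", "arguments": '{"a": 14, "b": 5}'}},
        ],
    },
    {"role": "tool", "tool_call_id": "call_add", "content": "7"},
    {"role": "tool", "tool_call_id": "call_multiply", "content": "14"},
    {"role": "tool", "tool_call_id": "call_divide", "content": "2.8"},
]
# The llama3_json chat template rejects assistant messages that contain more than
# one tool call, which is what produces the 400 above.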
Thank you, team!
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
This could really help smaller models that struggle with multiple tools.
I am facing the same issue. Despite setting parallel_tool_calls=False, the Qwen 2.5 72B model makes parallel tool calls even though I want to update the state variable in sequential steps. I want to filter a dataframe step by step, but the parallel tool calls make that difficult.
My current workaround:
from typing import Any, Literal


def use_tool_or_complex_filter_condition(
    state: dict[str, Any], messages_key: str = "messages"
) -> Literal["tools", "__end__"]:
    if isinstance(state, dict) and (messages := state.get(messages_key, [])):
        ai_message = messages[-1]
    else:
        raise ValueError(f"No messages found in input state to tool_edge: {state}")
    if hasattr(ai_message, "tool_calls") and len(ai_message.tool_calls) > 0:
        # Keep only the first tool call, both on the message and in additional_kwargs
        ai_message.additional_kwargs["tool_calls"] = [ai_message.additional_kwargs["tool_calls"][0]]
        ai_message.tool_calls = [ai_message.tool_calls[0]]
        return "tools"
    else:
        return "__end__"
Don't forget to include the message history between the AI and the tools, so the model can produce the next tool call and avoid getting stuck in a loop.
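For completeness, here is a rough sketch of how that condition function could be wired into a LangGraph graph. The node names, the call_model stub, llm_with_tools, and the tools list below are placeholders for illustration, not part of the original workaround:

from langgraph.graph import MessagesState, StateGraph
from langgraph.prebuilt import ToolNode

def call_model(state: MessagesState) -> dict:
    # Placeholder node: invoke your tool-bound LLM and append its reply.
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}

builder = StateGraph(MessagesState)
builder.add_node("agent", call_model)
builder.add_node("tools", ToolNode(tools))  # `tools` is your list of tool functions
builder.set_entry_point("agent")
# The conditional edge runs the workaround above: it trims the AI message to a
# single tool call, then routes to "tools" or ends the run.
builder.add_conditional_edges("agent", use_tool_or_complex_filter_condition)
builder.add_edge("tools", "agent")
graph = builder.compile()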
There's a fairly easy way to approach this, which is to just drop anything after the first tool call when sending the response back if parallel_tool_calls is False. It's a pretty small code change but also a bit hacky, as we'll still spend the GPU cycles generating multiple tool calls just to drop everything past the first. I have a working prototype of this in manual testing, but I need to wire it into automated tests and make sure I've covered all the potential code paths here.
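For reference, a minimal sketch of that truncation could look like the following; the function and parameter names are illustrative, not the actual vLLM code:

def limit_tool_calls(tool_calls: list, parallel_tool_calls: bool) -> list:
    # Drop every tool call after the first when parallel calls are disabled.
    # Note the wasted work: the extra calls were still generated, just discarded.
    if not parallel_tool_calls and len(tool_calls) > 1:
        return tool_calls[:1]
    return tool_calls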
Then there's a more invasive change where we only generate a single tool call in the first place. I haven't prototyped that yet, and it doesn't look nearly as straightforward.
So, in the next few days, I'll open a PR to wire this in at the API level so that this parameter actually controls whether we generate more than one tool call per request. I'll also open an issue to track the optimization of actually stopping generation after the first tool call as a later follow-up, since that requires deeper changes.
Once #26233 merges, parallel_tool_calls=False will work with vLLM, and should make using vLLM with frameworks like LangGraph easier since you'll be able to restrict any model to only a single tool call generated at a time.
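Once that lands, client-side usage should be the standard OpenAI-style call; the base_url, API key, and tools variable below are placeholders for your own deployment:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Add 3 and 4. Multiply the output by 2."}],
    tools=tools,                # the same tool schemas as in the request above
    parallel_tool_calls=False,  # vLLM should now cap the response at one tool call
)
print(response.choices[0].message.tool_calls)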