[Feature]: Consider parallel_tool_calls parameter at the API level
🚀 The feature, motivation and pitch
Currently, there is a parallel_tool_calls field in the ChatCompletionRequest pydantic class. However, this field exists only for compatibility with OpenAI's API.
In other words, according to the documentation and the code, it is not used at all:
# NOTE this will be ignored by VLLM -- the model determines the behavior
parallel_tool_calls: Optional[bool] = False
Would it be possible to consider implementing the logic behind this field for different model families? For instance, tool calling works with llama3.1-8b-instruct, but the model ends up returning all three tool calls at once instead of one at a time. This breaks compatibility with frameworks like LangGraph.
Here's an example request and response:
Request
{
"messages": [
{
"content": "You are a helpful assistant tasked with performing arithmetic on a set of inputs.",
"role": "system"
},
{
"content": "Add 3 and 4. Multiply the output by 2. Divide the output by 5",
"role": "user"
}
],
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"stream": false,
"n": 1,
"temperature": 0.0,
"max_tokens": 256,
"tools": [
{
"type": "function",
"function": {
"name": "add",
"description": "Adds a and b.",
"parameters": {
"properties": {
"a": {
"description": "first int",
"type": "integer"
},
"b": {
"description": "second int",
"type": "integer"
}
},
"required": ["a", "b"],
"type": "object"
}
}
},
{
"type": "function",
"function": {
"name": "multiply",
"description": "Multiply a and b.",
"parameters": {
"properties": {
"a": {
"description": "first int",
"type": "integer"
},
"b": {
"description": "second int",
"type": "integer"
}
},
"required": ["a", "b"],
"type": "object"
}
}
},
{
"type": "function",
"function": {
"name": "divide",
"description": "Divide a and b.",
"parameters": {
"properties": {
"a": {
"description": "first int",
"type": "integer"
},
"b": {
"description": "second int",
"type": "integer"
}
},
"required": ["a", "b"],
"type": "object"
}
}
}
],
"parallel_tool_calls": false
}
Response
{
"ChatCompletion": {
"id": "chat-32cb47446c5b471eba5c91be1755811e",
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"logprobs": null,
"message": {
"content": null,
"refusal": null,
"role": "assistant",
"function_call": null,
"tool_calls": [
{
"id": "chatcmpl-tool-f8c832f4a42445f899a229063004cae9",
"function": {
"arguments": '{"a": 3, "b": 4}',
"name": "add"
},
"type": "function"
},
{
"id": "chatcmpl-tool-4b44f70f7dde47d0820f8a3b9018b897",
"function": {
"arguments": '{"a": 7, "b": 2}',
"name": "multiply"
},
"type": "function"
},
{
"id": "chatcmpl-tool-d897bd7ecb4b44e59eb718aff21cbfa8",
"function": {
"arguments": '{"a": 14, "b": 5}',
"name": "divide"
},
"type": "function"
}
]
},
"stop_reason": 128008
}
],
"created": 1729149431,
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": null,
"usage": {
"completion_tokens": 67,
"prompt_tokens": 466,
"total_tokens": 533,
"completion_tokens_details": null,
"prompt_tokens_details": null
},
"prompt_logprobs": null
}
}
Even if I wanted to make a follow-up call to the model with the three tool calls at the same time, it complains with this error:
BadRequestError: Error code: 400 - {'object': 'error', 'message': 'This model only supports single tool-calls at once!', 'type': 'BadRequestError', 'param': None, 'code': 400}
This comes from the llama3_json chat template.
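To illustrate, a follow-up request shaped roughly like the sketch below is what triggers that 400: the assistant turn echoes all three tool calls before the tool results. The IDs, result values, and previous_messages variable here are abbreviations for illustration, not taken from the actual trace:

# Illustrative follow-up payload; IDs and tool results are abbreviated.
followup_messages = previous_messages + [
    {
        "role": "assistant",
        "tool_calls": [
            {"id": "call_add", "type": "function",
             "function": {"name": "add", "arguments": '{"a": 3, "b": 4}'}},
            {"id": "call_multiply", "type": "function",
             "function": {"name": "multiply", "arguments": '{"a": 7, "b": 2}'}},
            {"id": "call_divide", "type": "function",
             "function": {"name": "divide", "arguments": '{"a": 14, "b": 5}'}},
        ],
    },
    {"role": "tool", "tool_call_id": "call_add", "content": "7"},
    {"role": "tool", "tool_call_id": "call_multiply", "content": "14"},
    {"role": "tool", "tool_call_id": "call_divide", "content": "2.8"},
]
# The llama3_json chat template rejects assistant messages that contain more than
# one tool call, which is what produces the 400 above.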
Thank you, team!
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
This could really help smaller models that struggle with multiple tools.
I am facing the same issue. Despite setting parallel_tool_calls=False, the Qwen 2.5 72B model makes parallel tool calls even though I want to update the state variable in sequential steps. I want to filter a dataframe step by step, but the parallel tool calls make that difficult.
My current workaround:
from typing import Any, Literal


def use_tool_or_complex_filter_condition(
    state: dict[str, Any], messages_key: str = "messages"
) -> Literal["tools", "__end__"]:
    if isinstance(state, dict) and (messages := state.get(messages_key, [])):
        ai_message = messages[-1]
    else:
        raise ValueError(f"No messages found in input state to tool_edge: {state}")
    if hasattr(ai_message, "tool_calls") and len(ai_message.tool_calls) > 0:
        # Keep only the first tool call, both on the message and in additional_kwargs
        ai_message.additional_kwargs["tool_calls"] = [ai_message.additional_kwargs["tool_calls"][0]]
        ai_message.tool_calls = [ai_message.tool_calls[0]]
        return "tools"
    else:
        return "__end__"
Don't forget to include the message history between the AI and the tools, so the model can produce the next tool call and avoid getting stuck in a loop.
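For completeness, here is a rough sketch of how that condition function could be wired into a LangGraph graph. The node names, the call_model stub, llm_with_tools, and the tools list below are placeholders for illustration, not part of the original workaround:

from langgraph.graph import MessagesState, StateGraph
from langgraph.prebuilt import ToolNode

def call_model(state: MessagesState) -> dict:
    # Placeholder node: invoke your tool-bound LLM and append its reply.
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}

builder = StateGraph(MessagesState)
builder.add_node("agent", call_model)
builder.add_node("tools", ToolNode(tools))  # `tools` is your list of tool functions
builder.set_entry_point("agent")
# The conditional edge runs the workaround above: it trims the AI message to a
# single tool call, then routes to "tools" or ends the run.
builder.add_conditional_edges("agent", use_tool_or_complex_filter_condition)
builder.add_edge("tools", "agent")
graph = builder.compile()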
There's a fairly easy way to approach this, which is to just drop anything after the first tool call when sending the response back if parallel_tool_calls is False. It's a pretty small code change but also a bit hacky, as we'll still spend the GPU cycles generating multiple tool calls just to drop everything past the first. I have a working prototype of this in manual testing, but I need to wire it into automated tests and make sure I've covered all the potential code paths here.
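For reference, a minimal sketch of that truncation could look like the following; the function and parameter names are illustrative, not the actual vLLM code:

def limit_tool_calls(tool_calls: list, parallel_tool_calls: bool) -> list:
    # Drop every tool call after the first when parallel calls are disabled.
    # Note the wasted work: the extra calls were still generated, just discarded.
    if not parallel_tool_calls and len(tool_calls) > 1:
        return tool_calls[:1]
    return tool_calls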
Then there's a more invasive change where we only generate a single tool call in the first place. I haven't prototyped that yet, and it doesn't look nearly as straightforward.
So, in the next few days, I'll open a PR to wire this in at the API level so that this parameter actually controls whether we generate more than one tool call per request. I'll also open an issue to track the optimization of actually stopping generation after the first tool call as a later follow-up, since that requires deeper changes.
Once #26233 merges, parallel_tool_calls=False will work with vLLM, and should make using vLLM with frameworks like LangGraph easier since you'll be able to restrict any model to only a single tool call generated at a time.
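Once that lands, client-side usage should be the standard OpenAI-style call; the base_url, API key, and tools variable below are placeholders for your own deployment:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Add 3 and 4. Multiply the output by 2."}],
    tools=tools,                # the same tool schemas as in the request above
    parallel_tool_calls=False,  # vLLM should now cap the response at one tool call
)
print(response.choices[0].message.tool_calls)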