
[Feature]: Consider parallel_tool_calls parameter at the API level

Open lucasalvarezlacasa opened this issue 1 year ago

🚀 The feature, motivation and pitch

Currently, there is a parallel_tool_calls field in the ChatCompletionRequest pydantic class. However, this field exists only for compatibility with OpenAI's API.

In other words, it is not used at all, according to both the documentation and the code:

# NOTE this will be ignored by VLLM -- the model determines the behavior
parallel_tool_calls: Optional[bool] = False

Would it be possible to implement the logic behind this field for the different model families? For instance, with llama3.1-8b-instruct, tool calling works, but the model returns all three tool calls in a single response instead of one at a time. This breaks compatibility with frameworks like LangGraph.

Here's an example request and response:

Request

{
  "messages": [
    {
      "content": "You are a helpful assistant tasked with performing arithmetic on a set of inputs.",
      "role": "system"
    },
    {
      "content": "Add 3 and 4. Multiply the output by 2. Divide the output by 5",
      "role": "user"
    }
  ],
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "stream": false,
  "n": 1,
  "temperature": 0.0,
  "max_tokens": 256,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "add",
        "description": "Adds a and b.",
        "parameters": {
          "properties": {
            "a": {
              "description": "first int",
              "type": "integer"
            },
            "b": {
              "description": "second int",
              "type": "integer"
            }
          },
          "required": ["a", "b"],
          "type": "object"
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "multiply",
        "description": "Multiply a and b.",
        "parameters": {
          "properties": {
            "a": {
              "description": "first int",
              "type": "integer"
            },
            "b": {
              "description": "second int",
              "type": "integer"
            }
          },
          "required": ["a", "b"],
          "type": "object"
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "divide",
        "description": "Divide a and b.",
        "parameters": {
          "properties": {
            "a": {
              "description": "first int",
              "type": "integer"
            },
            "b": {
              "description": "second int",
              "type": "integer"
            }
          },
          "required": ["a", "b"],
          "type": "object"
        }
      }
    }
  ],
  "parallel_tool_calls": false
}
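
For reference, a roughly equivalent call with the openai Python client might look like the sketch below. The base URL and API key are placeholders for a local vLLM server, and the make_tool helper is only there for brevity; it is not part of vLLM or of the original request.

from openai import OpenAI


def make_tool(name: str, description: str) -> dict:
    # All three tools share the same two-integer schema.
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "integer", "description": "first int"},
                    "b": {"type": "integer", "description": "second int"},
                },
                "required": ["a", "b"],
            },
        },
    }


# Placeholder base URL / API key for a local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant tasked with performing arithmetic on a set of inputs."},
        {"role": "user", "content": "Add 3 and 4. Multiply the output by 2. Divide the output by 5"},
    ],
    tools=[
        make_tool("add", "Adds a and b."),
        make_tool("multiply", "Multiply a and b."),
        make_tool("divide", "Divide a and b."),
    ],
    temperature=0.0,
    max_tokens=256,
    parallel_tool_calls=False,  # accepted by the API schema, but currently ignored by vLLM
)
print(response.choices[0].message.tool_calls)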

Response

{
  "ChatCompletion": {
    "id": "chat-32cb47446c5b471eba5c91be1755811e",
    "choices": [
      {
        "finish_reason": "tool_calls",
        "index": 0,
        "logprobs": null,
        "message": {
          "content": null,
          "refusal": null,
          "role": "assistant",
          "function_call": null,
          "tool_calls": [
            {
              "id": "chatcmpl-tool-f8c832f4a42445f899a229063004cae9",
              "function": {
                "arguments": '{"a": 3, "b": 4}',
                "name": "add"
              },
              "type": "function"
            },
            {
              "id": "chatcmpl-tool-4b44f70f7dde47d0820f8a3b9018b897",
              "function": {
                "arguments": '{"a": 7, "b": 2}',
                "name": "multiply"
              },
              "type": "function"
            },
            {
              "id": "chatcmpl-tool-d897bd7ecb4b44e59eb718aff21cbfa8",
              "function": {
                "arguments": '{"a": 14, "b": 5}',
                "name": "divide"
              },
              "type": "function"
            }
          ]
        },
        "stop_reason": 128008
      }
    ],
    "created": 1729149431,
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "chat.completion",
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
      "completion_tokens": 67,
      "prompt_tokens": 466,
      "total_tokens": 533,
      "completion_tokens_details": null,
      "prompt_tokens_details": null
    },
    "prompt_logprobs": null
  }
}

Even if I try to make a follow-up call to the model that includes all three tool calls at once, it fails with the following error:

BadRequestError: Error code: 400 - {'object': 'error', 'message': 'This model only supports single tool-calls at once!', 'type': 'BadRequestError', 'param': None, 'code': 400}

This error comes from the llama3_json chat template.
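
For illustration, the 400 is triggered simply by echoing the assistant turn with its three tool calls back to the server. Continuing the client sketch above (the tool results and the reuse of make_tool are illustrative, not part of the original report):

followup = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant tasked with performing arithmetic on a set of inputs."},
        {"role": "user", "content": "Add 3 and 4. Multiply the output by 2. Divide the output by 5"},
        # The assistant turn carrying all three tool calls from the response above.
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [tc.model_dump() for tc in response.choices[0].message.tool_calls],
        },
        # One tool result per tool call (values illustrative).
        {"role": "tool", "tool_call_id": response.choices[0].message.tool_calls[0].id, "content": "7"},
        {"role": "tool", "tool_call_id": response.choices[0].message.tool_calls[1].id, "content": "14"},
        {"role": "tool", "tool_call_id": response.choices[0].message.tool_calls[2].id, "content": "2.8"},
    ],
    tools=[
        make_tool("add", "Adds a and b."),
        make_tool("multiply", "Multiply a and b."),
        make_tool("divide", "Divide a and b."),
    ],
)
# -> openai.BadRequestError: 'This model only supports single tool-calls at once!'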

Thank you, team!

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

lucasalvarezlacasa avatar Oct 17 '24 07:10 lucasalvarezlacasa

+1

Ugo06 avatar Oct 18 '24 12:10 Ugo06

+2

frei-x avatar Oct 30 '24 11:10 frei-x

+3

Ritesh2910 avatar Oct 31 '24 17:10 Ritesh2910

+4

AmaleshV avatar Nov 06 '24 10:11 AmaleshV

+5

act-shinee avatar Nov 07 '24 03:11 act-shinee

+6

vipulgote1999 avatar Nov 08 '24 05:11 vipulgote1999

+7

Kev-Y-Huang avatar Nov 08 '24 23:11 Kev-Y-Huang

+8

This could really help smaller models that struggle with multiple tools

erdaltoprak avatar Nov 20 '24 21:11 erdaltoprak

+9

akshayb264 avatar Dec 17 '24 10:12 akshayb264

+10

weiminw avatar Dec 31 '24 03:12 weiminw

+11

LianxinGao avatar Feb 07 '25 09:02 LianxinGao

+12

shruthiR-fauna avatar Feb 20 '25 15:02 shruthiR-fauna

I am facing the same issue. Despite setting parallel_tool_calls=False, the Qwen 2.5 72B model makes parallel tool calls where I need to update the state variable in sequential steps. I want to filter a dataframe step by step, but the parallel tool calls make that difficult.

My current workaround:

from typing import Any, Literal


def use_tool_or_complex_filter_condition(state: dict[str, Any], messages_key: str = "messages") -> Literal["tools", "__end__"]:
    """Route to the tool node, keeping only the first tool call of the last AI message."""
    if isinstance(state, dict) and (messages := state.get(messages_key, [])):
        ai_message = messages[-1]
    else:
        raise ValueError(f"No messages found in input state to tool_edge: {state}")

    if hasattr(ai_message, "tool_calls") and len(ai_message.tool_calls) > 0:
        # Keep only the first tool call, both in the parsed list and in the raw kwargs.
        ai_message.additional_kwargs["tool_calls"] = [ai_message.additional_kwargs["tool_calls"][0]]
        ai_message.tool_calls = [ai_message.tool_calls[0]]
        return "tools"
    else:
        return "__end__"

Don't forget to include the message history between the AI and the tools, so the model can produce the next tool call without getting stuck in a loop.
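
As a usage sketch, this routing function plugs into a LangGraph conditional edge roughly as follows. The node names, the call_model node, and the arithmetic tools here are placeholders, not part of the original snippet:

from langgraph.graph import StateGraph, MessagesState
from langgraph.prebuilt import ToolNode

graph = StateGraph(MessagesState)
graph.add_node("agent", call_model)  # placeholder: your LLM-calling node
graph.add_node("tools", ToolNode([add, multiply, divide]))  # placeholder tool functions
graph.set_entry_point("agent")
# Route through the workaround so only the first tool call reaches the ToolNode.
graph.add_conditional_edges("agent", use_tool_or_complex_filter_condition)
graph.add_edge("tools", "agent")
app = graph.compile()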

venki-lfc avatar Mar 18 '25 09:03 venki-lfc

+13

zoltan-fedor avatar Apr 16 '25 21:04 zoltan-fedor

+14

haitwang-cloud avatar Apr 23 '25 06:04 haitwang-cloud

+15

volcano98 avatar Jun 04 '25 02:06 volcano98

+16

zacksiri avatar Jun 04 '25 10:06 zacksiri

+17

LukasHaas avatar Jun 08 '25 01:06 LukasHaas

+18

segtio avatar Jun 17 '25 17:06 segtio

+19

ramipellumbi avatar Aug 22 '25 20:08 ramipellumbi

There's a fairly easy way to approach this, which is to drop anything after the first tool call when sending the response back if parallel_tool_calls is False. It's a pretty small code change, but also a bit hacky, as we'll still spend the GPU cycles generating multiple tool calls just to drop everything past the first. I have a working prototype of this in manual testing, but I need to wire it into automated tests and make sure I've covered all the potential code paths.

Then there's a somewhat more invasive change, where we only ever generate a single tool call in the first place. I haven't prototyped that yet, but it doesn't look nearly as straightforward.

So, in the next few days, I'll open a PR to wire this in at the API level so that the parameter actually controls whether we return more than one tool call per request. As a later follow-up, I'll open an issue to track stopping generation after the first tool call, since that requires deeper changes.
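
Not the actual patch, but the "drop extras" idea sketched in Python; the function name and call site are illustrative rather than vLLM's real internals:

def enforce_single_tool_call(message, parallel_tool_calls: bool | None):
    # If the client sent parallel_tool_calls=False, surface only the first
    # generated tool call; the rest were still generated, just not returned.
    if parallel_tool_calls is False and message.tool_calls and len(message.tool_calls) > 1:
        message.tool_calls = message.tool_calls[:1]
    return message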

bbrowning avatar Oct 03 '25 21:10 bbrowning

Once #26233 merges, parallel_tool_calls=False will work with vLLM, which should make using vLLM with frameworks like LangGraph easier, since you'll be able to restrict any model to generating only a single tool call at a time.

bbrowning avatar Oct 06 '25 13:10 bbrowning