[Bug]: Different behavior with tool-use response parsing with streaming vs non-streaming when using max_tokens
Your current environment
The output of `python collect_env.py` was not provided.
Model Input Dumps
No response
🐛 Describe the bug
When I run the server with Mistral and auto tool support:

```
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --enable-auto-tool-choice --tool-call-parser mistral --chat-template tool_chat_template_mistral.jinja
```

and send the following streaming request to /v1/chat/completions with auto tool choice and a limit on max_tokens:

```json
{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"tool_choice": "auto",
"max_tokens": 16,
"stream": true,
"messages": [
{
"role": "user",
"content": "What is the weather like in California?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"description": "The city, e.g. San Francisco, CA",
"type": "string"
},
"unit": {
"enum": ["celsius", "fahrenheit"],
"type": "string"
}
},
"required": ["location"]
}
}
}
]
}
```
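For easier reproduction, here is a minimal sketch of the same streaming request using the `openai` Python client. It assumes the server started above is listening on http://localhost:8000 (vLLM's default port) and that any placeholder API key is accepted:

```python
# Reproduction sketch: streams the chat completion above and prints each
# chunk's delta together with its finish_reason.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "description": "The city, e.g. San Francisco, CA",
                    "type": "string",
                },
                "unit": {"enum": ["celsius", "fahrenheit"], "type": "string"},
            },
            "required": ["location"],
        },
    },
}]

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "What is the weather like in California?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=16,
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    choice = chunk.choices[0]
    # The tool-call delta arrives, but no chunk ever carries a finish_reason.
    print(choice.delta, "finish_reason:", choice.finish_reason)
```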
I get the following chunks in the response, which build a partial entry in tool_calls but never carry a finish_reason:

```
data: {"id":"chat-4f2c1c18d3ec41f68f7574b54467a4ed","object":"chat.completion.chunk","created":1727995345,"model":"mistralai/Mistral-7B-Instruct-v0.3","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chat-4f2c1c18d3ec41f68f7574b54467a4ed","object":"chat.completion.chunk","created":1727995345,"model":"mistralai/Mistral-7B-Instruct-v0.3","choices":[{"index":0,"delta":{"tool_calls":[{"id":"chatcmpl-tool-ff07dfb8bd304874b1ad9d1556282709","type":"function","index":0,"function":{"name":"get_current_weather"}}]},"logprobs":null,"finish_reason":null}]}
data: [DONE]
```
But if I send the same request without streaming, the response has a finish_reason, but the content is not parsed into tool calls:

```json
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "[TOOL_CALLS] [{\"name\": \"get_current_weather\", \"arguments\":",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "length",
"stop_reason": null
}
],
```
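For completeness, the non-streaming request that produced the excerpt above can be sent like this (a sketch reusing the `client` and `tools` objects from the streaming reproduction earlier):

```python
# Non-streaming variant of the same request (reuses `client` and `tools`
# from the streaming sketch above).
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "What is the weather like in California?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=16,
)

choice = response.choices[0]
print(choice.finish_reason)       # "length" -- reported, unlike the streaming case
print(choice.message.content)     # raw '[TOOL_CALLS] ...' text, left unparsed
print(choice.message.tool_calls)  # empty -- no tool call was extracted
```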
I understand that using max_tokens with tool use is an edge case, but I wonder whether the behavior and responses should be made more consistent here.
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.