
[Bug] Tool Call Parser: with tool call support enabled, ChoiceDeltaToolCall is parsed incorrectly in stream mode

Open ExenVitor opened this issue 9 months ago • 4 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Main problem

Model: QwQ-32B-AWQ
Server launch command: lmdeploy serve api_server /models/QwQ-32B-AWQ --cache-max-entry-count 0.7 --max-batch-size 32 --model-format awq --tool-call-parser qwen --chat-template qwen2d5

When a request is made with stream mode enabled, the tool call part is parsed incorrectly. Using the test code provided below, different anomalies appear depending on the user prompt.


Scenario 1: a single tool call is expected and generation finishes normally, but the opening part of <tool_call>...</tool_call> is parsed as ordinary delta content

User Prompt: What's the weather like in Boston today?

Print Log:

==== Text ====
<think>
Okay, the user is asking about the weather in Boston today. Let me check the tools provided. There's a function called get_current_weather that requires city, state, and unit. The user mentioned Boston but didn't specify the state. Boston is in Massachusetts, so the state abbreviation is MA. They also didn't mention the unit, but since they're asking in a general context, maybe they want Fahrenheit as it's commonly used in the US. I should use the function with city: Boston, state: MA, and unit: fahrenheit. Wait, the unit's enum includes both celsius and fahrenheit. The user might prefer one, but since it's not specified, I'll default to the common unit for that region. Let me confirm the parameters again. The required fields are all there now. Alright, time to make the tool call.
</think>

<tool_call>
{"name": "get_current_weather", "arguments": {"city":
==== Tool Call ====
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-c9ZCzB9b9NikK5kjPBXpFG', function=ChoiceDeltaToolCallFunction(arguments='{"city": "', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-b3Dod7JdrjxAD988dEGUHD', function=ChoiceDeltaToolCallFunction(arguments='Boston', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-3BFSJf9bLyVY4WsQwAYFs6', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-QqGVq9WQVPCUDob6QGCCC2', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-eui6aECNphTCCYrfJWWxED', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-H6xpLbDLsFboPF2VAHhkeZ', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-LyjvddKd4jJTwc99ECJWiw', function=ChoiceDeltaToolCallFunction(arguments='", "state": "', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-mNJuviVE9aUr2BTyaLJZgP', function=ChoiceDeltaToolCallFunction(arguments='MA', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-gsx3gAhKUYtaCX3aUo27BL', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-NhAcfuf8DFH7RZpAoGgtyh', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-SteUs6tvKSYQaMNCTStcU2', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-2AjKS3pFasLaaNuMkVtFcX', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-boMkszCDr2AomnXBFbY9BR', function=ChoiceDeltaToolCallFunction(arguments='", "unit": "', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-eempeGFjTmNCws77CwoUYJ', function=ChoiceDeltaToolCallFunction(arguments='f', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-xzJPBn83KXzQVDCgLfFeLx', function=ChoiceDeltaToolCallFunction(arguments='ahrenheit', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-NinxatcmwJ4sJjhN4Gpsak', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-YCHPFKjgHMvnFcZb6SHEvw', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-dmemrJ2weW3kGx85SDLfr7', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')

As shown above,

<tool_call>
{"name": "get_current_weather", "arguments": {"city":

was incorrectly parsed into the content part. In addition, many of the ChoiceDeltaToolCall entries carry an empty-string arguments field, and concatenating all ChoiceDeltaToolCall arguments does not produce valid JSON.
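
For reference, reassembling the streamed fragments on the client confirms this. A minimal sketch, assuming tool_calls is the list of ChoiceDeltaToolCall objects collected by the reproduction script below:

import json

# Concatenate the streamed argument fragments in arrival order.
merged_arguments = "".join(tc.function.arguments or "" for tc in tool_calls)
print(merged_arguments)
# With the deltas above this yields:
#   {"city": "Boston", "state": "MA", "unit": "fahrenheit
# The closing '"}' never arrives, and no delta ever carries the function name,
# so the merged string cannot be parsed:
json.loads(merged_arguments)  # raises json.decoder.JSONDecodeError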


Scenario 2: multiple tool calls are expected; an exception is raised while receiving the ChoiceDeltaToolCall stream

User Prompt: What's the weather like in Boston and New York today?

Client raised exception:

ChoiceDelta(content=None, function_call=None, role='assistant', tool_calls=[ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-8nW5XPwPAJ6F65MYtm4egJ', function=ChoiceDeltaToolCallFunction(arguments='ahrenheit', name=None), type='function')], reasoning_content=None)
ChoiceDelta(content=None, function_call=None, role='assistant', tool_calls=[ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-nSqe5YAyuenm8pke7qtSky', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')], reasoning_content=None)
ChoiceDelta(content=None, function_call=None, role='assistant', tool_calls=[ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-iCZ7iECmDtktbQ7m3P9TxH', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')], reasoning_content=None)
ChoiceDelta(content=None, function_call=None, role='assistant', tool_calls=[ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-2EG9Zd9973eZQoJHoP8z4n', function=ChoiceDeltaToolCallFunction(arguments='', name=None), type='function')], reasoning_content=None)
....
File ~/work/comm_news_sentiment/.venv/lib/python3.10/site-packages/httpx/_client.py:126, in BoundSyncStream.__iter__(self)
    125 def __iter__(self) -> typing.Iterator[bytes]:
--> 126     for chunk in self._stream:
    127         yield chunk

File ~/work/comm_news_sentiment/.venv/lib/python3.10/site-packages/httpx/_transports/default.py:112, in ResponseStream.__iter__(self)
    111 def __iter__(self) -> typing.Iterator[bytes]:
--> 112     with map_httpcore_exceptions():
    113         for part in self._httpcore_stream:
    114             yield part

File /usr/lib/python3.10/contextlib.py:153, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    151     value = typ()
    152 try:
--> 153     self.gen.throw(typ, value, traceback)
    154 except StopIteration as exc:
    155     # Suppress StopIteration *unless* it's the same exception that
    156     # was passed to throw().  This prevents a StopIteration
    157     # raised inside the "with" statement from being suppressed.
    158     return exc is not value

File ~/work/comm_news_sentiment/.venv/lib/python3.10/site-packages/httpx/_transports/default.py:86, in map_httpcore_exceptions()
     83     raise
     85 message = str(exc)
---> 86 raise mapped_exc(message) from exc

RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)

Server raised exception:

....
lmdeploy_server_deploy-api_server-1     |     await self.app(scope, receive, send)
lmdeploy_server_deploy-api_server-1     |   File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
lmdeploy_server_deploy-api_server-1     |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
lmdeploy_server_deploy-api_server-1     |   File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
lmdeploy_server_deploy-api_server-1     |     raise exc
lmdeploy_server_deploy-api_server-1     |   File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
lmdeploy_server_deploy-api_server-1     |     await app(scope, receive, sender)
lmdeploy_server_deploy-api_server-1     |   File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
lmdeploy_server_deploy-api_server-1     |     await response(scope, receive, send)
lmdeploy_server_deploy-api_server-1     |   File "/opt/py3/lib/python3.10/site-packages/starlette/responses.py", line 262, in __call__
lmdeploy_server_deploy-api_server-1     |     with collapse_excgroups():
lmdeploy_server_deploy-api_server-1     |   File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
lmdeploy_server_deploy-api_server-1     |     self.gen.throw(typ, value, traceback)
lmdeploy_server_deploy-api_server-1     |   File "/opt/py3/lib/python3.10/site-packages/starlette/_utils.py", line 82, in collapse_excgroups
lmdeploy_server_deploy-api_server-1     |     raise exc
lmdeploy_server_deploy-api_server-1     |   File "/opt/py3/lib/python3.10/site-packages/starlette/responses.py", line 266, in wrap
lmdeploy_server_deploy-api_server-1     |     await func()
lmdeploy_server_deploy-api_server-1     |   File "/opt/py3/lib/python3.10/site-packages/starlette/responses.py", line 246, in stream_response
lmdeploy_server_deploy-api_server-1     |     async for chunk in self.body_iterator:
lmdeploy_server_deploy-api_server-1     |   File "/opt/lmdeploy/lmdeploy/serve/openai/api_server.py", line 449, in completion_stream_generator
lmdeploy_server_deploy-api_server-1     |     tool_delta = VariableInterface.tool_parser.extract_tool_calls_streaming(
lmdeploy_server_deploy-api_server-1     |   File "/opt/lmdeploy/lmdeploy/serve/openai/tool_parser/qwen2d5_parser.py", line 60, in extract_tool_calls_streaming
lmdeploy_server_deploy-api_server-1     |     text, action = new_delta.split(self.tool_start_token)
lmdeploy_server_deploy-api_server-1     | ValueError: too many values to unpack (expected 2)
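
The unpack error itself is easy to reproduce in isolation: str.split without a maxsplit returns one element per occurrence of the separator, so as soon as the buffered delta contains more than one <tool_call> token, two-element unpacking fails. A minimal sketch (the variable names only mirror the parser for illustration):

tool_start_token = "<tool_call>"

# Single start token: unpacking into two names works.
text, action = "some text<tool_call>{...}".split(tool_start_token)

# Two start tokens (the multi-tool-call case): split returns three parts.
new_delta = "<tool_call>{...}</tool_call>\n<tool_call>{...}"
try:
    text, action = new_delta.split(tool_start_token)
except ValueError as err:
    print(err)  # too many values to unpack (expected 2)

# Splitting at most once, e.g. new_delta.split(tool_start_token, 1), avoids the
# crash, though whether that alone yields correct parsing is a separate question.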

Disabling stream mode yields the correct response:

==== tool_calls ====
[
ChatCompletionMessageToolCall(id='chatcmpl-PrXJNNBoMxoojK5fesxYQ5', function=Function(arguments='{"city": "Boston", "state": "MA", "unit": "fahrenheit"}', name='get_current_weather'), type='function'), 
ChatCompletionMessageToolCall(id='chatcmpl-2ZHRtRE6KCMHARNis6h2x6', function=Function(arguments='{"city": "New York", "state": "NY", "unit": "fahrenheit"}', name='get_current_weather'), type='function')
]

Additional question:

Note also that ChoiceDeltaToolCall.id differs between chunks of the same tool call. Is this behavior a design flaw? By comparison, vllm and sglang both follow a recognizable pattern when assigning ids, so the client can keep a unique id per tool call while merging ChoiceDeltaToolCall chunks into a ChatCompletionMessageToolCall. For example:

vllm:

# tool call 1 begin:
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-c9ZCzB9b9NikK5kjPBXpFG', function=ChoiceDeltaToolCallFunction(arguments='', name="func_name"), type='function')
ChoiceDeltaToolCall(index=0, id='', function=ChoiceDeltaToolCallFunction(arguments='xxxxx', name=None), type='function')
...
# tool call 2 begin:
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-HfzZPEVfQgvgpSJccb8Vr8', function=ChoiceDeltaToolCallFunction(arguments='', name="func_name"), type='function')
ChoiceDeltaToolCall(index=0, id='', function=ChoiceDeltaToolCallFunction(arguments='xxxxx', name=None), type='function')
...

sglang:

# tool call 1 begin:
ChoiceDeltaToolCall(index=0, id='0', function=ChoiceDeltaToolCallFunction(arguments='', name="func_name"), type='function')
ChoiceDeltaToolCall(index=0, id='0', function=ChoiceDeltaToolCallFunction(arguments='xxxxx', name=None), type='function')
...
# tool call 2 begin:
ChoiceDeltaToolCall(index=0, id='1', function=ChoiceDeltaToolCallFunction(arguments='', name="func_name"), type='function')
ChoiceDeltaToolCall(index=0, id='1', function=ChoiceDeltaToolCallFunction(arguments='xxxxx', name=None), type='function')
...
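
With either scheme the client can use the id field to detect tool-call boundaries while merging deltas. A minimal sketch based only on the patterns above, not on any particular client library:

merged_calls = []  # each item: {"id": ..., "name": ..., "arguments": ""}

def merge_delta(tc):
    """Fold one ChoiceDeltaToolCall into merged_calls."""
    fn = tc.function
    # In both patterns the first chunk of a call carries a non-empty id
    # together with the function name; treat that as the start of a new call.
    if (tc.id and fn is not None and fn.name is not None) or not merged_calls:
        merged_calls.append({"id": tc.id, "name": fn.name if fn else None,
                             "arguments": ""})
    if fn is not None and fn.arguments:
        merged_calls[-1]["arguments"] += fn.arguments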

Reproduction

Streaming mode test:

from openai import OpenAI

client = OpenAI(api_key="None", base_url="http://192.168.16.11:23333/v1")
model_name = client.models.list().data[0].id

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city to find the weather for, e.g. 'San Francisco'",
                    },
                    "state": {
                        "type": "string",
                        "description": "the two-letter abbreviation for the state that the city is"
                        " in, e.g. 'CA' which would mean 'California'",
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit to fetch the temperature in",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["city", "state", "unit"],
            },
        },
    }
]

def get_messages():
    return [
        {
            "role": "user",
            "content": "What's the weather like in Boston today?",
        }
    ]

messages = get_messages()

response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.65,
    top_p=0.95,
    max_tokens=4096,
    stream=True,  # Enable streaming
    tools=tools,
)

texts = ""
reasoning = ""
tool_calls = []
name = ""
arguments = ""
for chunk in response_stream:
    if hasattr(chunk.choices[0].delta, "content") and chunk.choices[0].delta.content:
        texts += chunk.choices[0].delta.content
    if hasattr(chunk.choices[0].delta, "reasoning_content") and chunk.choices[0].delta.reasoning_content:
        reasoning += chunk.choices[0].delta.reasoning_content
    if chunk.choices[0].delta.tool_calls:
        tool_calls.append(chunk.choices[0].delta.tool_calls[0])
    print(chunk.choices[0].delta)
    
print("==== Reasoning ====")
print(reasoning)

print("==== Text ====")
print(texts)

print("==== Tool Call ====")
for tool_call in tool_calls:
    print(tool_call)

Non-streaming mode test:

# Non-streaming mode test
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.65,
    top_p=0.95,
    max_tokens=1024,
    stream=False,  # Non-streaming
    tools=tools,
)
print("Non-stream response:")
print(response_non_stream)
print("==== content ====")
print(response_non_stream.choices[0].message.content)
print("==== tool_calls ====")
print(response_non_stream.choices[0].message.tool_calls)

Environment

sys.platform: linux
Python: 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 4090 D
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.5.1+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 90.1  (built against CUDA 12.4)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.5.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.20.1+cu121
LMDeploy: 0.7.2.post1+81c815e
transformers: 4.49.0
gradio: 5.22.0
fastapi: 0.115.11
pydantic: 2.10.6
triton: 3.1.0
NVIDIA Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-7     0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback


ExenVitor avatar Apr 09 '25 10:04 ExenVitor

I ran into this problem too. With --tool-call-parser enabled, using QwQ-32B-AWQ or DeepSeek-R1-Distill-Qwen-32B-AWQ, streaming output in Cherry Studio keeps repeating endlessly; switching to a plain API call, the tool_call ends up inside content while function_call=None and tool_calls=None.

lkicesky avatar Apr 11 '25 13:04 lkicesky

+1, same problem here; it doesn't work with LobeChat.

XYZliang avatar Apr 16 '25 01:04 XYZliang

+1, same problem here; it doesn't work with LobeChat.

You can try this derivative project, which integrates lmdeploy, vllm, and sglang and has dedicated optimizations for tool calls: https://github.com/shell-nlp/gpt_server

shell-nlp avatar Apr 16 '25 09:04 shell-nlp

Same problem here.


LMDeploy: openmmlab/lmdeploy:v0.7.3-cu12

  • Launch arguments in the k8s YAML file
command: ["/bin/sh", "-c"]
args: ["lmdeploy serve api_server Qwen2.5-14B-Instruct --tp 4 --server-port 8000 --cache-max-entry-count 0.8 --log-level INFO --tool-call-parser qwen"]
  • Streaming: broken
  • Non-streaming: works correctly

Streaming (multiple tool calls)

The problem occurs while parsing the streaming response.

lmdeploy server error log

2025-04-24 08:31:31,401 - lmdeploy - ERROR - qwen2d5_parser.py:114 - INVARIANT - impossible to have arguments reset mid-arguments
[TM][INFO] ------------------------- step = 510 -------------------------
[TM][INFO] ------------------------- step = 520 -------------------------
[TM][INFO] ------------------------- step = 530 -------------------------
[TM][INFO] ------------------------- step = 540 -------------------------
[TM][INFO] ------------------------- step = 550 -------------------------
[TM][INFO] ------------------------- step = 560 -------------------------
[TM][INFO] ------------------------- step = 570 -------------------------
[TM][INFO] ------------------------- step = 580 -------------------------
[TM][INFO] ------------------------- step = 590 -------------------------
[TM][INFO] ------------------------- step = 600 -------------------------
[TM][INFO] ------------------------- step = 610 -------------------------
ERROR:    Exception in ASGI application
  + Exception Group Traceback (most recent call last):
  |   File "/opt/py3/lib/python3.10/site-packages/starlette/_utils.py", line 76, in collapse_excgroups
  |     yield
  |   File "/opt/py3/lib/python3.10/site-packages/starlette/responses.py", line 263, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/opt/py3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 772, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/opt/py3/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    |     result = await app(  # type: ignore[func-returns-value]
    |   File "/opt/py3/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    |     return await self.app(scope, receive, send)
    |   File "/opt/py3/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    |     await super().__call__(scope, receive, send)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/applications.py", line 112, in __call__
    |     await self.middleware_stack(scope, receive, send)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    |     raise exc
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    |     await self.app(scope, receive, _send)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    |     await self.app(scope, receive, send)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    |     raise exc
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    |     await app(scope, receive, sender)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 714, in __call__
    |     await self.middleware_stack(scope, receive, send)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 734, in app
    |     await route.handle(scope, receive, send)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    |     await self.app(scope, receive, send)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    |     raise exc
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    |     await app(scope, receive, sender)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    |     await response(scope, receive, send)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/responses.py", line 262, in __call__
    |     with collapse_excgroups():
    |   File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    |     self.gen.throw(typ, value, traceback)
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/_utils.py", line 82, in collapse_excgroups
    |     raise exc
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/responses.py", line 266, in wrap
    |     await func()
    |   File "/opt/py3/lib/python3.10/site-packages/starlette/responses.py", line 246, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/opt/lmdeploy/lmdeploy/serve/openai/api_server.py", line 450, in completion_stream_generator
    |     tool_delta = VariableInterface.tool_parser.extract_tool_calls_streaming(
    |   File "/opt/lmdeploy/lmdeploy/serve/openai/tool_parser/qwen2d5_parser.py", line 60, in extract_tool_calls_streaming
    |     text, action = new_delta.split(self.tool_start_token)
    | ValueError: too many values to unpack (expected 2)
    +------------------------------------

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/py3/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/py3/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 714, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 734, in app
    await route.handle(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    await response(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/responses.py", line 262, in __call__
    with collapse_excgroups():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/py3/lib/python3.10/site-packages/starlette/_utils.py", line 82, in collapse_excgroups
    raise exc
  File "/opt/py3/lib/python3.10/site-packages/starlette/responses.py", line 266, in wrap
    await func()
  File "/opt/py3/lib/python3.10/site-packages/starlette/responses.py", line 246, in stream_response
    async for chunk in self.body_iterator:
  File "/opt/lmdeploy/lmdeploy/serve/openai/api_server.py", line 450, in completion_stream_generator
    tool_delta = VariableInterface.tool_parser.extract_tool_calls_streaming(
  File "/opt/lmdeploy/lmdeploy/serve/openai/tool_parser/qwen2d5_parser.py", line 60, in extract_tool_calls_streaming
    text, action = new_delta.split(self.tool_start_token)

Non-streaming (multiple tool calls)

{
  "id": "4",
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": null,
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [
          {
            "id": "chatcmpl-vGVY8p8GNB7qZFepkriYz8",
            "function": {
              "arguments": "{\"location\": \"San Francisco, California, USA\"}",
              "name": "get_current_temperature"
            },
            "type": "function"
          },
          {
            "id": "chatcmpl-zQEfiSYAoJjwFZbBFmKGk9",
            "function": {
              "arguments": "{\"location\": \"San Francisco, California, USA\", \"date\": \"2024-10-01\"}",
              "name": "get_temperature_date"
            },
            "type": "function"
          }
        ],
        "reasoning_content": null
      }
    }
  ],
  "created": 1745483341,
  "model": "Qwen2.5-14B-Instruct",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 66,
    "prompt_tokens": 412,
    "total_tokens": 478,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  }
}

Test script

import json

from openai import OpenAI

def get_current_temperature(location: str, unit: str = "celsius"):
    """Get current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, and the unit in a dict
    """
    return {
        "temperature": 26.1,
        "location": location,
        "unit": unit,
    }


def get_temperature_date(location: str, date: str, unit: str = "celsius"):
    """Get temperature at a location and date.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        date: The date to get the temperature for, in the format "Year-Month-Day".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, the date and the unit in a dict
    """
    return {
        "temperature": 25.9,
        "location": location,
        "date": date,
        "unit": unit,
    }


def get_function_by_name(name):
    if name == "get_current_temperature":
        return get_current_temperature
    if name == "get_temperature_date":
        return get_temperature_date

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_current_temperature",
            "description": "Get current temperature at a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": 'The location to get the temperature for, in the format "City, State, Country".',
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": 'The unit to return the temperature in. Defaults to "celsius".',
                    },
                },
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_temperature_date",
            "description": "Get temperature at a location and date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": 'The location to get the temperature for, in the format "City, State, Country".',
                    },
                    "date": {
                        "type": "string",
                        "description": 'The date to get the temperature for, in the format "Year-Month-Day".',
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": 'The unit to return the temperature in. Defaults to "celsius".',
                    },
                },
                "required": ["location", "date"],
            },
        },
    },
]
MESSAGES = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30"},
    {"role": "user",  "content": "What's the temperature in San Francisco now? How about tomorrow?"},
]

tools = TOOLS
messages = MESSAGES[:]

openai_api_key = "EMPTY"
openai_api_base = "http://10.224.0.37:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model_name = "Qwen2.5-14B-Instruct"

stream = False  # set to True to exercise the streaming path
if stream:
    stream = client.chat.completions.create(
        model=model_name,
        messages=messages,
        tools=tools,
        temperature=0.7,
        top_p=0.8,
        max_tokens=512,
        extra_body={
            "repetition_penalty": 1.05,
        },
        stream=True
    )

    final_tool_calls = {}

    for chunk in stream:
        delta = chunk.choices[0].delta
        print(f">>> delta.tool_calls: {delta.tool_calls}")

else:
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        tools=tools,
        temperature=0.7,
        top_p=0.8,
        max_tokens=512,
        extra_body={
            "repetition_penalty": 1.05,
        }
    )
    print(json.dumps(response.model_dump(), indent=2, ensure_ascii=False))

simonwei97 avatar Apr 24 '25 08:04 simonwei97

Ran into the same problem, +1

warlockedward avatar May 26 '25 08:05 warlockedward

@RunningLeon Could you help me on this?

AllentDan avatar May 28 '25 07:05 AllentDan

Same problem here. As a workaround for the streaming case, I changed the parsing logic in qwen2d5_parser.py to only parse once </tool_call> is present, and made api_server simply continue when the tool message is None. That resolves the issue for now.
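
Roughly, that workaround amounts to buffering the streamed text and only parsing once the closing tag has arrived. A hypothetical sketch (illustrative only, not the actual patch):

import json

TOOL_START = "<tool_call>"
TOOL_END = "</tool_call>"

buffer = ""  # accumulated raw model output for the current response

def feed(delta_text):
    """Buffer streamed text; return a parsed tool call dict once a complete
    <tool_call>...</tool_call> block is available, otherwise None."""
    global buffer
    buffer += delta_text
    if TOOL_START in buffer and TOOL_END in buffer:
        start = buffer.index(TOOL_START) + len(TOOL_START)
        end = buffer.index(TOOL_END)
        block, buffer = buffer[start:end], buffer[end + len(TOOL_END):]
        return json.loads(block)  # {"name": ..., "arguments": {...}}
    return None  # api_server side can simply continue on None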

ywx217 avatar May 30 '25 09:05 ywx217

@ywx217 Hi, welcome PR to fix it.

RunningLeon avatar May 30 '25 12:05 RunningLeon

Hi, @RunningLeon

I just submitted a PR for this issue, please review.

ywx217 avatar Jun 05 '25 09:06 ywx217