Qwen3-32B enable prefix caching error on tool call
Server error message:
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] hidden_states = self.self_attn(
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] return self._call_impl(*args, **kwargs)
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] return forward_call(*args, **kwargs)
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen3.py", line 144, in forward
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] attn_output = self.attn(q, k, v)
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] return self._call_impl(*args, **kwargs)
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] return forward_call(*args, **kwargs)
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/attention/layer.py", line 226, in forward
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] return self.impl.forward(self, query, key, value,
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/attention/backends/ipex_attn.py", line 772, in forward
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] assert output[:num_prefill_query_tokens].shape == out.shape
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] NameError: name 'num_prefill_query_tokens' is not defined
(WrapperWithLoadBit pid=97091) WARNING 08-05 14:52:17 [_logger.py:68] Pin memory is not supported on XPU. [repeated 2x across cluster]
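The failing line is the shape assertion in ipex_attn.py: on the code path taken for the second, prefix-cached request, the local variable num_prefill_query_tokens is never bound before the assert runs, so Python raises NameError instead of evaluating the assertion. Below is a minimal standalone sketch of that failure mode (hypothetical code, not the actual vLLM source):

def forward_sketch(prefix_cache_hit: bool):
    # Hypothetical simplification: the token count is only bound on the
    # branch taken when the prompt is NOT already in the prefix cache.
    if not prefix_cache_hit:
        num_prefill_query_tokens = 16
    out_shape = (16, 128)
    # On the cached path the name was never assigned, so this line raises
    # NameError: name 'num_prefill_query_tokens' is not defined.
    assert (num_prefill_query_tokens, 128) == out_shape

forward_sketch(prefix_cache_hit=False)  # first request: assertion passes
forward_sketch(prefix_cache_hit=True)   # second, identical request: NameError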
Steps to reproduce: Docker image: intelanalytics/ipex-llm-serving-xpu:0.8.3-b21
- vLLM server launch command:
#!/bin/bash
export ZE_AFFINITY_MASK=4,5,6,7
MODEL_PATH=${MODEL_PATH:-"/llm/models/Qwen3-32B/"}
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"ui-tars"}
TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE:-4}
MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS:-4096}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-4096}
LOAD_IN_LOW_BIT=${LOAD_IN_LOW_BIT:-"fp8"}
PORT=${PORT:-8007}
echo "Starting service with model: $MODEL_PATH"
echo "Served model name: $SERVED_MODEL_NAME"
echo "Tensor parallel size: $TENSOR_PARALLEL_SIZE"
echo "Max num sequences: $MAX_NUM_SEQS"
echo "Max num batched tokens: $MAX_NUM_BATCHED_TOKENS"
echo "Max model length: $MAX_MODEL_LEN"
echo "Load in low bit: $LOAD_IN_LOW_BIT"
echo "Port: $PORT"
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export FI_PROVIDER=shm
export TORCH_LLM_ALLREDUCE=0
export CCL_WORKER_COUNT=2 # On BMG, set CCL_WORKER_COUNT=1; otherwise, internal-oneccl will not function properly
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
# export CCL_DG2_USM=1 # Needed on Core to enable USM (Shared Memory GPUDirect). Xeon supports P2P and doesn't need this.
export VLLM_USE_V1=0 # Used to select between V0 and V1 engine
export IPEX_LLM_LOWBIT=$LOAD_IN_LOW_BIT # Ensures low-bit info is used for MoE; otherwise, IPEX's default MoE will be used
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $SERVED_MODEL_NAME \
--port $PORT \
--model $MODEL_PATH \
--trust-remote-code \
--block-size 8 \
--gpu-memory-utilization 0.95 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit $LOAD_IN_LOW_BIT \
--max-model-len $MAX_MODEL_LEN \
--max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
--max-num-seqs $MAX_NUM_SEQS \
--tensor-parallel-size $TENSOR_PARALLEL_SIZE \
--disable-async-output-proc \
--enable-prefix-caching \
--distributed-executor-backend ray \
--enable-auto-tool-choice --tool-call-parser hermes
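(Optional) Before running the tool-call client, you can confirm the server is up by listing the models exposed by the OpenAI-compatible endpoint. A minimal sketch, assuming the same host placeholder "xxxx" and port 8007 used in the client code below:

from openai import OpenAI

# Sanity check: list the models served by the endpoint.
# "xxxx" is a placeholder for the server host, as in the client code below.
check_client = OpenAI(api_key="EMPTY", base_url="http://xxxx:8007/v1")
print([m.id for m in check_client.models.list().data])  # expect ['ui-tars']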
- After the server starts, run the following client test code:
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://xxxx:8007/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
model_name = "ui-tars"
import json
def get_current_temperature(location: str, unit: str = "celsius"):
    """Get current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, and the unit in a dict
    """
    return {
        "temperature": 26.1,
        "location": location,
        "unit": unit,
    }

def get_temperature_date(location: str, date: str, unit: str = "celsius"):
    """Get temperature at a location and date.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        date: The date to get the temperature for, in the format "Year-Month-Day".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, the date and the unit in a dict
    """
    return {
        "temperature": 25.9,
        "location": location,
        "date": date,
        "unit": unit,
    }

def get_function_by_name(name):
    if name == "get_current_temperature":
        return get_current_temperature
    if name == "get_temperature_date":
        return get_temperature_date

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_current_temperature",
            "description": "Get current temperature at a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": 'The location to get the temperature for, in the format "City, State, Country".',
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": 'The unit to return the temperature in. Defaults to "celsius".',
                    },
                },
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_temperature_date",
            "description": "Get temperature at a location and date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": 'The location to get the temperature for, in the format "City, State, Country".',
                    },
                    "date": {
                        "type": "string",
                        "description": 'The date to get the temperature for, in the format "Year-Month-Day".',
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": 'The unit to return the temperature in. Defaults to "celsius".',
                    },
                },
                "required": ["location", "date"],
            },
        },
    },
]

MESSAGES = [
    {"role": "user", "content": "What's the temperature in San Francisco now? How about tomorrow? Current Date: 2024-09-30."},
]

tools = TOOLS
messages = MESSAGES

response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    tools=tools,
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "repetition_penalty": 1.05,
        # "chat_template_kwargs": {"enable_thinking": False}  # default to True
    },
)
print(response)
The first run of the client code completes without error; running the same client code a second time triggers the server-side error shown above.
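The two runs can also be combined into one script; a minimal sketch reusing client, model_name, messages, and tools from the client code above:

for attempt in (1, 2):
    # The first call populates the prefix cache; the second, identical call
    # reuses the cached prefix and triggers the server-side NameError above.
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        tools=tools,
        temperature=0.7,
        top_p=0.8,
        max_tokens=512,
        extra_body={"repetition_penalty": 1.05},
    )
    print(f"attempt {attempt}: {response.choices[0].message}")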
This issue is caused by an error in the prefix-caching code and is fixed by this PR.