Qwen3-32B enable prefix caching error on tool call
Server error message:
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] hidden_states = self.self_attn(
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] return self._call_impl(*args, **kwargs)
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] return forward_call(*args, **kwargs)
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen3.py", line 144, in forward
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] attn_output = self.attn(q, k, v)
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] return self._call_impl(*args, **kwargs)
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] return forward_call(*args, **kwargs)
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/attention/layer.py", line 226, in forward
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] return self.impl.forward(self, query, key, value,
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/attention/backends/ipex_attn.py", line 772, in forward
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] assert output[:num_prefill_query_tokens].shape == out.shape
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=97049) ERROR 08-05 14:52:46 [worker_base.py:620] NameError: name 'num_prefill_query_tokens' is not defined
(WrapperWithLoadBit pid=97091) WARNING 08-05 14:52:17 [_logger.py:68] Pin memory is not supported on XPU. [repeated 2x across cluster]
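The failing line is the shape assertion in ipex_attn.py: on the code path taken for the second, prefix-cached request, the local variable num_prefill_query_tokens is never bound before the assert runs, so Python raises NameError instead of evaluating the assertion. Below is a minimal standalone sketch of that failure mode (hypothetical code, not the actual vLLM source):

def forward_sketch(prefix_cache_hit: bool):
    # Hypothetical simplification: the token count is only bound on the
    # branch taken when the prompt is NOT already in the prefix cache.
    if not prefix_cache_hit:
        num_prefill_query_tokens = 16
    out_shape = (16, 128)
    # On the cached path the name was never assigned, so this line raises
    # NameError: name 'num_prefill_query_tokens' is not defined.
    assert (num_prefill_query_tokens, 128) == out_shape

forward_sketch(prefix_cache_hit=False)  # first request: assertion passes
forward_sketch(prefix_cache_hit=True)   # second, identical request: NameError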
Steps to reproduce: Docker image: intelanalytics/ipex-llm-serving-xpu:0.8.3-b21
- vLLM server launch command:
#!/bin/bash
export ZE_AFFINITY_MASK=4,5,6,7
MODEL_PATH=${MODEL_PATH:-"/llm/models/Qwen3-32B/"}
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"ui-tars"}
TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE:-4}
MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS:-4096}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-4096}
LOAD_IN_LOW_BIT=${LOAD_IN_LOW_BIT:-"fp8"}
PORT=${PORT:-8007}
echo "Starting service with model: $MODEL_PATH"
echo "Served model name: $SERVED_MODEL_NAME"
echo "Tensor parallel size: $TENSOR_PARALLEL_SIZE"
echo "Max num sequences: $MAX_NUM_SEQS"
echo "Max num batched tokens: $MAX_NUM_BATCHED_TOKENS"
echo "Max model length: $MAX_MODEL_LEN"
echo "Load in low bit: $LOAD_IN_LOW_BIT"
echo "Port: $PORT"
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export FI_PROVIDER=shm
export TORCH_LLM_ALLREDUCE=0
export CCL_WORKER_COUNT=2 # On BMG, set CCL_WORKER_COUNT=1; otherwise, internal-oneccl will not function properly
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
# export CCL_DG2_USM=1 # Needed on Core to enable USM (Shared Memory GPUDirect). Xeon supports P2P and doesn't need this.
export VLLM_USE_V1=0 # Used to select between V0 and V1 engine
export IPEX_LLM_LOWBIT=$LOAD_IN_LOW_BIT # Ensures low-bit info is used for MoE; otherwise, IPEX's default MoE will be used
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $SERVED_MODEL_NAME \
--port $PORT \
--model $MODEL_PATH \
--trust-remote-code \
--block-size 8 \
--gpu-memory-utilization 0.95 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit $LOAD_IN_LOW_BIT \
--max-model-len $MAX_MODEL_LEN \
--max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
--max-num-seqs $MAX_NUM_SEQS \
--tensor-parallel-size $TENSOR_PARALLEL_SIZE \
--disable-async-output-proc \
--enable-prefix-caching \
--distributed-executor-backend ray \
--enable-auto-tool-choice --tool-call-parser hermes
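(Optional) Before running the tool-call client, you can confirm the server is up by listing the models exposed by the OpenAI-compatible endpoint. A minimal sketch, assuming the same host placeholder "xxxx" and port 8007 used in the client code below:

from openai import OpenAI

# Sanity check: list the models served by the endpoint.
# "xxxx" is a placeholder for the server host, as in the client code below.
check_client = OpenAI(api_key="EMPTY", base_url="http://xxxx:8007/v1")
print([m.id for m in check_client.models.list().data])  # expect ['ui-tars']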
- After the server starts, run the following client test code:
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://xxxx:8007/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
model_name = "ui-tars"
import json
def get_current_temperature(location: str, unit: str = "celsius"):
    """Get current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, and the unit in a dict
    """
    return {
        "temperature": 26.1,
        "location": location,
        "unit": unit,
    }

def get_temperature_date(location: str, date: str, unit: str = "celsius"):
    """Get temperature at a location and date.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        date: The date to get the temperature for, in the format "Year-Month-Day".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, the date and the unit in a dict
    """
    return {
        "temperature": 25.9,
        "location": location,
        "date": date,
        "unit": unit,
    }

def get_function_by_name(name):
    if name == "get_current_temperature":
        return get_current_temperature
    if name == "get_temperature_date":
        return get_temperature_date

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_current_temperature",
            "description": "Get current temperature at a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": 'The location to get the temperature for, in the format "City, State, Country".',
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": 'The unit to return the temperature in. Defaults to "celsius".',
                    },
                },
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_temperature_date",
            "description": "Get temperature at a location and date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": 'The location to get the temperature for, in the format "City, State, Country".',
                    },
                    "date": {
                        "type": "string",
                        "description": 'The date to get the temperature for, in the format "Year-Month-Day".',
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": 'The unit to return the temperature in. Defaults to "celsius".',
                    },
                },
                "required": ["location", "date"],
            },
        },
    },
]

MESSAGES = [
    {"role": "user", "content": "What's the temperature in San Francisco now? How about tomorrow? Current Date: 2024-09-30."},
]

tools = TOOLS
messages = MESSAGES

response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    tools=tools,
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "repetition_penalty": 1.05,
        # "chat_template_kwargs": {"enable_thinking": False}  # default to True
    },
)
print(response)
The first run of the client code completes without error; running the same client code a second time triggers the server-side error shown above.
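The two runs can also be combined into one script; a minimal sketch reusing client, model_name, messages, and tools from the client code above:

for attempt in (1, 2):
    # The first call populates the prefix cache; the second, identical call
    # reuses the cached prefix and triggers the server-side NameError above.
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        tools=tools,
        temperature=0.7,
        top_p=0.8,
        max_tokens=512,
        extra_body={"repetition_penalty": 1.05},
    )
    print(f"attempt {attempt}: {response.choices[0].message}")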
This issue is caused by an error in the prefix-caching code and is fixed by this PR.