ipex-llm
Running vLLM service benchmark (1x ARC770) with Qwen1.5-14B-Chat model failed (compression weight: SYM_INT4).
Environment:
Platform: 6548N + 1x ARC770
Docker Image:
serving script:
Error info:
1. Fails with compression weight SYM_INT4.
2. Tried the "gpu-memory-utilization" parameter from 0.65 to 0.95 in steps of 0.05; none of the values worked.
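For reference, a minimal sketch of how such a gpu-memory-utilization sweep might be scripted (sketch only; the wrapper name start_vllm.sh and the GPU_MEM_UTIL variable are assumptions, not part of the actual setup):
#!/bin/bash
# Sketch only: restart the serve script once per gpu-memory-utilization value.
# start_vllm.sh is a hypothetical wrapper around the serve command shown below
# that reads the utilization value from $GPU_MEM_UTIL.
for util in 0.65 0.70 0.75 0.80 0.85 0.90 0.95; do
    echo "Testing --gpu-memory-utilization $util"
    GPU_MEM_UTIL=$util bash start_vllm.sh
done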
Error log:
1. Serving side error log:
I couldn't reproduce this error. Did you encounter this issue when starting vLLM, or when running the benchmark?
Starting vLLM succeeds; the error is encountered when running the benchmark, not at startup.
Cannot reproduce. Steps used:
- start docker:
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1
export CONTAINER_NAME=junwang-vllm54-issue220
docker rm -f $CONTAINER_NAME
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--name=$CONTAINER_NAME \
-v /home/intel/LLM:/llm/models/ \
-v /home/intel/junwang:/workspace \
-e no_proxy=localhost,127.0.0.1 \
--shm-size="16g" \
$DOCKER_IMAGE
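After the container is up, the serve script below is run from inside it; a shell in the container can be obtained with standard docker usage (assuming the container started successfully):
docker exec -it $CONTAINER_NAME bash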
- start serve:
#!/bin/bash
model="/llm/models/Qwen1.5-14B-Chat/"
served_model_name="Qwen1.5-14B-Chat"
export no_proxy=localhost,127.0.0.1
source /opt/intel/oneapi/setvars.sh
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8001 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4096 \
--max-num-seqs 256 \
-tp 1
#-tp 2 #--enable-prefix-caching --enable-chunked-prefill #--tokenizer-pool-size 8 --swap-space 8
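Before running the benchmark, one way to confirm the server has finished loading the model is to poll the OpenAI-compatible /v1/models endpoint on the configured port; a minimal sketch, not part of the original setup:
#!/bin/bash
# Wait until the api_server on port 8001 accepts connections, then proceed.
until curl -s http://localhost:8001/v1/models > /dev/null; do
    echo "waiting for the vLLM server on port 8001 ..."
    sleep 5
done
echo "server is up"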
- curl script:
curl http://localhost:8001/v1/completions -H "Content-Type: application/json" -d '{
"model": "Qwen1.5-14B-Chat",
"prompt": "San Francisco is a",
"max_tokens": 128
}'
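The benchmark harness itself is not shown in this issue; as a rough stand-in, a hedged sketch of driving the server with a batch of concurrent completion requests via curl (the request count and prompt are arbitrary choices, not the actual benchmark):
#!/bin/bash
# Sketch only: fire 16 concurrent completion requests to approximate benchmark load.
for i in $(seq 1 16); do
    curl -s http://localhost:8001/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "Qwen1.5-14B-Chat", "prompt": "San Francisco is a", "max_tokens": 128}' \
        > /dev/null &
done
wait
echo "all requests completed"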
- result:
  - offline
  - online
Closed since there has been no update for a long time.