ipex-llm
Running vLLM service benchmark (1x ARC770) with Qwen1.5-14B-Chat model failed (compression weight: SYM_INT4).
Environment:
Platform: 6548N + 1x ARC770
Docker Image:
serving script:
Error info:
1. Fails with compression weight SYM_INT4.
2. Tried the "gpu-memory-utilization" parameter from 0.65 to 0.95 in steps of 0.05; none of the values worked.
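For reference, a minimal sketch of how such a gpu-memory-utilization sweep might be scripted (sketch only; the wrapper name start_vllm.sh and the GPU_MEM_UTIL variable are assumptions, not part of the actual setup):
#!/bin/bash
# Sketch only: restart the serve script once per gpu-memory-utilization value.
# start_vllm.sh is a hypothetical wrapper around the serve command shown below
# that reads the utilization value from $GPU_MEM_UTIL.
for util in 0.65 0.70 0.75 0.80 0.85 0.90 0.95; do
    echo "Testing --gpu-memory-utilization $util"
    GPU_MEM_UTIL=$util bash start_vllm.sh
done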
Error log:
1. Serving side error log:
I couldn't reproduce this error. Did you encounter this issue when starting vLLM, or when running the benchmark?
Starting vLLM succeeds; the error is encountered when running the benchmark, not at startup.
Cannot reproduce. Steps used:
- start docker:
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1
export CONTAINER_NAME=junwang-vllm54-issue220
docker rm -f $CONTAINER_NAME
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--name=$CONTAINER_NAME \
-v /home/intel/LLM:/llm/models/ \
-v /home/intel/junwang:/workspace \
-e no_proxy=localhost,127.0.0.1 \
--shm-size="16g" \
$DOCKER_IMAGE
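After the container is up, the serve script below is run from inside it; a shell in the container can be obtained with standard docker usage (assuming the container started successfully):
docker exec -it $CONTAINER_NAME bash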
- start serve:
#!/bin/bash
model="/llm/models/Qwen1.5-14B-Chat/"
served_model_name="Qwen1.5-14B-Chat"
export no_proxy=localhost,127.0.0.1
source /opt/intel/oneapi/setvars.sh
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8001 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4096 \
--max-num-seqs 256 \
-tp 1
#-tp 2 #--enable-prefix-caching --enable-chunked-prefill #--tokenizer-pool-size 8 --swap-space 8
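Before running the benchmark, one way to confirm the server has finished loading the model is to poll the OpenAI-compatible /v1/models endpoint on the configured port; a minimal sketch, not part of the original setup:
#!/bin/bash
# Wait until the api_server on port 8001 accepts connections, then proceed.
until curl -s http://localhost:8001/v1/models > /dev/null; do
    echo "waiting for the vLLM server on port 8001 ..."
    sleep 5
done
echo "server is up"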
- curl script:
curl http://localhost:8001/v1/completions -H "Content-Type: application/json" -d '{
"model": "Qwen1.5-14B-Chat",
"prompt": "San Francisco is a",
"max_tokens": 128
}'
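The benchmark harness itself is not shown in this issue; as a rough stand-in, a hedged sketch of driving the server with a batch of concurrent completion requests via curl (the request count and prompt are arbitrary choices, not the actual benchmark):
#!/bin/bash
# Sketch only: fire 16 concurrent completion requests to approximate benchmark load.
for i in $(seq 1 16); do
    curl -s http://localhost:8001/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "Qwen1.5-14B-Chat", "prompt": "San Francisco is a", "max_tokens": 128}' \
        > /dev/null &
done
wait
echo "all requests completed"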
- result:
  - offline
  - online
Closed since there has been no update for a long time.