Qwen2.5-VL-3B-Instruct cannot infer a picture
HW: 1x Intel Arc A770
SW:
Ubuntu 22.04
Docker Image: intelanalytics/ipex-llm-serving-xpu:0.8.3-b21
Model: Qwen2.5-VL-3B-Instruct
Precision: fp8
Steps to reproduce the error:
- Start the docker container: `docker run -dit --rm --net=host --privileged --device=/dev/dri --name=CUE -v /home/shawn/:/llm/shawn/ -e no_proxy=localhost,127.0.0.1 -e http_proxy=$http_proxy -e https_proxy=$http_proxy --shm-size="16g" --entrypoint /bin/bash intelanalytics/ipex-llm-serving-xpu:0.8.3-b21`
- Run `bash start-vllm-service.sh` (script contents below) to start serving the model
- Enter the container: `docker exec -it CUE bash`
- Run the curl commands below
- Qwen2.5-VL-3B-Instruct can infer text, but cannot infer a picture (a quick file-visibility check is sketched right after this list)
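As a quick sanity check (paths assumed from the mount and file name above), the image can first be confirmed to be readable from inside the container before sending the request:

```bash
# Assumed paths: the host directory /home/shawn/ is mounted at /llm/shawn/ in the container.
# Confirm the image file exists and is readable inside the container named CUE.
docker exec -it CUE ls -l /llm/shawn/girl_2.jpg
```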
root@ARC770:/llm# curl http://localhost:8000/v1/models
{"object":"list","data":[{"id":"Qwen2.5-VL-3B-Instruct","object":"model","created":1755567553,"owned_by":"vllm","root":"/llm/shawn/models/Qwen/Qwen2.5-VL-3B-Instruct","parent":null,"max_model_len":2000,"permission":[{"id":"modelperm-231f1d293b3f4ebb9aaba20ed58670eb","object":"model_permission","created":1755567553,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
root@ARC770:/llm# curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2.5-VL-3B-Instruct",
        "messages": [
          {"role": "user", "content": "介绍下美国的首都"}
        ],
        "max_tokens": 50
      }'
{"id":"chatcmpl-584b3684fff6469fa69c59cef14f9f65","object":"chat.completion","created":1755567576,"model":"Qwen2.5-VL-3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"美国的首都是华盛顿特区,位于美国东北部地区。它是一座现代化的大都市,拥有许多重要的历史和文化景点。\n\n华盛顿特区是美国政府的中心,包括国会大厦、白宫、林肯纪念堂等著名建筑","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":24,"total_tokens":74,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null}
root@ARC770:/llm# curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2.5-VL-3B-Instruct",
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "图片里有什么"},
              {
                "type": "image_url",
                "image_url": {
                  "url": "file:/llm/shawn/girl_2.jpg"
                }
              }
            ]
          }
        ],
        "max_tokens": 100
      }'
{"id":"chatcmpl-57b466d3e2f5477e94bb50d10ef666df","object":"chat.completion","created":1755567613,"model":"Qwen2.5-VL-3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":415,"total_tokens":515,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}
root@ARC770:/llm#
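For comparison, the same request can be sent with the image embedded as a base64 data URL instead of a file: URL. This is only a cross-check sketch (image path and model name are taken from the report above, the prompt is paraphrased in English) to see whether local-media-path handling behaves differently from inline image data:

```bash
# Cross-check sketch: embed the image as a base64 data URL (paths assumed from the report above)
IMG_B64=$(base64 -w 0 /llm/shawn/girl_2.jpg)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
        \"model\": \"Qwen2.5-VL-3B-Instruct\",
        \"messages\": [{
          \"role\": \"user\",
          \"content\": [
            {\"type\": \"text\", \"text\": \"What is in the picture?\"},
            {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMG_B64}\"}}
          ]
        }],
        \"max_tokens\": 100
      }"
```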
start-vllm-service.sh
#!/bin/bash

MODEL_PATH=${MODEL_PATH:-"/llm/shawn/models/Qwen/Qwen2.5-VL-3B-Instruct"}
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen2.5-VL-3B-Instruct"}
TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE:-1}
MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS:-3000}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-2000}
LOAD_IN_LOW_BIT=${LOAD_IN_LOW_BIT:-"fp8"}
PORT=${PORT:-8000}

echo "Starting service with model: $MODEL_PATH"
echo "Served model name: $SERVED_MODEL_NAME"
echo "Tensor parallel size: $TENSOR_PARALLEL_SIZE"
echo "Max num sequences: $MAX_NUM_SEQS"
echo "Max num batched tokens: $MAX_NUM_BATCHED_TOKENS"
echo "Max model length: $MAX_MODEL_LEN"
echo "Load in low bit: $LOAD_IN_LOW_BIT"
echo "Port: $PORT"

export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export FI_PROVIDER=shm
export TORCH_LLM_ALLREDUCE=0
export CCL_WORKER_COUNT=2   # On BMG, set CCL_WORKER_COUNT=1; otherwise, internal-oneccl will not function properly
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
export VLLM_USE_V1=0        # Used to select between V0 and V1 engine
export IPEX_LLM_LOWBIT=$LOAD_IN_LOW_BIT   # Ensures low-bit info is used for MoE; otherwise, IPEX's default MoE will be used

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $SERVED_MODEL_NAME \
  --port $PORT \
  --model $MODEL_PATH \
  --trust-remote-code \
  --block-size 8 \
  --gpu-memory-utilization 0.90 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit $LOAD_IN_LOW_BIT \
  --max-model-len $MAX_MODEL_LEN \
  --max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
  --max-num-seqs $MAX_NUM_SEQS \
  --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
  --disable-async-output-proc \
  --distributed-executor-backend ray \
  --allowed-local-media-path /llm/shawn
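Since every setting in the script uses the `${VAR:-default}` pattern, the values can also be overridden from the environment when launching it. A minimal example, using the same values as the defaults above:

```bash
# Override the script's defaults via environment variables at launch time
MODEL_PATH=/llm/shawn/models/Qwen/Qwen2.5-VL-3B-Instruct \
LOAD_IN_LOW_BIT=fp8 \
MAX_MODEL_LEN=2000 \
bash start-vllm-service.sh
```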
You can use the image intelanalytics/ipex-llm-serving-xpu:0.8.3-b22 to test again. This problem does not occur on b22 because of the SDPA method update for Qwen2.5-VL.
Does intelanalytics/ipex-llm-serving-xpu:0.8.3-b22 support Qwen2.5-VL-32B (https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) and the AWQ variant (https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct-AWQ)? I have an issue in ollama (https://github.com/intel/ipex-llm/issues/13293) with this model.
> You can use the image intelanalytics/ipex-llm-serving-xpu:0.8.3-b22 to test again. This problem does not occur on b22 because of the SDPA method update for Qwen2.5-VL.

Yes, it works with the intelanalytics/ipex-llm-serving-xpu:0.8.3-b22 image.
What did you fix in b22 for this issue?