
Qwen2.5-VL-3B-Instruct cannot infer a picture

Open shawn9977 opened this issue 4 months ago • 3 comments


HW: 1x Intel Arc A770

SW: Ubuntu 22.04
Docker image: intelanalytics/ipex-llm-serving-xpu:0.8.3-b21
Model: Qwen2.5-VL-3B-Instruct
Precision: fp8

Steps to reproduce the error:

  1. Start the docker container: docker run -dit --rm --net=host --privileged --device=/dev/dri --name=CUE -v /home/shawn/:/llm/shawn/ -e no_proxy=localhost,127.0.0.1 -e http_proxy=$http_proxy -e https_proxy=$http_proxy --shm-size="16g" --entrypoint /bin/bash intelanalytics/ipex-llm-serving-xpu:0.8.3-b21
  2. Enter the container: docker exec -it CUE bash
  3. Run bash start-vllm-service.sh to start serving the model.
  4. Run the curl commands below (collected into a single sketch after this list).
  5. Qwen2.5-VL-3B-Instruct can infer text, but cannot infer a picture.
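
For convenience, the same reproduction steps as shell commands (container name, paths, and image tag as above; proxy variables omitted):

```bash
# Host side: start the serving container (step 1).
docker run -dit --rm --net=host --privileged --device=/dev/dri --name=CUE \
  -v /home/shawn/:/llm/shawn/ --shm-size="16g" --entrypoint /bin/bash \
  intelanalytics/ipex-llm-serving-xpu:0.8.3-b21
# Host side: open a shell inside the container (step 2).
docker exec -it CUE bash
# Inside the container: start the vLLM service (step 3).
bash start-vllm-service.sh
```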

root@ARC770:/llm# curl http://localhost:8000/v1/models
{"object":"list","data":[{"id":"Qwen2.5-VL-3B-Instruct","object":"model","created":1755567553,"owned_by":"vllm","root":"/llm/shawn/models/Qwen/Qwen2.5-VL-3B-Instruct","parent":null,"max_model_len":2000,"permission":[{"id":"modelperm-231f1d293b3f4ebb9aaba20ed58670eb","object":"model_permission","created":1755567553,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
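
A quicker way to confirm the served model name, assuming jq is available in the container (not part of the original report):

```bash
# List only the served model IDs from the /v1/models endpoint.
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'
```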

root@ARC770:/llm# curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "Qwen2.5-VL-3B-Instruct", "messages": [ {"role": "user", "content": "Introduce the capital of the United States"} ], "max_tokens": 50 }'
{"id":"chatcmpl-584b3684fff6469fa69c59cef14f9f65","object":"chat.completion","created":1755567576,"model":"Qwen2.5-VL-3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The capital of the United States is Washington, D.C., located in the northeastern part of the country. It is a modern metropolis with many important historical and cultural attractions.\n\nWashington, D.C. is the center of the U.S. government, home to famous buildings such as the Capitol, the White House, and the Lincoln","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":24,"total_tokens":74,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null}

root@ARC770:/llm# curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "Qwen2.5-VL-3B-Instruct", "messages": [ { "role": "user", "content": [ {"type": "text", "text": "What is in the picture?"}, { "type": "image_url", "image_url": { "url": "file:/llm/shawn/girl_2.jpg" } } ] } ], "max_tokens": 100 }'
{"id":"chatcmpl-57b466d3e2f5477e94bb50d10ef666df","object":"chat.completion","created":1755567613,"model":"Qwen2.5-VL-3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":415,"total_tokens":515,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}


start-vllm-service.sh

```bash
#!/bin/bash
MODEL_PATH=${MODEL_PATH:-"/llm/shawn/models/Qwen/Qwen2.5-VL-3B-Instruct"}
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen2.5-VL-3B-Instruct"}
TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE:-1}

MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS:-3000}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-2000}
LOAD_IN_LOW_BIT=${LOAD_IN_LOW_BIT:-"fp8"}
PORT=${PORT:-8000}

echo "Starting service with model: $MODEL_PATH"
echo "Served model name: $SERVED_MODEL_NAME"
echo "Tensor parallel size: $TENSOR_PARALLEL_SIZE"
echo "Max num sequences: $MAX_NUM_SEQS"
echo "Max num batched tokens: $MAX_NUM_BATCHED_TOKENS"
echo "Max model length: $MAX_MODEL_LEN"
echo "Load in low bit: $LOAD_IN_LOW_BIT"
echo "Port: $PORT"

export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export FI_PROVIDER=shm
export TORCH_LLM_ALLREDUCE=0

export CCL_WORKER_COUNT=2  # On BMG, set CCL_WORKER_COUNT=1; otherwise, internal-oneccl will not function properly
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0

export VLLM_USE_V1=0                     # Used to select between V0 and V1 engine
export IPEX_LLM_LOWBIT=$LOAD_IN_LOW_BIT  # Ensures low-bit info is used for MoE; otherwise, IPEX's default MoE will be used

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $SERVED_MODEL_NAME \
  --port $PORT \
  --model $MODEL_PATH \
  --trust-remote-code \
  --block-size 8 \
  --gpu-memory-utilization 0.90 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit $LOAD_IN_LOW_BIT \
  --max-model-len $MAX_MODEL_LEN \
  --max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
  --max-num-seqs $MAX_NUM_SEQS \
  --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
  --disable-async-output-proc \
  --distributed-executor-backend ray \
  --allowed-local-media-path /llm/shawn
```
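
Since every setting in the script uses the ${VAR:-default} shell pattern, defaults can be overridden per invocation without editing the file; for example (the values here are illustrative, not from the report):

```bash
# Override selected defaults for a single run; unset variables keep the script's defaults.
MAX_MODEL_LEN=4096 MAX_NUM_BATCHED_TOKENS=4096 PORT=8001 \
  bash start-vllm-service.sh
```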

shawn9977 · Aug 19 '25 02:08

You can use image intelanalytics/ipex-llm-serving-xpu:0.8.3-b22 to test again. This problem does not occur on b22 because of the SDPA method update for Qwen2.5-VL.
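
Switching images amounts to pulling the b22 tag and re-running the container with the same flags as in the original report (proxy variables omitted here):

```bash
# Pull the updated image and restart the container (same flags as step 1 of the report).
docker pull intelanalytics/ipex-llm-serving-xpu:0.8.3-b22
docker run -dit --rm --net=host --privileged --device=/dev/dri --name=CUE \
  -v /home/shawn/:/llm/shawn/ --shm-size="16g" --entrypoint /bin/bash \
  intelanalytics/ipex-llm-serving-xpu:0.8.3-b22
```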

hzjane · Aug 19 '25 06:08

> intelanalytics/ipex-llm-serving-xpu 0.8.3-b22

Does it support Qwen2.5-VL-32B (https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) and the AWQ variant (https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct-AWQ)? I have an issue with this model in ollama (https://github.com/intel/ipex-llm/issues/13293).

savvadesogle · Aug 19 '25 06:08

> You can use image intelanalytics/ipex-llm-serving-xpu:0.8.3-b22 to test again. This problem does not occur on b22 because of the SDPA method update for Qwen2.5-VL.

Yes, it works with the intelanalytics/ipex-llm-serving-xpu:0.8.3-b22 image.

What did you fix in b22 for this issue?

shawn9977 · Aug 20 '25 01:08