Wang, Jian4

44 comments of Wang, Jian4

This issue is caused by a prefix-caching code error, and it's fixed by this [pr](https://github.com/analytics-zoo/vLLM-ARC-X/pull/17/files).

You can use the image `intelanalytics/ipex-llm-serving-xpu:0.8.3-b22` to test again. This problem does not occur on b22 because of the SDPA method update for Qwen2.5-VL.
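
For reference, a minimal sketch of pulling and starting that image; the device mapping, model mount path, and container name below are illustrative assumptions, not part of the original instructions:

```bash
# Pull the b22 image and start a container on an XPU host
docker pull intelanalytics/ipex-llm-serving-xpu:0.8.3-b22
docker run -itd \
  --net=host \
  --device=/dev/dri \
  -v /path/to/models:/llm/models \
  --name ipex-llm-serving-xpu-b22 \
  intelanalytics/ipex-llm-serving-xpu:0.8.3-b22
```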

There are indeed some problems with `asym_int4` when running Llama-70B; why not use `sym_int4` or `woq_int4` instead?
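
If you want to try a different precision, here is a minimal sketch assuming the ipex-llm vLLM entrypoint and its `--load-in-low-bit` option; the model path, served name, and port are placeholders:

```bash
# Serve Llama-70B with symmetric INT4 instead of asym_int4
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --model /llm/models/Llama-70B \
  --served-model-name Llama-70B \
  --load-in-low-bit sym_int4 \
  --port 8000
```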

@buffliu You can add `--allowed-local-media-path /llm/models/media` when starting the vLLM service, and then you can send a video like:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "Qwen2.5-VL-7B-Instruct",...
```
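
For completeness, here is a hedged sketch of what a full request body might look like; the `video_url` content part, the `file://` URL form, the file name under `/llm/models/media`, and the prompt text are assumptions, not copied from the original comment:

```bash
# Send a local video (allowed by --allowed-local-media-path) to the chat endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "video_url",
         "video_url": {"url": "file:///llm/models/media/sample.mp4"}},
        {"type": "text", "text": "Describe this video."}
      ]
    }],
    "max_tokens": 128
  }'
```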

Will be fixed by this [pr](https://github.com/intel/ipex-llm/pull/13178)

vLLM 0.5.4 does not support the Qwen2-VL model yet. We will support it in the upcoming 0.6.1 version.

Yes, even the official version of vLLM does not support it in 0.5.4; support only arrives in 0.6.1.

It is recommended to run Llama, Qwen, and ChatGLM models, for example `Llama-2-7b-chat-hf`, `Qwen1.5-7B-Chat`, and `chatglm3-6b`.

The b21 image actually has a bug in chunked prefill, which will be fixed in the next version. But it seems that multimodal models cannot use chunked prefill on the v0 engine...
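
Until that fix lands, one possible workaround is simply not enabling chunked prefill when serving a multimodal model on the v0 engine; the entrypoint, model path, and port in this sketch are assumptions:

```bash
# Start the service without --enable-chunked-prefill so the v0 engine
# uses the regular (non-chunked) prefill path
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --model /llm/models/Qwen2.5-VL-7B-Instruct \
  --served-model-name Qwen2.5-VL-7B-Instruct \
  --port 8000
```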

I couldn't reproduce this error. Did you encounter this issue when starting vLLM, or while running the benchmark?