inference Xinference部署qwen2-vl为啥生成速度那么慢啊，感觉不正常

System Info / 系統信息

无

Running Xinference with Docker? / 是否使用 Docker 运行 Xinfernece？

[ ] docker / docker
[X] pip install / 通过 pip install 安装
[ ] installation from source / 从源码安装

Version info / 版本信息

当前最新版

The command used to start Xinference / 用以启动 xinference 的命令

页面

Reproduction / 复现过程

无

Expected behavior / 期待表现

变快

Sep 25 '24 06:09 kandada

你咋安装的。为什么我们都报错 #2361

Sep 25 '24 14:09 goactiongo

你咋安装的。为什么我们都报错 #2361

升级一下transformers。pip install git+https://github.com/huggingface/transformers

不过虽然能运行，同样显卡条件的情况下，相比其他部署方式，qwen2-vl在xinference中的生成速度非常慢。不知道怎么解决。。。

Sep 26 '24 02:09 kandada

你咋安装的。为什么我们都报错 #2361

升级一下transformers。pip install git+https://github.com/huggingface/transformers

不过虽然能运行，同样显卡条件的情况下，相比其他部署方式，qwen2-vl在xinference中的生成速度非常慢。不知道怎么解决。。。

@XprobeBot 对，而且用几下就爆显存。qwen2-vl-instruct-gptq-int8

Sep 29 '24 09:09 monk-after-90s

需要安装flash-attention，或者使用vllm推理

Sep 29 '24 09:09 zhanghaiqiangshigezhu

你咋安装的。为什么我们都报错 #2361

升级一下transformers。pip install git+https://github.com/huggingface/transformers 不过虽然能运行，同样显卡条件的情况下，相比其他部署方式，qwen2-vl在xinference中的生成速度非常慢。不知道怎么解决。。。

@XprobeBot 对，而且用几下就爆显存。qwen2-vl-instruct-gptq-int8 需要安装flash-attention，或者使用vllm推理

Sep 29 '24 09:09 zhanghaiqiangshigezhu

需要安装flash-attention，或者使用vllm推理

xinference不支持vllm部署Qwen2-vl。我安装了flash-attention，不行哎

Sep 29 '24 13:09 monk-after-90s

This issue is stale because it has been open for 7 days with no activity.

Oct 06 '24 19:10 github-actions[bot]

同，一张图片的回答花了一分多钟，显存占用25G~27G。vllm提示 Model qwen2-vl-instruct cannot be run on engine vllm.不知道是xinference不支持还是vllm引擎不支持

Oct 11 '24 02:10 Yanhuanjin

qwen2-vl-instruct 直接用 vllm 跑吧

Oct 12 '24 01:10 Valdanitooooo

System Info / 系統信息

无

Running Xinference with Docker? / 是否使用 Docker 运行 Xinfernece？

[ ] docker / docker

[x] pip install / 通过 pip install 安装

[ ] installation from source / 从源码安装

Version info / 版本信息

当前最新版

The command used to start Xinference / 用以启动 xinference 的命令

页面

Reproduction / 复现过程

无

Expected behavior / 期待表现

变快

xinference还不支持vllm 加载qwen2-vl 直接用vllm把 vllm起服务，可以正常使用

Oct 12 '24 05:10 GXKIM

This issue is stale because it has been open for 7 days with no activity.

Oct 19 '24 19:10 github-actions[bot]

用XInference推理qwen2-vl-7b-instruct特别慢，大概0.x token/s，用的是pt模型；而用MS-SWIFT部署推理qwen2-vl-7b-instruct很快，大概6 tokens/s，用的是pt模型。不太确定是什么原因~

Oct 22 '24 09:10 thinkthinking

This issue is stale because it has been open for 7 days with no activity.

Nov 06 '24 19:11 github-actions[bot]

This issue was closed because it has been inactive for 5 days since being marked as stale.

Nov 12 '24 19:11 github-actions[bot]