Qwen3 Deployed Qwen1.5-32B-Chat-GPTQ-Int4 and it runs, but the message "CUDA extension not installed" appears and inference is very slow

Apr 09 '24 06:04 qingwu11

Same question.

Apr 09 '24 07:04 Dssssds

I ran into this with Qwen1. Switching to different versions of auto-gptq and optimum fixed it.

Apr 09 '24 07:04 WSR-wsr

It really is extremely slow. How can this be fixed?

Apr 09 '24 07:04 yihaozuifan

It feels like the only option is to throw more hardware at it.

Apr 09 '24 12:04 Dssssds

Reinstall auto-gptq and optimum. Note that you need to pick the auto-gptq installation method that matches your CUDA version.

Apr 10 '24 06:04 kratorado
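Before reinstalling, it can help to see which CUDA version your PyTorch build targets and which package versions are currently installed. The following is a minimal sketch, assuming the pip distribution names `torch`, `auto-gptq`, `optimum`, and `transformers`; the helpers `installed_version` and `report_environment` are hypothetical names used for illustration, not part of any library.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report_environment():
    """Collect the versions that matter when picking an auto-gptq build."""
    report = {name: installed_version(name)
              for name in ("torch", "auto-gptq", "optimum", "transformers")}
    try:
        import torch
        # torch.version.cuda is None for CPU-only builds of PyTorch.
        report["torch_cuda"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["torch_cuda"] = None
        report["cuda_available"] = False
    return report


if __name__ == "__main__":
    for key, value in report_environment().items():
        print(f"{key}: {value}")
```

If `torch_cuda` does not match the CUDA version that your auto-gptq build was made for, that mismatch is a plausible cause of the slow fallback path.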

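32B-Chat-AWQ on an A100 40G takes about 20 seconds to produce a reply of roughly the same content, while 14B-Chat-AWQ on a 4090 24G replies within about 6 seconds.

Is there some configuration I need so that 32B-Chat-AWQ inference runs faster?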

Apr 11 '24 01:04 weiminw
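Wall-clock time for "roughly the same content" is hard to compare across models, since reply lengths differ. Tokens per second is usually a fairer measure. A minimal sketch of such a measurement follows. `generate_fn` is a hypothetical callable that you would wrap around your own model call so that it returns the number of newly generated tokens; it is not an API of transformers or vLLM.

```python
import time


def tokens_per_second(generate_fn, prompt, runs=3):
    """Average generation throughput over several runs.

    generate_fn is a hypothetical callable that takes a prompt and
    returns the number of newly generated tokens.
    """
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate_fn(prompt)
        total_seconds += time.perf_counter() - start
    return total_tokens / total_seconds
```

Running it once before timing, so that model loading and warm-up are excluded, should give more stable numbers.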

For the installation of auto-gptq, we advise you to install from source (git clone the repo and run pip install -e .); otherwise you will run into the "CUDA extension not installed" issue.

Apr 13 '24 16:04 JustinLin610
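After installing, one way to check whether the compiled kernels actually landed in your environment is to ask Python whether it can locate them. This is a sketch under assumptions: the module names below are based on how the auto-gptq 0.6/0.7 sources import their compiled kernels and may differ in other releases, and `find_kernels` is a hypothetical helper name.

```python
import importlib.util

# Assumed names of the compiled kernel modules shipped with auto-gptq;
# other releases may ship a different set.
KERNEL_MODULES = ("autogptq_cuda_64", "autogptq_cuda_256",
                  "exllama_kernels", "exllamav2_kernels")


def find_kernels(names=KERNEL_MODULES):
    """Map each module name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None
            for name in names}


if __name__ == "__main__":
    for name, found in find_kernels().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If the `autogptq_cuda_*` modules are reported missing, the extension was probably not built for your environment, which would be consistent with the message in the title.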

It really is extremely slow. How can this be fixed?

Here is my requirements.txt for reference. With it, qwen/Qwen1___5-32B-Chat-GPTQ-Int4 runs at normal speed for me. requirements.txt

Apr 15 '24 10:04 zengraoli

Qwen1.5-14B-Chat-GPTQ-Int4, tested: with transformers==4.38.2 and auto-gptq==0.6.0 the speed is fine.

Apr 17 '24 03:04 YYGe01

32B-Chat-AWQ is slow to respond when deployed with vllm4. Is there any fix?

Apr 23 '24 14:04 heart18z

Qwen1.5-14B-Chat-GPTQ-Int4, tested: with transformers==4.38.2 and auto-gptq==0.6.0 the speed is fine.

I hit the same problem. My auto-gptq version was 0.7.0; downgrading to 0.6.0 solved it.

Apr 26 '24 07:04 ATP-BME
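To see whether your environment matches the combination reported as working above, you could compare installed versions against those pins. A minimal sketch follows. The pins come from this thread (tested there on Qwen1.5-14B-Chat-GPTQ-Int4), while `compare_with_pins` is a hypothetical helper name and dropping the local build suffix (for example `+cu118`) is my own assumption about what counts as a match.

```python
from importlib import metadata

# Versions reported as working in this thread.
REPORTED_WORKING = {"transformers": "4.38.2", "auto-gptq": "0.6.0"}


def compare_with_pins(pins=REPORTED_WORKING):
    """Return {name: (installed, wanted)} for every package that differs."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            # Drop a local build suffix such as "+cu118" before comparing.
            installed = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (installed, wanted)
    return mismatches


if __name__ == "__main__":
    for name, (installed, wanted) in compare_with_pins().items():
        print(f"{name}: installed {installed}, thread reports {wanted}")
```

An empty result means both packages match the reported versions.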

Why not use AWQ? Its accuracy is slightly higher than GPTQ's, and it is easy to deploy with vLLM.

Apr 29 '24 07:04 SongXiaoMao

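Why not use AWQ? Its accuracy is slightly higher than GPTQ's, and it is easy to deploy with vLLM.

You would have to ask the official team about that.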

Apr 29 '24 08:04 zengraoli

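It really is extremely slow. How can this be fixed?

Here is my requirements.txt for reference. With it, qwen/Qwen1___5-32B-Chat-GPTQ-Int4 runs at normal speed for me. requirements.txt

Thanks! It is much faster with your configuration.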

Jun 27 '24 02:06 lijiaoyang