After upgrading the image to 0.12.1, Qwen1.5-14B-Chat-GPTQ-Int4 runs much slower than on 0.11.0

Open WholeWorld-Timothy opened this issue 1 year ago • 5 comments

Describe the bug

After upgrading the image to 0.12.1, Qwen1.5-14B-Chat-GPTQ-Int4 runs much slower than it did on 0.11.0.

To Reproduce

Upgrade the Docker image to 0.12.1 and run Qwen1.5-14B-Chat-GPTQ-Int4. Generation is much slower than on 0.11.0.

Expected behavior

The number of tokens per second after the upgrade should be the same as before the upgrade.
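To make the before/after comparison concrete, throughput can be measured the same way on both image versions. The following is only a minimal sketch: `generate` is a hypothetical callable that runs one completion and returns the generated tokens, and `fake_generate` is a stand-in so the snippet runs on its own. Neither is part of the project's API.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[str], Sequence], prompt: str, runs: int = 3) -> float:
    """Average generated tokens per second over several runs.

    `generate` is a hypothetical callable (an assumption, not a real API):
    it runs one completion and returns the generated tokens.
    """
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)
    # Guard against a zero elapsed time on very coarse clocks.
    return total_tokens / total_seconds if total_seconds > 0 else 0.0


def fake_generate(prompt: str) -> list:
    # Stand-in for a real model call: sleeps briefly and returns 20 dummy tokens.
    time.sleep(0.01)
    return ["tok"] * 20


if __name__ == "__main__":
    print(f"{tokens_per_second(fake_generate, 'hello'):.1f} tokens/s")
```

Replacing `fake_generate` with a call to the deployed model, and running the same prompt against 0.11.0 and 0.12.1, would give two numbers that can be compared directly.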

Additional context

Our startup parameter configuration: image

WholeWorld-Timothy avatar Jun 17 '24 01:06 WholeWorld-Timothy

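It is fast with vLLM, but Transformers is very slow, and I don't know why. I don't know which one the old version used; it didn't have this option.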

worm128 avatar Jun 25 '24 13:06 worm128

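GPU memory is limited, so I can't afford to use vLLM. This looks like a change in behavior. Was there really no particular change? If it became slower with no change at all, that would be strange. If there was a change, please say where it is; I could even try building a separate version that reverts it.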

WholeWorld-Timothy avatar Jun 26 '24 00:06 WholeWorld-Timothy

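> GPU memory is limited, so I can't afford to use vLLM. This looks like a change in behavior. Was there really no particular change? If it became slower with no change at all, that would be strange. If there was a change, please say where it is; I could even try building a separate version that reverts it.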

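I don't even know which version the old one was, because I pulled the image with the latest tag. The old version was fast. After the upgrade, the backend was split into vLLM and Transformers: vLLM uses more GPU memory but is fast, while Transformers is very slow, even though it uses the same amount of GPU memory as the old version.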

worm128 avatar Jun 27 '24 06:06 worm128

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Aug 06 '24 06:08 github-actions[bot]

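@worm128 Did you ever solve the slow Transformers problem? I've run into it too. vLLM had always been fast for me, but my GPU has since been downgraded, and with a compute capability below 7.5 I can no longer use vLLM, so I switched to Transformers. It is as slow as an ox: even a simple "hello" takes half a minute to get a response.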

Copilotes avatar Dec 05 '24 11:12 Copilotes