Is there an existing issue / discussion for this?

[X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

[X] I have searched FAQ

Current Behavior

Inference with Qwen-72B-Chat-Int4 on 2× A100 40G takes as long as 257 s. Is this normal?

Expected Behavior

No response

Steps To Reproduce

import time
from modelscope import AutoTokenizer, AutoModelForCausalLM

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("/opt/models/model_repository/Qwen-72B-Chat-Int4", revision='master', trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    "/opt/models/model_repository/Qwen-72B-Chat-Int4", revision='master',
    device_map="auto",
    trust_remote_code=True
).eval()
start = time.time()
response, history = model.chat(tokenizer, "讲一个小故事", history=None)
end = time.time()
print(response)
print("infer time:", end-start)

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

Dec 12 '23 03:12 zhudongwork

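How long is the output? Please compare against the speed notes in the README: https://github.com/QwenLM/Qwen#inference-performance (Also keep in mind that with transformers inference, multi-GPU is slower than single-GPU.)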

Dec 12 '23 04:12 jklj077

The output is not very long either:

在一个寒冷的冬日，小明走在回家的路上。他看见一只小鸟掉在了地上，冻得瑟瑟发抖。小明心疼极了，他把小鸟捧在手心里，用自己的体温温暖它。过了一会儿，小鸟渐渐恢复了活力。它感激地看了小明一眼，然后飞上了天空。小明感到非常开心，因为他做了一件好事。他知道，即使是一件小小的事情，也能给世界带来一些温暖和善意。从那天起，小明更加爱护大自然，关心身边的生命。他的善良和爱心感动了许多人，也让他变得更加自信和快乐。这个小故事告诉我们，无论我们身处何处，都应该保持善良和关爱之心。只有这样，我们的世界才会变得更加美好。

About 153 tokens.
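To compare against the README figures, the numbers reported so far can be turned into a rough throughput. This is only a sketch: it assumes the whole 257 s was spent generating the roughly 153 tokens, and `tokens_per_second` is a name made up here for illustration.

```python
def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Average generation throughput in tokens per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


# Figures reported in this thread: about 153 tokens in 257 s on 2x A100 40G.
# Assumption: model loading time is excluded, as in the reproduction script.
rate = tokens_per_second(153, 257.0)
print(f"{rate:.2f} tokens/s")  # prints "0.60 tokens/s"
```

That works out to well under one token per second, which is the figure to hold up against the README table.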

Dec 12 '23 04:12 zhudongwork

I ran into the same problem: with 4× V100, asking it to tell a story took more than 10 minutes... How can this be solved?

Dec 13 '23 01:12 boquanzhou

Same here. With the 72B-Int4 model deployed in Docker, both single-GPU and dual-GPU inference are very slow.

Dec 13 '23 15:12 BUJIDAOVS

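Running in Docker is much slower than running directly on the physical machine: inference takes 46 seconds in Docker versus 2 seconds on the physical machine.

Physical machine environment

5× V100 (16G)
Python 3.10.13
NVIDIA-SMI 535.129.03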
Driver Version: 535.129.03
CUDA Version: 12.2
PyTorch Version: 2.1.2

Dec 19 '23 05:12 sheiy

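@sheiy @zhudongwork @BUJIDAOVS @boquanzhou Hello. If you are deploying the 72B quantized model in Docker, the slow inference is caused by a problem with the auto-gptq version shipped in the earlier Docker image (see this issue). The latest Docker image fixes this, so please pull the latest image and try again.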
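To see which auto-gptq build an environment actually has (inside the container versus on the host), something like the following may help. It is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and since this thread does not state which auto-gptq version contains the fix, the script only reports what is installed rather than judging it.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Run this both inside the Docker container and on the host, then compare.
for name in ("auto-gptq", "transformers", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If the two environments report different auto-gptq versions, that would be consistent with the explanation above.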

Dec 25 '23 04:12 fyabc

@fyabc Thanks.

Dec 25 '23 06:12 sheiy

Inference is very slow even on a single A100 80G, and I have already changed the configured maximum limit to 60G.

Apr 22 '24 07:04 terence-wu
