ChatGLM-6B

Inference speed benchmark?

[Open] wizd opened this issue Mar 14 '23 • 2 comments

Cool model! I'll give it a try. What are the minimal hardware requirements for ~5 tokens/s?

wizd • Mar 14 '23 18:03

I haven't measured it carefully. For reference: with the default settings (no quantization, half precision), an RTX 3090 answers most questions in about 10 s, though replies usually run to several hundred characters; a very short output finishes in about 2-3 s. Even at the minimum requirement (INT4 quantization), you still need at least 10 GB of VRAM: the model only takes about 6 GB at startup, but usage grows by roughly 50% after a few rounds of conversation. With a GPU that has enough VRAM, inference speed should generally be acceptable.
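For anyone reproducing this setup, here is a minimal loading sketch following the usage pattern shown in the ChatGLM-6B README; the device and the example query are placeholders, and the memory figures above are the commenter's own observations, not guarantees:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
# Quantize the weights to INT4 after loading. Note that VRAM use still grows
# during a conversation, since the accumulated history lengthens the context.
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().quantize(4).cuda()
model = model.eval()

response, history = model.chat(tokenizer, "Hello", history=[])
print(response)
```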

yaleimeng • Mar 15 '23 03:03

It depends on your hardware, the model precision, the context length, and the generation length. I have only experimented on an A100 with FP16, and the speed is about 20-30 tokens/s at the start of generation. Others are welcome to share their benchmark results in this issue. Please specify the environment and settings.
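To make results comparable, something like the following rough benchmark sketch could be used (this is an assumption on my part, not an official script; the prompt and `max_new_tokens=256` are arbitrary choices to report alongside the numbers):

```python
import time
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()

prompt = "Explain the difference between FP16 and INT4 quantization."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Time a single end-to-end generation and derive tokens/s from the
# number of newly generated tokens.
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```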

duzx16 • Mar 15 '23 16:03

Could you compare the per-token latency on GPUs such as the T4 and V100? Reference: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu
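As a coarse substitute for the optimum/ONNX Runtime measurement in the linked guide, one could time the gaps between chunks yielded by `stream_chat` (which yields once per decoding step); this is a rough sketch under that assumption, not the method from the guide:

```python
import time
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

# Record a timestamp each time the streaming generator yields.
timestamps = []
for response, history in model.stream_chat(tokenizer, "Introduce yourself.", history=[]):
    timestamps.append(time.perf_counter())

# Mean gap between successive yields approximates per-step decode latency.
gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
if gaps:
    print(f"mean latency per streamed chunk: {1000 * sum(gaps) / len(gaps):.1f} ms")
```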

xuguozhi • Mar 30 '23 03:03