ChatGLM-6B
Inference speed benchmark?
Cool model! I'll give it a try. I'd like to know the minimal hardware requirement to reach 5 tokens/s.
I haven't measured it closely. For reference: with the default settings (no quantization, half precision), an RTX 3090 answers most questions in about 10 s, but the replies are usually a few hundred characters long; if the output is very short, it finishes in about 2-3 s. Even at the minimum requirement (INT4 quantization), you still need at least 10 GB of VRAM: it only occupies about 6 GB right after startup, but usage grows by roughly 50% after a few rounds of conversation. With a GPU that has enough VRAM, the inference speed should generally be acceptable.
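For anyone trying the INT4 setup described above, here is a minimal sketch. It assumes the `quantize()` and `chat()` helpers shipped with the ChatGLM-6B remote code and a single CUDA GPU; VRAM usage will still grow as the conversation history gets longer.

```python
# Hedged sketch: load ChatGLM-6B with INT4 quantization (assumes the repo's
# custom quantize()/chat() helpers loaded via trust_remote_code).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

# FP16 (default) needs roughly 13 GB of weights; INT4 starts around 6 GB
# and grows with the accumulated conversation context.
model = (
    AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    .quantize(4)
    .half()
    .cuda()
    .eval()
)

response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```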
It depends on your hardware, the model precision, the context length, and the generation length. I have only experimented on an A100 with FP16, and the speed is about 20-30 tokens/s. Others are welcome to share their benchmark results in this issue; please specify the environment and settings.
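To make results comparable, something like the following measurement script could be used. It is a minimal sketch, assuming the `THUDM/chatglm-6b` checkpoint in FP16 on a CUDA GPU; the prompt, `max_new_tokens`, and precision are arbitrary choices and should be reported alongside the numbers.

```python
# Hedged sketch of a tokens/s benchmark for ChatGLM-6B (FP16, single GPU).
import time
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

prompt = "请介绍一下你自己。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up run so one-time CUDA initialization does not skew the timing.
model.generate(**inputs, max_new_tokens=8)

torch.cuda.synchronize()
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```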
Could you also compare the per-token latency on GPUs such as the T4 and V100?
Reference link: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu
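For reference, the linked Optimum guide boils down to loading an ONNX-exported model with the `CUDAExecutionProvider`. Below is a hedged sketch using GPT-2 as a placeholder, since ChatGLM-6B relies on custom remote code and may not export to ONNX directly; the `export=True` argument requires a recent Optimum version.

```python
# Hedged sketch of ONNX Runtime GPU inference via Optimum (placeholder model).
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"  # placeholder; swap in an ONNX-exportable checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(
    model_id,
    export=True,                       # export to ONNX on the fly
    provider="CUDAExecutionProvider",  # run on GPU through ONNX Runtime
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```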