ChatGLM-6B
Inference speed benchmark?
Cool model! I'll give it a try. I'd like to know the minimal hardware requirement to reach 5 tokens/s.
I haven't measured it closely. For reference: with the default settings (no quantization, half precision), an RTX 3090 answers most questions in about 10 s, but the replies are usually a few hundred characters long; if the output is very short, it finishes in about 2-3 s. Even at the minimum requirement (INT4 quantization), you still need at least 10 GB of VRAM: it only occupies about 6 GB right after startup, but usage grows by roughly 50% after a few rounds of conversation. With a GPU that has enough VRAM, the inference speed should generally be acceptable.
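For anyone trying the INT4 setup described above, here is a minimal sketch. It assumes the `quantize()` and `chat()` helpers shipped with the ChatGLM-6B remote code and a single CUDA GPU; VRAM usage will still grow as the conversation history gets longer.

```python
# Hedged sketch: load ChatGLM-6B with INT4 quantization (assumes the repo's
# custom quantize()/chat() helpers loaded via trust_remote_code).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

# FP16 (default) needs roughly 13 GB of weights; INT4 starts around 6 GB
# and grows with the accumulated conversation context.
model = (
    AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    .quantize(4)
    .half()
    .cuda()
    .eval()
)

response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```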
It depends on your hardware, the model precision, the context length, and the generation length. I have only experimented on an A100 with FP16, and the speed is about 20-30 tokens/s. Others are welcome to share their benchmark results in this issue; please specify the environment and settings.
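To make results comparable, something like the following measurement script could be used. It is a minimal sketch, assuming the `THUDM/chatglm-6b` checkpoint in FP16 on a CUDA GPU; the prompt, `max_new_tokens`, and precision are arbitrary choices and should be reported alongside the numbers.

```python
# Hedged sketch of a tokens/s benchmark for ChatGLM-6B (FP16, single GPU).
import time
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

prompt = "请介绍一下你自己。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up run so one-time CUDA initialization does not skew the timing.
model.generate(**inputs, max_new_tokens=8)

torch.cuda.synchronize()
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```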
Could you also compare the per-token latency on GPUs such as the T4 and V100?
Reference link: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu
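For reference, the linked Optimum guide boils down to loading an ONNX-exported model with the `CUDAExecutionProvider`. Below is a hedged sketch using GPT-2 as a placeholder, since ChatGLM-6B relies on custom remote code and may not export to ONNX directly; the `export=True` argument requires a recent Optimum version.

```python
# Hedged sketch of ONNX Runtime GPU inference via Optimum (placeholder model).
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"  # placeholder; swap in an ONNX-exportable checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(
    model_id,
    export=True,                       # export to ONNX on the fly
    provider="CUDAExecutionProvider",  # run on GPU through ONNX Runtime
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```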