chatglm.cpp: first-token inference is slower than even the Python version

Open ml2tao opened this issue 2 years ago • 0 comments

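Hello, and many thanks to the author for this work and for sharing it so generously. Comparing the two implementations, I found the following two problems:

1. With chatglm-6b, first-token inference in chatglm.cpp is several times slower than in the Python version, especially when the input length exceeds 100.
2. When the input is longer than 1000 characters, chatglm.cpp gives worse results: its output is more than 50% shorter than the Python version's.

Machine: CPU model Intel(R) Xeon(R) Platinum 8475B, 16 CPU cores, 60Gi memory.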

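Model precision	Inference implementation	Input length (chars)	Output length (tokens)	Time to first token	Total time (non-streaming)	Total time	Avg. time per remaining token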
float16	Python	32	215	0.7445s	34.1369s	35.3059s	0.1615s
float16	Python	257	306	1.3713s	50.179s	50.4559s	0.1654s
float16	Python	512	269	2.8002s	46.7511s	48.0962s	0.169s
float16	Python	1024	227	4.7105s	44.6898s	44.3965s	0.1756s
float16	Python	24	282	0.5863s	46.4415s	45.34s	0.1593s
float16	chatglm.cpp	32	217	0.5475s		21.1821s	0.0955s
float16	chatglm.cpp	257	308	4.0019s		33.6029s	0.0964s
float16	chatglm.cpp	512	271	9.6735s		35.9047s	0.0972s
float16	chatglm.cpp	1024	98	15.9248s		25.3531s	0.0972s
float16	chatglm.cpp	24	284	0.5491s		27.6705s	0.0958s
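
To make the comparison concrete, here is a minimal sketch that pairs the rows above by input length and computes how much slower the first token is and how much shorter the output is. The function names are hypothetical, and the formula for the last column, (total time − time to first token) / (output tokens − 1), is an assumption; it reproduces the reported chatglm.cpp values to four decimal places.

```python
# Rows copied from the table above:
# (input_chars, output_tokens, first_token_s, total_s)
python_rows = [
    (32, 215, 0.7445, 35.3059),
    (257, 306, 1.3713, 50.4559),
    (512, 269, 2.8002, 48.0962),
    (1024, 227, 4.7105, 44.3965),
    (24, 282, 0.5863, 45.34),
]
cpp_rows = [
    (32, 217, 0.5475, 21.1821),
    (257, 308, 4.0019, 33.6029),
    (512, 271, 9.6735, 35.9047),
    (1024, 98, 15.9248, 25.3531),
    (24, 284, 0.5491, 27.6705),
]


def per_remaining_token(output_tokens, first_token_s, total_s):
    # Assumed definition: time after the first token, spread over
    # every token that follows it.
    return (total_s - first_token_s) / (output_tokens - 1)


def compare(py_rows, cpp_rows):
    # Index the Python rows by input length so each chatglm.cpp row
    # is compared against the run with the same prompt size.
    py_by_len = {row[0]: row for row in py_rows}
    result = {}
    for n_in, n_out, first, total in cpp_rows:
        _, py_out, py_first, _ = py_by_len[n_in]
        result[n_in] = {
            "first_token_ratio": first / py_first,  # > 1: chatglm.cpp slower
            "output_len_ratio": n_out / py_out,     # < 1: chatglm.cpp shorter
            "cpp_s_per_token": per_remaining_token(n_out, first, total),
        }
    return result


if __name__ == "__main__":
    for n_in, s in sorted(compare(python_rows, cpp_rows).items()):
        print(f"input={n_in:5d}  first-token ratio={s['first_token_ratio']:.2f}  "
              f"output ratio={s['output_len_ratio']:.2f}  "
              f"cpp s/token={s['cpp_s_per_token']:.4f}")
```

On these numbers, the chatglm.cpp first token is roughly 2.9x, 3.5x, and 3.4x slower at input lengths 257, 512, and 1024, while at 24 and 32 it is slightly faster than Python. At input length 1024 the chatglm.cpp output is 98 tokens against 227, about 43% of the Python length.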

Jul 11 '23 14:07 ml2tao