ChatGLM-6B
[Feature] int4 inference speed on GPU seems slower than llama.cpp's CPU inference speed
Is your feature request related to a problem? Please describe.
llama.cpp can even run int4 inference on a 30B model, yet ChatGLM-6B on GPU is even slower than that 30B. Why is that?
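For anyone trying to reproduce this, a minimal timing sketch for int4 decoding on GPU is below. The prompt and token count are arbitrary assumptions; `quantize(4)` is the int4 path documented in this repo's README.

```python
# Sketch: measure ChatGLM-6B int4 decode throughput on GPU.
# Prompt and max_new_tokens are assumptions; adjust to your setup.
import time

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = (
    AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    .half()
    .quantize(4)  # int4 weight quantization provided by this repo
    .cuda()
    .eval()
)

inputs = tokenizer("请介绍一下清华大学", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```

The same tokens/s number can then be compared against llama.cpp's reported per-token timings on CPU.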
Solutions
The model's inference needs to be sped up.
Additional context
No response
I think one possible reason is the tokenizer. The vocab size of LLaMA is 32,000, but that of ChatGLM-6B is about 150,000.
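For reference, a quick sketch to check both vocabulary sizes is below (the LLaMA path is an assumption; point it at whatever checkpoint you have). A larger vocabulary enlarges the embedding table and the final output projection, so it adds compute at every decoding step, though it also encodes Chinese text in fewer tokens.

```python
# Sketch: compare the two tokenizers' vocab sizes.
# "path/to/llama-7b" is a placeholder for a local LLaMA checkpoint.
from transformers import AutoTokenizer

chatglm_tok = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
llama_tok = AutoTokenizer.from_pretrained("path/to/llama-7b")

print("ChatGLM-6B vocab size:", chatglm_tok.vocab_size)  # ~150k, per the comment above
print("LLaMA-7B vocab size:", llama_tok.vocab_size)      # 32,000
```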
@DongqiShen That could be a possible reason. But why do the tokenizers vary so much? LLaMA also includes Chinese tokens, and even more tokens from other languages.
@jinfagang The member's explanation is here. I don't know much about tokenizers, sorry about that. However, from my tests, ChatGLM-6B performs much better than LLaMA-7B in Chinese.
Let's not even get into LLaMA-7B's Chinese continuation quality... it's basically unusable. After testing it, you pretty much just want to delete the model. The base model still needs extensive fine-tuning. Unless someone releases a Chinese chatLlama, it's a lost cause.
@DongqiShen The comparison is unfair: LLaMA is not trained specifically for chat, so Alpaca might be a more suitable baseline for chat. My point is not about output quality, though. Comparing speed alone, the 6B model is still slower than a 13B, imo.