
[Feature] int4 inference on GPU seems even slower than llama.cpp's CPU inference

Open lucasjinreal opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe.

llama.cpp can even run int4 inference on a 30B model, yet ChatGLM-6B on GPU is slower than that 30B. Why is that?

Solutions

The model needs to be sped up.

Additional context

No response

lucasjinreal avatar Mar 17 '23 02:03 lucasjinreal
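To make the speed claim concrete, decoding throughput can be compared in tokens/sec. A minimal sketch of such a benchmark is below; the `generate` callable is a hypothetical interface (any function returning the generated token ids), not ChatGLM's or llama.cpp's actual API.

```python
import time

def tokens_per_second(generate, prompt, n_runs=3):
    """Average decoding throughput of a text-generation callable.

    `generate(prompt)` is assumed to return the list of generated
    token ids; swap in the real model call when measuring.
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return sum(rates) / len(rates)

# Dummy stand-in that "generates" 64 tokens with a fixed per-token delay,
# just to show the harness runs end to end.
def dummy_generate(prompt):
    out = []
    for i in range(64):
        time.sleep(0.001)  # simulate per-token latency
        out.append(i)
    return out

rate = tokens_per_second(dummy_generate, "hello")
print(f"{rate:.0f} tokens/sec")
```

Running the same harness against both models on the same prompt and output length would make the comparison apples-to-apples.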

I think one possible reason is the tokenizer. LLaMA's vocab size is 32,000, while ChatGLM-6B's is about 150,000.

DongqiShen avatar Mar 18 '23 15:03 DongqiShen
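The vocab size affects per-token cost through the final projection onto the vocabulary (the lm_head), which takes roughly 2·d·V FLOPs per generated token. A back-of-the-envelope comparison, assuming a hidden size of d = 4096 for both models (an illustrative assumption), suggests the larger vocab adds a fixed per-token cost but is only part of the total compute, so it cannot by itself explain a 30B-vs-6B gap:

```python
def lm_head_flops(hidden_size, vocab_size):
    # One multiply-add per weight in the output projection:
    # 2 * d * V FLOPs per generated token.
    return 2 * hidden_size * vocab_size

d = 4096  # assumed hidden size for both models (illustrative)
llama = lm_head_flops(d, 32_000)
chatglm = lm_head_flops(d, 150_000)
print(f"LLaMA lm_head:   {llama / 1e6:.0f} MFLOPs/token")
print(f"ChatGLM lm_head: {chatglm / 1e6:.0f} MFLOPs/token")
print(f"ratio: {chatglm / llama:.1f}x")
```

The ratio is just the vocab-size ratio (150,000 / 32,000 ≈ 4.7), since the hidden size cancels.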

@DongqiShen That could be a possible reason. But why does the tokenizer vary so much? LLaMA also includes Chinese tokens, and even more multilingual tokens.

lucasjinreal avatar Mar 19 '23 02:03 lucasjinreal

@jinfagang The member's explanation is here. I don't know much about tokenizers, sorry about that. However, from my tests, ChatGLM-6B performs much better than LLaMA-7B in Chinese.

DongqiShen avatar Mar 19 '23 04:03 DongqiShen

LLaMA-7B's Chinese continuation quality is hardly worth discussing; it basically never gets going, and after testing you just want to delete the model. The base model would still need extensive fine-tuning. Unless someone releases a Chinese chatLlama, it's a lost cause.

yaleimeng avatar Mar 20 '23 00:03 yaleimeng

@DongqiShen The comparison is unfair: LLaMA is not trained specifically for chat, so Alpaca might be a more suitable baseline. My point is not about quality, though. If we compare speed alone, the 6B is still slower than a 13B, imo.

lucasjinreal avatar Mar 20 '23 02:03 lucasjinreal