chatglm.cpp

Published 20 hours ago •


chatglm2 inference becomes slower when using pyfastllm

Open xuxingya opened this issue 1 year ago • 0 comments

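Environment:

Debian 11, CUDA 11.7, gcc 10.2.1. GPUs: T4, A10

Model:

chatglm2-6b-f16

Problem

After converting the original model to flm, the file is 13 GB. With both the CUDA build of fastllm and pyfastllm, on both the T4 and the A10, inference is slower than the original model by roughly one half.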

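import pyfastllm

# Excerpt from a class method: FASTLLM_PATH, prompts, and self.config are defined elsewhere.

def decode(idx: int, content: bytearray) -> str:
    # Callback passed to response(): convert the raw output bytes to text.
    return content.decode(encoding="utf-8", errors="replace")

LLM_TYPE = pyfastllm.get_llm_type(FASTLLM_PATH)
print(f"llm model: {LLM_TYPE}")
self.model = pyfastllm.create_llm(FASTLLM_PATH)
self.model.warmup()
outputs = self.model.response(prompts[0], decode, self.config)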
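
To make the "slower by roughly one half" comparison reproducible, a small timing harness could help. The following is only a sketch: `benchmark` and its parameters are hypothetical names, not part of fastllm or pyfastllm, and it uses output characters per second as a rough stand-in for tokens per second. It accepts any callable that maps a prompt to generated text, so the same function can time both backends.

```python
import time
from statistics import median
from typing import Callable, Dict


def benchmark(generate: Callable[[str], str], prompt: str,
              runs: int = 5, warmup: int = 1) -> Dict[str, float]:
    """Time generate(prompt) and report median latency and chars per second.

    `generate` is a hypothetical wrapper supplied by the caller; it is not
    an API of fastllm or pyfastllm.
    """
    # Discard warmup calls so one-time setup cost is not measured.
    for _ in range(warmup):
        generate(prompt)

    latencies = []
    lengths = []
    for _ in range(runs):
        start = time.perf_counter()
        text = generate(prompt)
        latencies.append(time.perf_counter() - start)
        lengths.append(len(text))

    med = median(latencies)
    # Characters per second is only a proxy for token throughput.
    rate = median(lengths) / med if med > 0 else float("inf")
    return {"median_s": med, "chars_per_s": rate}
```

For the setup in this issue, the callable could wrap the call shown above, for example `lambda p: self.model.response(p, decode, self.config)`, assuming `response` returns the generated text. An equivalent wrapper around the original model's generation call would give the baseline. Using the same prompt and generation settings for both keeps the comparison fair.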

Oct 09 '23 09:10 xuxingya