MiniCPM-V
Question about int4 vs. bfloat16 inference time (urgent)
I tested the MiniCPM-2B-dpo-bf16 and MiniCPM-dpo-Int4 models with the following code. Inference with MiniCPM-2B-dpo-bf16 takes a bit over 3 seconds, while MiniCPM-dpo-Int4 takes more than 10 seconds. What is the reason for this?
When using the int4 model, you should remove torch_dtype=torch.float16 from AutoModelForCausalLM.from_pretrained(). It also runs faster with vLLM, which already supports MiniCPM.
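For reference, running the bf16 model through vLLM could look roughly like the sketch below. This is only an illustrative sketch, not something posted in this thread: the model path, prompt string, and max_tokens value are placeholders, and the sampling values simply mirror the ones used later in the thread.

from vllm import LLM, SamplingParams

# Placeholder model path; substitute your local checkpoint or Hugging Face repo id.
llm = LLM(model="openbmb/MiniCPM-2B-dpo-bf16", trust_remote_code=True, dtype="bfloat16")

params = SamplingParams(temperature=0.3, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Placeholder prompt; format it according to the model card's chat template.
outputs = llm.generate(["your prompt here"], params)
print(outputs[0].outputs[0].text)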
Thanks for the reply. I changed the model loading as follows:

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

path = '/home/sft_int4'
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, device_map="cuda", trust_remote_code=True).eval()

start_time = time.time()
# input_text is the prompt, defined elsewhere.
responds, history = model.chat(tokenizer, input_text, temperature=0.3, top_p=0.8, repetition_penalty=1.05)
print(responds)
print(time.time() - start_time)

Inference still takes more than 10 seconds, which is noticeably longer than bfloat16. Could you help explain why? @iceflame89
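An aside, not part of the original exchange: the very first model.chat call also pays one-time warm-up costs (CUDA initialization, memory allocation), and the number of generated tokens varies between runs, so timing a single call can exaggerate the gap. A hedged sketch of a slightly fairer measurement, reusing the model, tokenizer, and input_text from the snippet above:

import time
import torch

# Warm-up call so one-time initialization is not counted in the measurement.
model.chat(tokenizer, input_text, temperature=0.3, top_p=0.8, repetition_penalty=1.05)

runs = 5
torch.cuda.synchronize()  # make sure pending GPU work has finished before starting the clock
start_time = time.time()
for _ in range(runs):
    responds, history = model.chat(tokenizer, input_text, temperature=0.3, top_p=0.8, repetition_penalty=1.05)
torch.cuda.synchronize()
print((time.time() - start_time) / runs)  # average seconds per generation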
Hello. I also ran one round of inference on both models with a similar prompt format. With broadly similar output, the int4 model took about 5.5 s and the bf16 model about 2-3 s. This behavior is expected: int4 inference only compresses the parameters. During the actual computation, the weights are currently dequantized back to floating point, and quantized computation is not yet supported. The situation is basically the same for large models in general at the moment.
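To illustrate that point, here is a toy sketch of weight-only quantization (purely illustrative, not MiniCPM's actual kernels or quantization scheme): the weights are stored as low-bit integers plus a scale, but they are dequantized back to floating point before each matrix multiplication, so the matmul itself saves no time and the extra dequantization step adds overhead.

import torch

def quantize_per_channel(w, n_bits=4):
    # Symmetric per-output-channel quantization: keep int values plus a float scale.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)  # int8 container for 4-bit values
    return q, scale

def int4_linear(x, q, scale, bias=None):
    # Dequantize before the matmul: the multiply still runs in floating point,
    # which is why weight-only quantization saves memory but not necessarily latency.
    w = q.to(x.dtype) * scale
    return torch.nn.functional.linear(x, w, bias)

w = torch.randn(8, 16)
x = torch.randn(2, 16)
q, scale = quantize_per_channel(w)
print((int4_linear(x, q, scale) - x @ w.t()).abs().max())  # small quantization error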
For questions of this kind, please ask in our language model repository https://github.com/OpenBMB/MiniCPM, since most of our language-model optimization work lives there. If you have more questions about MiniCPM-V, feel free to ask here.