ChatGLM-6B
Is there a better way to speed up single-machine multi-GPU inference with Hugging Face + ChatGLM-6B?
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
In practice, the approach from utils.py on the project page:

```python
from utils import load_model_on_gpus

model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=4)
```

has almost exactly the same inference speed as the approach from https://huggingface.co/docs/transformers/perf_infer_gpu_one:

```python
max_memory_mapping = {0: "30GB", 1: "30GB", 2: "30GB", 3: "30GB"}
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    trust_remote_code=True,
    device_map="auto",
    max_memory=max_memory_mapping,
).half()
```
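A quick way to check whether the two approaches end up with the same layer placement (which would explain the identical speed) is to print the device map that accelerate records on the loaded model; a minimal sketch, assuming either `model` object from above:

```python
# Sketch: inspect where accelerate placed each submodule. If both loading
# paths produce a similar map (layers split across GPUs 0-3), the identical
# inference speed follows: a forward pass visits the GPUs one after another.
print(model.hf_device_map)
```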
Expected Behavior
No response
Steps To Reproduce
None
Environment
None
Anything else?
No response
How can this script be used for single-machine multi-GPU inference?
The purpose of this script is to run the model across multiple GPUs that each have limited VRAM; it does not use multiple GPUs to speed up inference.
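For context, a simplified sketch of the idea behind such a script (not the repository's exact utils.py): each transformer layer is pinned to one GPU and accelerate dispatches the weights. Because a token's forward pass must traverse the layers in order, the GPUs compute sequentially, which adds memory capacity but not speed. The module names below assume chatglm-6b's layout.

```python
# Simplified sketch (assumptions: 28 transformer layers, chatglm-6b
# module names such as "transformer.layers.{i}").
from accelerate import dispatch_model
from transformers import AutoModel

def naive_device_map(num_gpus: int, num_layers: int = 28) -> dict:
    # Keep the embeddings, final layernorm, and LM head on GPU 0, and
    # spread the transformer layers evenly over the available GPUs.
    device_map = {
        "transformer.word_embeddings": 0,
        "transformer.final_layernorm": 0,
        "lm_head": 0,
    }
    per_gpu = (num_layers + num_gpus - 1) // num_gpus
    for i in range(num_layers):
        device_map[f"transformer.layers.{i}"] = i // per_gpu
    return device_map

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half()
model = dispatch_model(model, device_map=naive_device_map(num_gpus=4))
```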
@duzx16 So you mean the model doesn't fit in a single GPU's memory, so it has to be spread across several GPUs? Do you happen to know how to train the model with data parallelism on ChatGLM?
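One common route for data-parallel training is PyTorch DDP launched with torchrun; a generic sketch (not ChatGLM-specific guidance from this thread), where every GPU holds a full model copy and gradients are all-reduced across ranks:

```python
# Minimal DDP sketch; launch with: torchrun --nproc_per_node=4 train.py
# Generic PyTorch data parallelism, not ChatGLM-specific code. Note that a
# 6B model may not fit one full copy per GPU; ZeRO/DeepSpeed or parameter-
# efficient tuning (e.g. LoRA/P-Tuning) may be needed in practice.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModel

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b", trust_remote_code=True
).half().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
# ... then build a DataLoader with a DistributedSampler and run the usual
# forward/backward/step loop; each rank trains on a different data shard.
```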
Have you ever run into this error: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
My original loading code was `model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map='auto').half().cuda()`, and it raised the same error you describe. After I removed the `.cuda()`, the problem went away. Hope this helps.
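In other words, something like the following should load cleanly; a sketch of the corrected call (`model_path` as in the comment above):

```python
# With device_map="auto", accelerate has already placed each submodule on a
# specific GPU; a trailing .cuda() moves everything onto the current device
# and conflicts with that placement, hence the two-devices error.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    model_path,            # path to the chatglm-6b weights
    trust_remote_code=True,
    device_map="auto",
).half()                   # no .cuda() here
model.eval()
```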