Multi-GPU inference Error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:5 and cuda:6!
System Info
- System version: Ubuntu 20.04 LTS
- CUDA version: 11.8
- Python version: 3.10.12
- torch version: 2.3.0+cu118
- xformers version: 0.0.26.post1+cu118
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Reproduction
Bug Info
.../huggingface/modules/transformers_modules/cogvlm2-llama3-chat-19B/visual.py", line 83, in forward
    output = mlp_input + mlp_output
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:5 and cuda:6!
The demo script is below:
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.nn.parallel import DistributedDataParallel as DDP
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1, 2, 3, 4, 5, 6, 7"
max_memory_mapping = {0: "20GB", 1: "20GB", 2: "20GB", 3: "20GB", 4: "20GB", 5: "20GB", 6: "20GB", 7: "20GB"}

#MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
MODEL_PATH = "./cogvlm2-llama3-chat-19B"

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map='auto',
    max_memory=max_memory_mapping,
    load_in_8bit=False,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
).to(DEVICE).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for _, (old_query, response) in enumerate(history):
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        print(inputs)
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))
Expected behavior
A working multi-GPU demo in the repo.
Use basic_demo/cli_demo_multi_gpus.py.
I am already using basic_demo/cli_demo_multi_gpus.py and still get the same error.
How can I run multi-GPU inference with PEFT weights on CogVLM2?
@zRzRzRzRzRzRzR I used basic_demo/cli_demo_multi_gpus.py and got the same error:
Traceback (most recent call last):
  File "/opt/bitmatrix/src/share-serv/serv_misc/src/cg2.py", line 100, in <module>
    outputs = model.generate(**inputs, **gen_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/transformers/generation/utils.py", line 2397, in _sample
    outputs = self(
              ^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 620, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 389, in forward
    images_features = self.encode_images(images)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 361, in encode_images
    images_features = self.vision(images)
                      ^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 130, in forward
    x = self.transformer(x)
        ^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 94, in forward
    hidden_states = layer_module(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 83, in forward
    output = mlp_input + mlp_output
             ~~~~~~~~~~^~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:2!
Could you try setting it to 2 GPUs?
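Hello. With two GPUs I get an out-of-memory error instead (each card has 22 GB of VRAM and is idle; whatever value I set max_memory_per_gpu to, it still reports out of memory).
Same problem. 8 P100s, and nothing I try works.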
same problem
Allocate at least 16 GB per GPU, and use at most three GPUs.
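On the P100 it is most likely a driver or operator (kernel) issue. You would need to find a matching xformers version, if one supports this card at all.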
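It worked for me with three 4090s.
Three 2080 Ti 22G cards, still out of memory o(╥﹏╥)o
It also works for me with three 4090s, but inference speed does not keep up under many concurrent tasks. Do the repo maintainers have a way to auto-map the model across 4 or 8 GPUs?
You can modify device_map. The weights of a single layer have been assigned to different GPUs, for example like this:
Here, under vision.transformer.layers.8, the tensor computation ends up on different devices. In my example you could move all of layer.8 onto the same device.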
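The suggestion above can be sketched without loading the model, since a device map is just a dictionary from module names to devices and a split layer shows up as child entries of one layer sitting on different devices. This is a minimal illustration rather than the commenter's code: the helper names `find_split_layers` and `pin_layer` and the sample module names are made up for the example.

```python
import re
from collections import defaultdict

# Matches the per-layer prefix of a module path, e.g. "...layers.8".
LAYER_PREFIX = re.compile(r"^(.*\.layers\.\d+)(?:\.|$)")


def find_split_layers(device_map):
    """Return {layer_prefix: devices} for layers whose entries span >1 device."""
    devices_by_layer = defaultdict(set)
    for name, device in device_map.items():
        match = LAYER_PREFIX.match(name)
        if match:
            devices_by_layer[match.group(1)].add(device)
    return {p: d for p, d in devices_by_layer.items() if len(d) > 1}


def pin_layer(device_map, prefix, device=None):
    """Return a copy of device_map with every entry under `prefix` collapsed
    into one entry on a single device (default: the first child's device)."""
    fixed = {}
    for name, current in device_map.items():
        if name == prefix or name.startswith(prefix + "."):
            if device is None:
                device = current
            fixed[prefix] = device  # children are replaced by the parent entry
        else:
            fixed[name] = current
    if prefix not in fixed:
        raise KeyError(f"no entries found under {prefix!r}")
    return fixed


# Illustrative map only; real keys come from the loaded model.
sample_map = {
    "model.embed_tokens": 0,
    "model.vision.transformer.layers.7": 0,
    "model.vision.transformer.layers.8.attention": 0,
    "model.vision.transformer.layers.8.mlp": 1,
    "model.vision.transformer.layers.9": 1,
}

split = find_split_layers(sample_map)
fixed_map = sample_map
for layer_prefix in split:
    fixed_map = pin_layer(fixed_map, layer_prefix)
print(fixed_map)
```

On a real model the input would presumably be `model.hf_device_map`, and the corrected dictionary could be passed back as `device_map=` when reloading. Whether the pinned layer still fits in the chosen GPU's memory budget has to be checked separately.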
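Did it work?
Same problem. It looks like the weights are being split across different GPUs?
I ran into the same problem. Looking for a solution.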
I hit the same error too. Has it been solved?
Same here. These may be useful references:
https://github.com/THUDM/CogVLM/issues/256
https://huggingface.co/THUDM/cogagent-chat-hf/tree/main
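It works for me with three 4090s too, but inference speed does not keep up under many concurrent tasks. Is there no way for the repo maintainers to auto-map the model across 4 or 8 GPUs?
How do you set up concurrency?
I got it running on four T4 cards. I changed two things. First, in the original cli_demo_multi_gpus.py, change device_map to the following: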
device_map = infer_auto_device_map(
    model=model,
    max_memory={i: max_memory_per_gpu for i in range(num_gpus)},
    no_split_module_classes=["CogVLMDecoderLayer", "TransformerLayer", "Block"]
)
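For readers without the demo script at hand, here is a rough sketch of how that call might sit in an accelerate-based loading flow. It is an assumption about the surrounding code, not a copy of cli_demo_multi_gpus.py: `make_max_memory` and `load_sharded` are hypothetical helper names, and the "16GiB" default is only a placeholder budget.

```python
def make_max_memory(num_gpus, per_gpu="16GiB"):
    """Build the {gpu_index: budget} mapping passed as `max_memory`."""
    return {i: per_gpu for i in range(num_gpus)}


def load_sharded(model_path, per_gpu="16GiB"):
    """Hypothetical helper: shard the checkpoint across all visible GPUs
    while keeping each listed layer class on a single device."""
    # Imported lazily so make_max_memory stays usable without a GPU stack.
    import torch
    from accelerate import (
        infer_auto_device_map,
        init_empty_weights,
        load_checkpoint_and_dispatch,
    )
    from transformers import AutoModelForCausalLM

    # Same dtype rule as the reproduction script: bf16 on Ampere or newer.
    dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16

    # Build the module tree without allocating real weights.
    with init_empty_weights():
        model = AutoModelForCausalLM.from_pretrained(
            model_path, torch_dtype=dtype, trust_remote_code=True
        )

    device_map = infer_auto_device_map(
        model=model,
        max_memory=make_max_memory(torch.cuda.device_count(), per_gpu),
        # Classes listed here are never split across devices.
        no_split_module_classes=["CogVLMDecoderLayer", "TransformerLayer", "Block"],
    )
    model = load_checkpoint_and_dispatch(
        model, model_path, device_map=device_map, dtype=dtype
    )
    return model.eval(), device_map
```

The returned `device_map` can be printed to confirm that no `layers.N` entry appears with sub-modules on different devices.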
Second, in visual.py in the model's module directory, inside the forward of EVA2CLIPModel, move boi and eoi to the same device as x:
x = x.flatten(2).transpose(1, 2)
x = self.linear_proj(x)
boi = self.boi.expand(x.shape[0], -1, -1).to(x.device)
eoi = self.eoi.expand(x.shape[0], -1, -1).to(x.device)
x = torch.cat((boi, x, eoi), dim=1)
return x