Multi-GPU inference Error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:5 and cuda:6!

Open RussellEven opened this issue 1 year ago • 7 comments

System Info

system version: Ubuntu 20.04 LTS
cuda version: 11.8
python version: 3.10.12
torch version: 2.3.0+cu118
xformers version: 0.0.26.post1+cu118

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Reproduction

Bug Info

.../huggingface/modules/transformers_modules/cogvlm2-llama3-chat-19B/visual.py", line 83, in forward
    output = mlp_input + mlp_output
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:5 and cuda:6!

The demo script is below:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.nn.parallel import DistributedDataParallel as DDP
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1, 2, 3, 4, 5, 6, 7"
max_memory_mapping = {0: "20GB", 1: "20GB", 2: "20GB", 3: "20GB", 4: "20GB", 5: "20GB", 6: "20GB", 7: "20GB"}

#MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
MODEL_PATH = "./cogvlm2-llama3-chat-19B"

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map='auto',
    max_memory=max_memory_mapping,
    load_in_8bit=False,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
).to(DEVICE).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for _, (old_query, response) in enumerate(history):
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        print(inputs)
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))

Expected behavior

A working multi-GPU demo in the repository in the future!

RussellEven avatar May 23 '24 08:05 RussellEven

Use basic_demo/cli_demo_multi_gpus.py.

zRzRzRzRzRzRzR avatar May 23 '24 10:05 zRzRzRzRzRzRzR

I am already using basic_demo/cli_demo_multi_gpus.py, and I still get the same error.

yhygta avatar May 23 '24 11:05 yhygta

How do I run multi-GPU inference with PEFT weights for Cog?

Jayantverma2 avatar May 23 '24 13:05 Jayantverma2

I am already using basic_demo/cli_demo_multi_gpus.py, and I still get the same error.

@zRzRzRzRzRzRzR I use basic_demo/cli_demo_multi_gpus.py and get the same error:

Traceback (most recent call last):
  File "/opt/bitmatrix/src/share-serv/serv_misc/src/cg2.py", line 100, in <module>
    outputs = model.generate(**inputs, **gen_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/transformers/generation/utils.py", line 2397, in _sample
    outputs = self(
              ^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 620, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 389, in forward
    images_features = self.encode_images(images)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 361, in encode_images
    images_features = self.vision(images)
                      ^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 130, in forward
    x = self.transformer(x)
        ^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 94, in forward
    hidden_states = layer_module(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 83, in forward
    output = mlp_input + mlp_output
             ~~~~~~~~~~^~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:2!

yhygta avatar May 24 '24 02:05 yhygta

Could you try setting it to 2 GPUs?

zRzRzRzRzRzRzR avatar May 24 '24 13:05 zRzRzRzRzRzRzR

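Could you try setting it to 2 GPUs?

Hello, with two GPUs I get an out-of-memory error instead (each GPU has 22 GB of memory and is idle; no matter what I set max_memory_per_gpu to, it still reports insufficient memory).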

yhygta avatar May 27 '24 07:05 yhygta

Same problem. I have 8 P100s and cannot get it to work no matter what I try.

lvbinandylau avatar May 27 '24 07:05 lvbinandylau

same problem

valencebond avatar May 28 '24 13:05 valencebond

Allocate at least 16 GB per GPU, and use at most three GPUs.
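
A minimal sketch of that configuration, assuming the script takes a `max_memory` mapping as in the reproduction above. The helper name `build_memory_config` and the `"16GiB"` value are illustrative assumptions, not taken from the repository.

```python
import os

NUM_GPUS = 3                  # "at most three GPUs"
MAX_MEMORY_PER_GPU = "16GiB"  # assumed value; the advice is at least 16 GB per GPU


def build_memory_config(num_gpus, per_gpu):
    """Return the CUDA_VISIBLE_DEVICES string and a max_memory mapping."""
    visible = ",".join(str(i) for i in range(num_gpus))
    max_memory = {i: per_gpu for i in range(num_gpus)}
    return visible, max_memory


visible, max_memory = build_memory_config(NUM_GPUS, MAX_MEMORY_PER_GPU)
# Must be set before CUDA is initialized in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = visible
print(visible, max_memory)
```

The resulting `max_memory` dict can then be passed wherever the script currently builds its per-GPU memory limits.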

zRzRzRzRzRzRzR avatar May 29 '24 06:05 zRzRzRzRzRzRzR

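Same problem. I have 8 P100s and cannot get it to work no matter what I try.

For the P100 this is probably a driver or operator issue; you need to find a matching xformers version (if one supports this card).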

zRzRzRzRzRzRzR avatar May 29 '24 06:05 zRzRzRzRzRzRzR

Allocate at least 16 GB per GPU, and use at most three GPUs.

It worked for me with three 4090s.

alice20212 avatar May 30 '24 09:05 alice20212

With three 2080 Ti 22G cards it still runs out of memory o(╥﹏╥)o

yhygta avatar May 30 '24 10:05 yhygta

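I can also get it working with three 4090s, but inference speed does not keep up when there are many concurrent tasks. Does the repo maintainer have a way to auto-map the model across 4 or 8 GPUs?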

Andy-Zyu avatar Jun 02 '24 04:06 Andy-Zyu

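You can modify the device_map. The weights of a single layer have been assigned to different GPUs, for example like this: (image) Here, under vision.transformer.layers.8, the tensor computation ends up spanning more than one device. In my example, you can assign the whole of layers.8 to the same device.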
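
A rough sketch of that idea in plain Python, assuming the device_map is a dict from module names to device indices. The helper `pin_module_to_one_device` and the module names in the sample map are hypothetical; check the names in your own device_map before using anything like this.

```python
from collections import Counter


def pin_module_to_one_device(device_map, prefix):
    """Return a copy of device_map in which every entry at or under
    `prefix` is replaced by one entry placing the whole module on a
    single device (the one that already holds most of its entries)."""
    inside = {
        name: dev
        for name, dev in device_map.items()
        if name == prefix or name.startswith(prefix + ".")
    }
    if not inside:
        return dict(device_map)  # nothing under this prefix
    target = Counter(inside.values()).most_common(1)[0][0]
    fixed = {name: dev for name, dev in device_map.items() if name not in inside}
    fixed[prefix] = target
    return fixed


# Hypothetical excerpt of a device_map in which layer 8 straddles two GPUs.
device_map = {
    "model.vision.transformer.layers.7": 1,
    "model.vision.transformer.layers.8.input_layernorm": 1,
    "model.vision.transformer.layers.8.attention": 1,
    "model.vision.transformer.layers.8.mlp": 2,
    "model.vision.transformer.layers.9": 2,
}
fixed = pin_module_to_one_device(device_map, "model.vision.transformer.layers.8")
print(fixed)
```

The corrected dict can be passed as `device_map` when loading or dispatching the model in place of the automatically inferred one.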

liuky74 avatar Jun 20 '24 13:06 liuky74

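I can also get it working with three 4090s, but inference speed does not keep up when there are many concurrent tasks. Does the repo maintainer have a way to auto-map the model across 4 or 8 GPUs?

Did you get it working?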

sevenclay avatar Jun 30 '24 01:06 sevenclay

Same problem. It looks like the weights have been split across different GPUs?

tingxueronghua avatar Jul 15 '24 07:07 tingxueronghua

I ran into the same problem. Is there a solution?

HJT9328 avatar Jul 16 '24 07:07 HJT9328

I am already using basic_demo/cli_demo_multi_gpus.py, and I still get the same error.

I ran into the same error too. Has it been solved?

WangWei990215 avatar Aug 29 '24 02:08 WangWei990215

I am already using basic_demo/cli_demo_multi_gpus.py, and I still get the same error.

I ran into the same error too. Has it been solved?

Same here. These may help:

https://github.com/THUDM/CogVLM/issues/256

https://huggingface.co/THUDM/cogagent-chat-hf/tree/main

byerose avatar Sep 03 '24 08:09 byerose

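I can also get it working with three 4090s, but inference speed does not keep up when there are many concurrent tasks. Does the repo maintainer have a way to auto-map the model across 4 or 8 GPUs?

How do you set up concurrency?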

jacky080808 avatar Oct 28 '24 03:10 jacky080808

I got it running on four T4 cards. I changed two things. First, in the original cli_demo_multi_gpus.py, change device_map to the following:

device_map = infer_auto_device_map(
    model=model,
    max_memory={i: max_memory_per_gpu for i in range(num_gpus)},
    no_split_module_classes=["CogVLMDecoderLayer", "TransformerLayer", "Block"]
)

Second, in visual.py in the model's module, in the forward of EVA2CLIPModel, move boi and eoi to the same device as x:

        x = x.flatten(2).transpose(1, 2)
        x = self.linear_proj(x)
        boi = self.boi.expand(x.shape[0], -1, -1).to(x.device)
        eoi = self.eoi.expand(x.shape[0], -1, -1).to(x.device)
        x = torch.cat((boi, x, eoi), dim=1)
        return x
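
To check whether a device_map still splits a layer before dispatching the model, something like the following sketch may help. It assumes the device_map is a plain dict and that layer blocks are named `...layers.<index>`; the helper `find_split_blocks` and the sample module names are hypothetical.

```python
import re
from collections import defaultdict


def find_split_blocks(device_map, block_pattern=r"^(.*?layers\.\d+)(?:\.|$)"):
    """Group device_map entries by their enclosing layer block and
    return the blocks whose entries sit on more than one device."""
    devices_per_block = defaultdict(set)
    for name, device in device_map.items():
        match = re.match(block_pattern, name)
        if match:
            devices_per_block[match.group(1)].add(device)
    return {
        block: sorted(devices, key=str)
        for block, devices in devices_per_block.items()
        if len(devices) > 1
    }


# Hypothetical maps: one with a split layer, one with the layer kept whole.
split_map = {
    "model.vision.transformer.layers.8.attention": 1,
    "model.vision.transformer.layers.8.mlp": 2,
    "model.layers.0": 2,
}
whole_map = {
    "model.vision.transformer.layers.8": 1,
    "model.layers.0": 2,
}
print(find_split_blocks(split_map))
print(find_split_blocks(whole_map))
```

An empty result means no block matching the pattern spans devices. It does not cover standalone parameters such as boi and eoi, which is why the second change above is still needed.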

hxuaj avatar Oct 31 '24 10:10 hxuaj