
Multi-GPU inference error

Open TerryYiDa opened this issue 1 year ago • 8 comments

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:7!

TerryYiDa · Apr 26 '24

Hi, thanks for the feedback. Other users have run into this problem as well; we will prioritize fixing it.

czczup · Apr 26 '24

Looking forward to the fix. Kudos to your project!

baigang666 · Apr 28 '24

@czczup I ran into the same issue. Could you fix this and help us fully experience this perfect project? Appreciated!

hyhzl · Apr 29 '24

You can see an example of multi-GPU inference at https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5, and a similar solution in https://github.com/OpenGVLab/InternVL/issues/96. Part of the code requires modification:

# modeling_internvl_chat.py, line 353
# move the ViT embeddings onto the same device as input_embeds before the assignment
input_embeds[selected] = vit_embeds.reshape(-1, C).to(input_embeds.device)
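
For reference, a hedged sketch of what multi-GPU loading along the lines of that model card example might look like (exact arguments may differ from the card):

import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL-Chat-V1-5'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto'  # let accelerate shard the 26B model across the visible GPUs
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)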

NiYueLiuFeng · Apr 29 '24

The same problem occurs with OpenGVLab/InternVL-Chat-V1-5-Int8 as well.

gaodianzhuo · Apr 29 '24

Same problem with the OpenGVLab/InternVL-Chat-V1-5-Int8 version too. Please fix it.

winca · Apr 30 '24

There is a hack you can use to fix this problem: manually aligning all the devices in the transformers library code. But that requires an ad hoc fix.
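
A minimal sketch of that idea on the caller side; the helper name align_devices and the attribute lookups (model.get_input_embeddings(), model.vision_model) are illustrative assumptions, not the library's API:

import torch

def align_devices(model, input_ids, attention_mask, pixel_values):
    # Send text inputs to the device that holds the token embedding table
    # and image inputs to the device that holds the vision tower, so each
    # sub-module receives tensors on its own device.
    text_device = model.get_input_embeddings().weight.device
    vision_device = next(model.vision_model.parameters()).device
    return (input_ids.to(text_device),
            attention_mask.to(text_device),
            pixel_values.to(vision_device))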

kyleliang919 · May 09 '24

Looking forward to a fix, same problem here!

chengligen · May 10 '24


Many people have encountered this bug, and we haven't yet found a good method to handle all cases. However, there is a workaround that requires manually assigning devices to the model.

For example, deploying this 26B model on two V100 GPUs:

The model has 26B parameters in total, so ideally each card holds 13B. After excluding the 6B ViT, card 0 still has room for about 7B, which means roughly 1/3 of the 20B LLM goes on card 0 and 2/3 on card 1. With 48 transformer layers, that works out to layers 0-15 on card 0 and layers 16-47 on card 1.

In code, it would look like this:

import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL-Chat-V1-5'

device_map = {
    'vision_model': 0,
    'mlp1': 0,
    'language_model.model.tok_embeddings': 0,  # near the first layer of LLM
    'language_model.model.norm': 1,  # near the last layer of LLM
    'language_model.output.weight': 1  # near the last layer of LLM
}
for i in range(16):  # first 16 of the 48 LLM layers on GPU 0
    device_map[f'language_model.model.layers.{i}'] = 0
for i in range(16, 48):  # remaining 32 layers on GPU 1
    device_map[f'language_model.model.layers.{i}'] = 1
print(device_map)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map
).eval()
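
For more than two GPUs, the same split can be generated programmatically. Below is a minimal sketch under the assumptions above (48 LLM layers, with the 6B ViT budgeted as roughly 16 layers' worth of memory on GPU 0); the helper name split_model is illustrative, not part of the library:

import math

def split_model(num_layers=48, num_gpus=2, vit_layer_equivalent=16):
    # Budget GPU 0 as if the ViT already occupied `vit_layer_equivalent` layers,
    # then distribute the real LLM layers over the remaining capacity.
    per_gpu = math.ceil((num_layers + vit_layer_equivalent) / num_gpus)
    device_map = {
        'vision_model': 0,
        'mlp1': 0,
        'language_model.model.tok_embeddings': 0,
        'language_model.model.norm': num_gpus - 1,
        'language_model.output.weight': num_gpus - 1,
    }
    layer = 0
    for gpu in range(num_gpus):
        budget = per_gpu - (vit_layer_equivalent if gpu == 0 else 0)
        for _ in range(budget):
            if layer == num_layers:
                break
            device_map[f'language_model.model.layers.{layer}'] = gpu
            layer += 1
    return device_map

# split_model() with the defaults reproduces the two-GPU map above.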

czczup · May 30 '24