
Multi-GPU inference error

Open TerryYiDa opened this issue 1 year ago • 8 comments

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:7!

TerryYiDa · Apr 26 '24

Hi, thanks for the feedback. Other users have run into this problem as well; we will prioritize fixing it.

czczup · Apr 26 '24

Looking forward to the fix. Kudos to your project!

baigang666 · Apr 28 '24

@czczup I ran into the same issue. Could you fix this and help us fully experience this perfect project? Appreciated!

hyhzl · Apr 29 '24

You can see an example of multi-GPU inference at https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5, and a similar solution in https://github.com/OpenGVLab/InternVL/issues/96. Part of the code requires modification:

# modeling_internvl_chat.py, line 353
# move the ViT embeddings onto the same device as input_embeds before the assignment
input_embeds[selected] = vit_embeds.reshape(-1, C).to(input_embeds.device)
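
For reference, a hedged sketch of what multi-GPU loading along the lines of that model card example might look like (exact arguments may differ from the card):

import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL-Chat-V1-5'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto'  # let accelerate shard the 26B model across the visible GPUs
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)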

NiYueLiuFeng · Apr 29 '24

The same problem occurs with OpenGVLab/InternVL-Chat-V1-5-Int8 as well.

gaodianzhuo · Apr 29 '24

Same problem with the OpenGVLab/InternVL-Chat-V1-5-Int8 version too. Please fix it.

winca · Apr 30 '24

There is a hack you can use to fix this problem: manually aligning all the devices in the transformers library code. But that requires an ad hoc fix.
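
A minimal sketch of that idea on the caller side; the helper name align_devices and the attribute lookups (model.get_input_embeddings(), model.vision_model) are illustrative assumptions, not the library's API:

import torch

def align_devices(model, input_ids, attention_mask, pixel_values):
    # Send text inputs to the device that holds the token embedding table
    # and image inputs to the device that holds the vision tower, so each
    # sub-module receives tensors on its own device.
    text_device = model.get_input_embeddings().weight.device
    vision_device = next(model.vision_model.parameters()).device
    return (input_ids.to(text_device),
            attention_mask.to(text_device),
            pixel_values.to(vision_device))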

kyleliang919 · May 09 '24

Looking forward to a fix, same problem here!

chengligen · May 10 '24


Many people have encountered this bug, and we haven't yet found a good method to handle all cases. However, there is a workaround that requires manually assigning devices to the model.

For example, deploying this 26B model on two V100 GPUs:

The model has 26B parameters in total, so ideally each card holds 13B. After excluding the 6B ViT, card 0 still has room for about 7B, which means roughly 1/3 of the 20B LLM goes on card 0 and 2/3 on card 1. With 48 transformer layers, that works out to layers 0-15 on card 0 and layers 16-47 on card 1.

In code, it would look like this:

import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL-Chat-V1-5'

device_map = {
    'vision_model': 0,
    'mlp1': 0,
    'language_model.model.tok_embeddings': 0,  # near the first layer of LLM
    'language_model.model.norm': 1,  # near the last layer of LLM
    'language_model.output.weight': 1  # near the last layer of LLM
}
for i in range(16):  # first 16 of the 48 LLM layers on GPU 0
    device_map[f'language_model.model.layers.{i}'] = 0
for i in range(16, 48):  # remaining 32 layers on GPU 1
    device_map[f'language_model.model.layers.{i}'] = 1
print(device_map)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map
).eval()
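
For more than two GPUs, the same split can be generated programmatically. Below is a minimal sketch under the assumptions above (48 LLM layers, with the 6B ViT budgeted as roughly 16 layers' worth of memory on GPU 0); the helper name split_model is illustrative, not part of the library:

import math

def split_model(num_layers=48, num_gpus=2, vit_layer_equivalent=16):
    # Budget GPU 0 as if the ViT already occupied `vit_layer_equivalent` layers,
    # then distribute the real LLM layers over the remaining capacity.
    per_gpu = math.ceil((num_layers + vit_layer_equivalent) / num_gpus)
    device_map = {
        'vision_model': 0,
        'mlp1': 0,
        'language_model.model.tok_embeddings': 0,
        'language_model.model.norm': num_gpus - 1,
        'language_model.output.weight': num_gpus - 1,
    }
    layer = 0
    for gpu in range(num_gpus):
        budget = per_gpu - (vit_layer_equivalent if gpu == 0 else 0)
        for _ in range(budget):
            if layer == num_layers:
                break
            device_map[f'language_model.model.layers.{layer}'] = gpu
            layer += 1
    return device_map

# split_model() with the defaults reproduces the two-GPU map above.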

czczup · May 30 '24