Running InternVL2.5-78B on Dual A40 GPUs with Accelerate CPU Offload
First of all, I would like to express my gratitude to the team for sharing such an impressive model and making it accessible to the community.
I am trying to run the InternVL2.5-78B model on two NVIDIA A40 GPUs using Hugging Face Accelerate with CPU offloading enabled. However, I am running into issues during setup and execution.
Below are the details:
Setup:
Model: InternVL2.5-78B
GPUs: Dual NVIDIA A40 (48GB VRAM each)
Framework: Hugging Face Transformers 4.45.1
CUDA Version: 11.7
PyTorch Version: 2.0.1
Python Version: 3.10
Steps Performed:

1. Followed the guide in the InternVL documentation to configure the device_map (a sketch of the split_model helper follows the snippet):
```python
import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL2_5-78B'
device_map = split_model('InternVL2_5-78B')

# 8-bit load
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    use_flash_attn=True,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map,
).eval()
```
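For completeness, split_model here is the device-map helper from the InternVL documentation. The sketch below is my paraphrase of what it does (the exact version in the model card may differ in details, and the 80-layer count is an assumption): it spreads the LLM decoder layers over all visible GPUs and pins the vision tower, embeddings, final norm, lm_head and last decoder layer to GPU 0.

```python
import math
import torch

def split_model(model_name):
    """Paraphrase of the InternVL docs helper: spread the LLM decoder layers
    across all visible GPUs, keeping the vision tower, embeddings, final norm,
    lm_head and the last decoder layer together on GPU 0."""
    num_layers = {'InternVL2_5-78B': 80}[model_name]  # assumed layer count
    world_size = torch.cuda.device_count()

    # GPU 0 also hosts the ViT, so it is treated as roughly half a GPU.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    layers_per_gpu = [per_gpu] * world_size
    layers_per_gpu[0] = math.ceil(per_gpu * 0.5)

    device_map, layer_idx = {}, 0
    for gpu, n in enumerate(layers_per_gpu):
        for _ in range(n):
            if layer_idx >= num_layers:
                break
            device_map[f'language_model.model.layers.{layer_idx}'] = gpu
            layer_idx += 1

    # Modules that must stay next to the LLM input/output.
    for name in ('vision_model', 'mlp1',
                 'language_model.model.tok_embeddings',
                 'language_model.model.embed_tokens',
                 'language_model.output',
                 'language_model.model.norm',
                 'language_model.lm_head',
                 f'language_model.model.layers.{num_layers - 1}'):
        device_map[name] = 0
    return device_map
```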
2. Re-assigned specific layers to the CPU using Accelerate's infer_auto_device_map (a sanity check on the resulting map follows the snippet):
```python
from accelerate import infer_auto_device_map

auto_device_map = infer_auto_device_map(
    model,
    no_split_module_classes=model._get_no_split_modules(device_map="auto"),
    max_memory={0: "40GiB", 1: "40GiB", "cpu": "200GiB"},
)  # device name as in
```
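As a quick sanity check on the map produced above (my own snippet, not from any guide), I group the entries by decoder layer to see whether any single layer ends up assigned to more than one device, since a layer whose weights straddle two devices is a likely source of the layer_norm error reported below:

```python
import re
from collections import defaultdict

# Group the inferred map by decoder layer (the unit that should live on one
# device) and report layers whose submodules were assigned to several devices.
per_layer = defaultdict(set)
for name, device in auto_device_map.items():
    match = re.match(r'(language_model\.model\.layers\.\d+)', name)
    key = match.group(1) if match else name
    per_layer[key].add(device)

for key, devices in per_layer.items():
    if len(devices) > 1:
        print(f'{key} is split across: {sorted(map(str, devices))}')
```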
3. Dispatched the model using the modified device_map:
```python
from accelerate import dispatch_model

model = dispatch_model(
    model,
    device_map=auto_device_map,
    offload_dir="/tmp/offload",
)
```
4. Encountered the following error when running inference:
```
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)
```
Additional Observations:
After executing dispatch_model, I checked the devices for the model layers and found that some layer weights were spread across two devices (e.g., cuda:0 and cuda:1). Could this be the cause of the observed error?
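The check I ran looked roughly like the following (reconstructed from memory, so details may differ):

```python
from collections import Counter

# Where did the parameters actually land after dispatch_model?
print(Counter(str(p.device) for p in model.parameters()))

# The failing call is torch.layer_norm, so also print the device of every
# parameter belonging to a norm module.
for name, param in model.named_parameters():
    if 'norm' in name:
        print(name, '->', param.device)
```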
Questions:
- If the split placement described above is not the root cause of the issue, is there a recommended method or configuration for running the 78B model on two A40 GPUs?
- Are there additional configurations or steps required to ensure tensors end up on the correct devices?
- Could this issue be specific to using load_in_8bit=True together with CPU offloading in a multi-GPU setup? (A configuration I am considering for this is sketched below.)
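Regarding the last question: my reading of the Transformers/bitsandbytes documentation is that combining load_in_8bit=True with a device_map that contains CPU entries needs an explicit opt-in via llm_int8_enable_fp32_cpu_offload (the offloaded modules are then kept in fp32 on the CPU). This is what I plan to try next, but I have not verified it yet:

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# 8-bit weights on the GPUs; modules mapped to "cpu" stay in fp32.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL2_5-78B',
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map,  # map containing both GPU and "cpu" entries
).eval()
```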
Additional Context: If required, I can provide the full code or additional logs. Thank you for your assistance in resolving this issue!
Hello, I have the same problem. Have you solved it?
Same problem on two H20 96GB GPUs.
Hello, I encountered the same problem on an H100 80GB.