Running InternVL2.5-78B on Dual A40 GPUs with Accelerate CPU Offload
First of all, I would like to express my gratitude to the team for sharing such an impressive model and making it accessible to the community.
I am trying to run the InternVL2.5-78B model on two NVIDIA A40 GPUs using Hugging Face Accelerate with CPU offloading enabled. However, I am running into issues during setup and execution.
Below are the details:
Setup:
Model: InternVL2.5-78B
GPUs: Dual NVIDIA A40 (48GB VRAM each)
Framework: Hugging Face Transformers 4.45.1
CUDA Version: 11.7
PyTorch Version: 2.0.1
Python Version: 3.10
Steps Performed:

1. Followed the guide in the InternVL documentation to configure the device_map (a sketch of the split_model helper follows the snippet):
```python
import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL2_5-78B'
device_map = split_model('InternVL2_5-78B')

# 8-bit load
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    use_flash_attn=True,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map,
).eval()
```
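For completeness, split_model here is the device-map helper from the InternVL documentation. The sketch below is my paraphrase of what it does (the exact version in the model card may differ in details, and the 80-layer count is an assumption): it spreads the LLM decoder layers over all visible GPUs and pins the vision tower, embeddings, final norm, lm_head and last decoder layer to GPU 0.

```python
import math
import torch

def split_model(model_name):
    """Paraphrase of the InternVL docs helper: spread the LLM decoder layers
    across all visible GPUs, keeping the vision tower, embeddings, final norm,
    lm_head and the last decoder layer together on GPU 0."""
    num_layers = {'InternVL2_5-78B': 80}[model_name]  # assumed layer count
    world_size = torch.cuda.device_count()

    # GPU 0 also hosts the ViT, so it is treated as roughly half a GPU.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    layers_per_gpu = [per_gpu] * world_size
    layers_per_gpu[0] = math.ceil(per_gpu * 0.5)

    device_map, layer_idx = {}, 0
    for gpu, n in enumerate(layers_per_gpu):
        for _ in range(n):
            if layer_idx >= num_layers:
                break
            device_map[f'language_model.model.layers.{layer_idx}'] = gpu
            layer_idx += 1

    # Modules that must stay next to the LLM input/output.
    for name in ('vision_model', 'mlp1',
                 'language_model.model.tok_embeddings',
                 'language_model.model.embed_tokens',
                 'language_model.output',
                 'language_model.model.norm',
                 'language_model.lm_head',
                 f'language_model.model.layers.{num_layers - 1}'):
        device_map[name] = 0
    return device_map
```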
2. Re-assigned specific layers to the CPU using Accelerate's infer_auto_device_map (a sanity check on the resulting map follows the snippet):
```python
from accelerate import infer_auto_device_map

auto_device_map = infer_auto_device_map(
    model,
    no_split_module_classes=model._get_no_split_modules(device_map="auto"),
    max_memory={0: "40GiB", 1: "40GiB", "cpu": "200GiB"},
)  # device name as in
```
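As a quick sanity check on the map produced above (my own snippet, not from any guide), I group the entries by decoder layer to see whether any single layer ends up assigned to more than one device, since a layer whose weights straddle two devices is a likely source of the layer_norm error reported below:

```python
import re
from collections import defaultdict

# Group the inferred map by decoder layer (the unit that should live on one
# device) and report layers whose submodules were assigned to several devices.
per_layer = defaultdict(set)
for name, device in auto_device_map.items():
    match = re.match(r'(language_model\.model\.layers\.\d+)', name)
    key = match.group(1) if match else name
    per_layer[key].add(device)

for key, devices in per_layer.items():
    if len(devices) > 1:
        print(f'{key} is split across: {sorted(map(str, devices))}')
```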
3. Dispatched the model using the modified device_map:
```python
from accelerate import dispatch_model

model = dispatch_model(
    model,
    device_map=auto_device_map,
    offload_dir="/tmp/offload",
)
```
4. Encountered the following error when running inference:
```
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)
```
Additional Observations:
After executing dispatch_model, I checked the devices for the model layers and found that some layer weights were spread across two devices (e.g., cuda:0 and cuda:1). Could this be the cause of the observed error?
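The check I ran looked roughly like the following (reconstructed from memory, so details may differ):

```python
from collections import Counter

# Where did the parameters actually land after dispatch_model?
print(Counter(str(p.device) for p in model.parameters()))

# The failing call is torch.layer_norm, so also print the device of every
# parameter belonging to a norm module.
for name, param in model.named_parameters():
    if 'norm' in name:
        print(name, '->', param.device)
```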
Questions:
- If the split placement described above is not the root cause of the issue, is there a recommended method or configuration for running the 78B model on two A40 GPUs?
- Are there additional configurations or steps required to ensure tensors end up on the correct devices?
- Could this issue be specific to using load_in_8bit=True together with CPU offloading in a multi-GPU setup? (A configuration I am considering for this is sketched below.)
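Regarding the last question: my reading of the Transformers/bitsandbytes documentation is that combining load_in_8bit=True with a device_map that contains CPU entries needs an explicit opt-in via llm_int8_enable_fp32_cpu_offload (the offloaded modules are then kept in fp32 on the CPU). This is what I plan to try next, but I have not verified it yet:

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# 8-bit weights on the GPUs; modules mapped to "cpu" stay in fp32.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL2_5-78B',
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map,  # map containing both GPU and "cpu" entries
).eval()
```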
Additional Context: If required, I can provide the full code or additional logs. Thank you for your assistance in resolving this issue!
Hello, I have the same problem. Have you solved it?
Same problem on two H20 96GB GPUs.
Hello, I encountered the same problem on an H100 80GB.