dispatch_model with inferred device map fails multi-gpu execution with native torch function
System Info
I'm using the accelerate Python API on a machine with 4 T4 GPUs.
- `Accelerate` version: 0.19.0
- Platform: Linux-5.19.0-1024-aws-x86_64-with-glibc2.35
- Python version: 3.10.6
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.1.0.dev20230508+cu118 (True)
- System RAM: 186.61 GB
- GPU type: Tesla T4
- `Accelerate` default config:
Not found
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
Applying an accelerate-generated device_map to a model causes a device-mismatch error when two submodules are scheduled on different GPUs and a native torch function combines their outputs in the forward call.
```python
from accelerate.utils.modeling import infer_auto_device_map
from accelerate import dispatch_model
import torch
from torch import nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.lin1 = nn.Linear(100, 100)
        self.lin2 = nn.Linear(100, 100)

    def forward(self, x):
        y = self.lin1(x)
        z = self.lin2(x)
        # y and z can end up on different GPUs after dispatch_model
        return torch.cat((y, z), 0)


net = Net()
max_memory = {0: 50000, 1: 50000, 2: 50000, 'cpu': 100000}
device_map = infer_auto_device_map(net, max_memory)
print("device map", device_map)
net = dispatch_model(net, device_map)
res = net(torch.randn(100, 100))
```
produces the error:

```
File ... line 15, in forward
    return torch.cat((y, z), 0)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:2! (when checking argument for argument tensors in method wrapper_CUDA_cat)
```
Expected behavior
The script executes without error.
What's the recommended pattern to resolve this? I would prefer not to modify the torch module. This issue is relevant for running larger models like Vicuna and LLaMA on machines with multiple devices.
cc @sgugger
I'm not sure I understand why you think this is a bug. Your layer needs all its weights on the same device for the cat operation, so it should be passed along in `no_split_module_classes` when using `infer_auto_device_map` (like we do in Transformers for all attention blocks).
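For illustration, a rough sketch of that pattern (the `Block` wrapper and the memory limits below are made up for this example, not taken from the original report): group the layers whose outputs are combined by `torch.cat` into one submodule and list that submodule's class name in `no_split_module_classes`, so the whole submodule is always assigned to a single device.

```python
import torch
from torch import nn

from accelerate import dispatch_model
from accelerate.utils.modeling import infer_auto_device_map


class Block(nn.Module):
    """Hypothetical wrapper: everything feeding torch.cat lives in one module."""

    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(100, 100)
        self.lin2 = nn.Linear(100, 100)

    def forward(self, x):
        # Both inputs to torch.cat are produced inside this module, so keeping
        # the module on a single device keeps them on a single device.
        return torch.cat((self.lin1(x), self.lin2(x)), 0)


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = Block()
        self.block2 = Block()

    def forward(self, x):
        return self.block2(self.block1(x))


net = Net()
# Never split a Block across devices; different Blocks may still land on
# different devices, which is fine because accelerate's hooks move the
# activations between them. The byte limits below are illustrative only.
device_map = infer_auto_device_map(
    net,
    max_memory={0: 200_000, 1: 200_000, "cpu": 400_000},
    no_split_module_classes=["Block"],
)
net = dispatch_model(net, device_map)
res = net(torch.randn(100, 100))
```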
Yeah, this is more a feature request than a bug -- it would be really nice if I didn't have to manually annotate all the `no_split_module_classes` for a model I'm unfamiliar with.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
> Your layer needs all weights to be on the same device for the cat operation

How do you determine which operators need all of their inputs on the same device, such as `torch.cat()`?
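There is no built-in list for this; as a rough rule of thumb (my own sketch, not an accelerate utility), any operator that consumes several tensors at once (`cat`, `stack`, elementwise adds, matmuls, attention math, ...) needs them co-located, which you can check directly:

```python
import torch

# Minimal check on a machine with at least two GPUs: torch.cat refuses
# inputs that live on different devices, which is exactly the failure mode
# reported above.
if torch.cuda.device_count() >= 2:
    a = torch.randn(4, device="cuda:0")
    b = torch.randn(4, device="cuda:1")
    try:
        torch.cat((a, b))  # raises the RuntimeError from the traceback above
    except RuntimeError as err:
        print(err)
    print(torch.cat((a, b.to("cuda:0"))))  # fine once both inputs share a device
```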