dispatch_model with inferred device map fails multi-gpu execution with native torch function
System Info
I'm using the accelerate Python API on a machine with 4 T4 GPUs.
- `Accelerate` version: 0.19.0
- Platform: Linux-5.19.0-1024-aws-x86_64-with-glibc2.35
- Python version: 3.10.6
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.1.0.dev20230508+cu118 (True)
- System RAM: 186.61 GB
- GPU type: Tesla T4
- `Accelerate` default config:
Not found
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
Applying an accelerate-generated device_map to a model causes a device-mismatch error when two submodules are scheduled on different GPUs and a native torch function combines their outputs in the forward call.
```python
from accelerate.utils.modeling import infer_auto_device_map
from accelerate import dispatch_model
import torch
from torch import nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.lin1 = nn.Linear(100, 100)
        self.lin2 = nn.Linear(100, 100)

    def forward(self, x):
        y = self.lin1(x)
        z = self.lin2(x)
        # y and z can end up on different GPUs after dispatch_model
        return torch.cat((y, z), 0)


net = Net()
max_memory = {0: 50000, 1: 50000, 2: 50000, 'cpu': 100000}
device_map = infer_auto_device_map(net, max_memory)
print("device map", device_map)
net = dispatch_model(net, device_map)
res = net(torch.randn(100, 100))
```
produces the error:

```
File ... line 15, in forward
    return torch.cat((y, z), 0)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:2! (when checking argument for argument tensors in method wrapper_CUDA_cat)
```
Expected behavior
The script executes without error.
What's the recommended pattern to resolve this? I would prefer not to modify the torch module. This issue is relevant for running larger models like Vicuna and LLaMA on machines with multiple devices.
cc @sgugger
I'm not sure I understand why you think this is a bug. Your layer needs all its weights on the same device for the cat operation, so it should be passed along in `no_split_module_classes` when using `infer_auto_device_map` (like we do in Transformers for all attention blocks).
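For illustration, a rough sketch of that pattern (the `Block` wrapper and the memory limits below are made up for this example, not taken from the original report): group the layers whose outputs are combined by `torch.cat` into one submodule and list that submodule's class name in `no_split_module_classes`, so the whole submodule is always assigned to a single device.

```python
import torch
from torch import nn

from accelerate import dispatch_model
from accelerate.utils.modeling import infer_auto_device_map


class Block(nn.Module):
    """Hypothetical wrapper: everything feeding torch.cat lives in one module."""

    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(100, 100)
        self.lin2 = nn.Linear(100, 100)

    def forward(self, x):
        # Both inputs to torch.cat are produced inside this module, so keeping
        # the module on a single device keeps them on a single device.
        return torch.cat((self.lin1(x), self.lin2(x)), 0)


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = Block()
        self.block2 = Block()

    def forward(self, x):
        return self.block2(self.block1(x))


net = Net()
# Never split a Block across devices; different Blocks may still land on
# different devices, which is fine because accelerate's hooks move the
# activations between them. The byte limits below are illustrative only.
device_map = infer_auto_device_map(
    net,
    max_memory={0: 200_000, 1: 200_000, "cpu": 400_000},
    no_split_module_classes=["Block"],
)
net = dispatch_model(net, device_map)
res = net(torch.randn(100, 100))
```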
Yeah, this is more a feature request than a bug -- it would be really nice if I didn't have to manually annotate all the `no_split_module_classes` for a model I'm unfamiliar with.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
> Your layer needs all weights to be on the same device for the cat operation

How do you determine which operators need all of their inputs on the same device, such as `torch.cat()`?
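There is no built-in list for this; as a rough rule of thumb (my own sketch, not an accelerate utility), any operator that consumes several tensors at once (`cat`, `stack`, elementwise adds, matmuls, attention math, ...) needs them co-located, which you can check directly:

```python
import torch

# Minimal check on a machine with at least two GPUs: torch.cat refuses
# inputs that live on different devices, which is exactly the failure mode
# reported above.
if torch.cuda.device_count() >= 2:
    a = torch.randn(4, device="cuda:0")
    b = torch.randn(4, device="cuda:1")
    try:
        torch.cat((a, b))  # raises the RuntimeError from the traceback above
    except RuntimeError as err:
        print(err)
    print(torch.cat((a, b.to("cuda:0"))))  # fine once both inputs share a device
```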