Fabric: Incorrect `num_replicas` (ddp/fsdp) when number of GPUs on each node is different
Bug description
When running multi-node/multi-GPU training with a different number of GPUs on each node, Fabric's `ddp` and `fsdp` strategies end up with an incorrect `num_replicas` in `distributed_sampler_kwargs`: `num_replicas` is currently set to `num_gpus * num_nodes` instead of simply `world_size`.
To reproduce the bug, run Fabric on two nodes, one with two GPUs and the other with only one (the `global_rank` of the process on the second node should be 2). In that case, `num_replicas` will differ between the two nodes: on the node with two GPUs it will be 4, while on the node with one GPU it will be 2.
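For concreteness, here is the arithmetic for this setup as a small sketch (the variable names are only illustrative; the values follow from the current `num_nodes * num_processes` formula and the reported ranks):

```python
# Setup from the description above: 2 GPUs on node 0, 1 GPU on node 1.
num_nodes = 2

# Node 0 launches two processes (global ranks 0 and 1).
node0_num_processes = 2
node0_num_replicas = num_nodes * node0_num_processes    # 4 with the current formula

# Node 1 launches one process (global rank 2).
node1_num_processes = 1
node1_num_replicas = num_nodes * node1_num_processes    # 2 with the current formula

# The actual number of processes participating in training:
world_size = node0_num_processes + node1_num_processes  # 3, what num_replicas should be
```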
Why not simply set `num_replicas` to `world_size`?
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Run Lightning Fabric on three GPUs across two nodes:
```python
from lightning.fabric import Fabric
from torchvision.datasets import MNIST
from torchvision import transforms as tvt
from torch.utils.data import DataLoader

# num_devices is either 2 or 1 depending on the node
num_devices = 2  # set to 1 on the node that has a single GPU
fabric = Fabric(accelerator='cuda', strategy='ddp',
                devices=num_devices, num_nodes=2)
fabric.launch()

# pick the simplest Dataset you want
train_data = DataLoader(MNIST(root='.', train=True, transform=tvt.ToTensor()), batch_size=3, num_workers=2)

# this will fail on the node with the single GPU
train_loader = fabric.setup_dataloaders(train_data)
```
Error messages and logs
```
Traceback (most recent call last):
  ...
  File ".../main_linprobe.py", line 225, in main
    data_loader_train = fabric.setup_dataloaders(data_loader_train)
  File ".../site-packages/lightning/fabric/fabric.py", line 376, in setup_dataloaders
    dataloaders = [
  File ".../site-packages/lightning/fabric/fabric.py", line 377, in <listcomp>
    self._setup_dataloader(
  File ".../site-packages/lightning/fabric/fabric.py", line 404, in _setup_dataloader
    sampler = self._get_distributed_sampler(dataloader, **self._strategy.distributed_sampler_kwargs)
  File ".../site-packages/lightning/fabric/fabric.py", line 1005, in _get_distributed_sampler
    return DistributedSampler(dataloader.dataset, **kwargs)
  File ".../site-packages/torch/utils/data/distributed.py", line 74, in __init__
    raise ValueError(
ValueError: Invalid rank 2, rank should be in the interval [0, 1]
```
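A minimal sketch of why the construction fails (assuming only `torch` is installed): `DistributedSampler` requires `0 <= rank < num_replicas`, and on the single-GPU node Fabric passes `num_replicas=2` together with `rank=2`:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))

# Values Fabric currently computes on the single-GPU node:
#   num_replicas = num_nodes * num_processes = 2 * 1 = 2, rank = global_rank = 2
DistributedSampler(dataset, num_replicas=2, rank=2)
# -> ValueError: Invalid rank 2, rank should be in the interval [0, 1]
```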
Environment
<details>
<summary>Current environment</summary>
* CUDA:
- GPU:
- NVIDIA A40
- NVIDIA A40
- available: True
- version: 12.1
* Lightning:
- lightning: 2.2.2
- lightning-cloud: 0.5.57
- lightning-utilities: 0.10.1
- open-clip-torch: 2.16.2
- pytorch-lightning: 2.1.3
- torch: 2.1.2
- torchaudio: 2.0.0
- torchmetrics: 0.11.4
- torchvision: 0.15.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.10
- release: 5.14.0-162.6.1.el9_1.x86_64
- version: #1 SMP PREEMPT_DYNAMIC Fri Sep 30 07:36:03 EDT 2022
- How you installed Lightning (`conda`, `pip`, source): pip
- Running environment of LightningApp (e.g. local, cloud): on-prem cluster managed by LSF.
</details>
More info
Looking at `ddp.py`, it currently sets `num_replicas` to be:
```python
@property
@override
def distributed_sampler_kwargs(self) -> Dict[str, Any]:
    return {"num_replicas": (self.num_nodes * self.num_processes), "rank": self.global_rank}
```
Why not simply set `num_replicas` to `self.world_size`?
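For illustration, a sketch of the change this issue suggests (not an accepted patch; it assumes the strategy's `world_size` reflects the total number of launched processes):

```python
from typing import Any, Dict
from typing_extensions import override

# Hypothetical replacement for the property above (a sketch of the suggestion,
# not a merged fix): every rank reports the same num_replicas, equal to the
# total number of processes, regardless of how many GPUs each node has.
@property
@override
def distributed_sampler_kwargs(self) -> Dict[str, Any]:
    return {"num_replicas": self.world_size, "rank": self.global_rank}
```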