
Fabric: Incorrect `num_replicas` (ddp/fsdp) when number of GPUs on each node is different

Open shaibagon opened this issue 9 months ago • 2 comments

Bug description

When running multi-node/multi-GPU training with a different number of GPUs on each node, Fabric's ddp and fsdp strategies produce an incorrect num_replicas in distributed_sampler_kwargs: currently num_replicas is set to num_gpus * num_nodes instead of simply the world_size.

To reproduce the bug, run Fabric on two nodes, one with 2 GPUs and another with only one (the global_rank on the second node should be 2). In that case, num_replicas will differ between the two nodes: on the node with two GPUs it will be 4, while on the node with one GPU it will be 2.

Why not set num_replicas to simply be world_size?
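
A minimal sketch of what the single-GPU node ends up doing in the two-nodes/three-GPUs setup above (the concrete numbers are assumptions derived from that setup, not values printed by the library): it computes num_replicas = num_nodes * num_processes = 2 * 1 = 2, while its global rank is 2, which is exactly the combination DistributedSampler rejects.

from torch.utils.data import DistributedSampler
from torchvision.datasets import MNIST
from torchvision import transforms as tvt

dataset = MNIST(root='.', train=True, download=True, transform=tvt.ToTensor())

# Values computed on the single-GPU node: num_nodes * num_processes = 2 * 1 = 2,
# but ranks 0 and 1 live on the other node, so this process has global rank 2.
sampler = DistributedSampler(dataset, num_replicas=2, rank=2)
# -> ValueError: Invalid rank 2, rank should be in the interval [0, 1]

# With num_replicas equal to the world size (3), the same rank is accepted:
sampler = DistributedSampler(dataset, num_replicas=3, rank=2)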

What version are you seeing the problem on?

v2.2

How to reproduce the bug

Run Lightning Fabric on three GPUs spread across two nodes (two GPUs on one node, one GPU on the other):

import torch
from lightning.fabric import Fabric
from torchvision.datasets import MNIST
from torchvision import transforms as tvt
from torch.utils.data import DataLoader

# num_devices is either 2 or 1 depending on the node (here taken from the local GPU count)
num_devices = torch.cuda.device_count()
fabric = Fabric(accelerator='cuda', strategy='ddp',
                devices=num_devices, num_nodes=2)
fabric.launch()

# pick the simplest Dataset you want
train_data = DataLoader(MNIST(root='.', train=True, download=True, transform=tvt.ToTensor()),
                        batch_size=3, num_workers=2)

# this will fail on the node with the single GPU
train_loader = fabric.setup_dataloaders(train_data)
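
As a possible workaround (a sketch, assuming that passing use_distributed_sampler=False to setup_dataloaders and building the sampler by hand sidesteps the issue; not an official fix), the sampler can be constructed from Fabric's own world_size and global_rank:

from torch.utils.data import DistributedSampler

dataset = MNIST(root='.', train=True, transform=tvt.ToTensor())
# Build the sampler from the true world size and global rank reported by Fabric.
sampler = DistributedSampler(dataset, num_replicas=fabric.world_size, rank=fabric.global_rank)
train_data = DataLoader(dataset, batch_size=3, num_workers=2, sampler=sampler)
# Ask Fabric not to replace the sampler with its own.
train_loader = fabric.setup_dataloaders(train_data, use_distributed_sampler=False)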

Error messages and logs

Traceback (most recent call last):
...
  File ".../main_linprobe.py", line 225, in main
    data_loader_train = fabric.setup_dataloaders(data_loader_train)
  File ".../site-packages/lightning/fabric/fabric.py", line 376, in setup_dataloaders
     dataloaders = [
  File ".../site-packages/lightning/fabric/fabric.py", line 377, in <listcomp>
    self._setup_dataloader(
  File ".../site-packages/lightning/fabric/fabric.py", line 404, in _setup_dataloader
    sampler = self._get_distributed_sampler(dataloader, **self._strategy.distributed_sampler_kwargs)
  File ".../site-packages/lightning/fabric/fabric.py", line 1005, in _get_distributed_sampler
    return DistributedSampler(dataloader.dataset, **kwargs)
  File ".../site-packages/torch/utils/data/distributed.py", line 74, in __init__
    raise ValueError(
ValueError: Invalid rank 2, rank should be in the interval [0, 1]

Environment

<details>
  <summary>Current environment</summary>
* CUDA:
        - GPU:
                - NVIDIA A40
                - NVIDIA A40
        - available:         True
        - version:           12.1
* Lightning:
        - lightning:         2.2.2
        - lightning-cloud:   0.5.57
        - lightning-utilities: 0.10.1
        - open-clip-torch:   2.16.2
        - pytorch-lightning: 2.1.3
        - torch:             2.1.2
        - torchaudio:        2.0.0
        - torchmetrics:      0.11.4
        - torchvision:       0.15.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.10.10
        - release:           5.14.0-162.6.1.el9_1.x86_64
        - version:           #1 SMP PREEMPT_DYNAMIC Fri Sep 30 07:36:03 EDT 2022

- How you installed Lightning (`conda`, `pip`, source): pip
- Running environment (e.g. local, cloud): on-prem cluster managed by LSF

</details>

More info

Looking at ddp.py, distributed_sampler_kwargs currently sets num_replicas as follows:

    @property
    @override
    def distributed_sampler_kwargs(self) -> Dict[str, Any]:
        return {"num_replicas": (self.num_nodes * self.num_processes), "rank": self.global_rank}

Why not simply set num_replicas to be self.world_size?
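
A minimal sketch of the suggested change, assuming the strategy exposes a world_size property (the rank handling is unchanged):

    @property
    @override
    def distributed_sampler_kwargs(self) -> Dict[str, Any]:
        # Use the actual world size rather than num_nodes * num_processes,
        # which only equals it when every node has the same number of GPUs.
        return {"num_replicas": self.world_size, "rank": self.global_rank}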

related

shaibagon · May 23 '24 06:05