pytorch-lightning
Multi-node Training with DDP stuck at "Initialize distributed..." on SLURM cluster
Bug description
I'm working on a SLURM cluster with 8 AMD MI100 GPUs spread across 2 nodes (4 GPUs per node). I followed the instructions (https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html) to submit a multi-node training job, but the job gets stuck at "Initializing distributed: ...". I checked all related issues and none of them solved the problem.
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Training Script:
import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=1, num_nodes=2, devices=4, strategy="ddp")
trainer.fit(model=autoencoder, train_dataloaders=train_loader)
SLURM batch script:
#!/bin/bash
#SBATCH -p mi1004x
#SBATCH --nodes=2 # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4 # This needs to match Trainer(devices=...)
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err
source ~/miniconda3/bin/activate pylight
# run script from above
srun python train.py
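For debugging the hang, a variant of the batch script that pins the rendezvous endpoint and turns on verbose collective-library logging can show where the initialization stops. This is only a sketch, not part of the original report: the explicit GPU request, the port number, and the commented-out network interface are assumptions that have to be adapted to the cluster.

#!/bin/bash
#SBATCH -p mi1004x
#SBATCH --nodes=2                 # must match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4       # must match Trainer(devices=...)
#SBATCH --gres=gpu:4              # assumption: request the 4 GPUs per node explicitly
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err

source ~/miniconda3/bin/activate pylight

# pin the rendezvous endpoint (Lightning can also derive these from SLURM itself)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500          # assumption: any free port on the first node

# verbose NCCL/RCCL logging shows whether the ranks ever reach each other
export NCCL_DEBUG=INFO
# export NCCL_SOCKET_IFNAME=<inter-node interface>  # assumption: set this if the nodes have multiple NICs

srun python train.py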
Error messages and logs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('AMD Instinct MI100') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
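Because the job stops exactly at this step, a minimal torch.distributed script launched with the same srun command can tell whether the rendezvous itself completes outside Lightning. This is a sketch: it assumes MASTER_ADDR and MASTER_PORT are exported in the batch script (as in the variant above), and the 120-second timeout is arbitrary.

# rendezvous sanity check, independent of Lightning; run with: srun python check_dist.py
import datetime
import os

import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank assigned by srun
world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks (2 nodes x 4)
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within the node

torch.cuda.set_device(local_rank)              # torch.cuda maps to HIP on ROCm builds
dist.init_process_group(
    backend="nccl",                            # routed to RCCL on ROCm
    init_method="env://",                      # reads MASTER_ADDR / MASTER_PORT
    rank=rank,
    world_size=world_size,
    timeout=datetime.timedelta(seconds=120),   # bound the rendezvous wait instead of hanging forever
)
dist.barrier()
print(f"rank {rank}/{world_size} rendezvous OK on {os.uname().nodename}")
dist.destroy_process_group()

If this already hangs, the problem is in the SLURM/network setup rather than in Lightning.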
Environment
Current environment
- CUDA:
- GPU:
- AMD Instinct MI100
- AMD Instinct MI100
- AMD Instinct MI100
- AMD Instinct MI100
- available: True
- version: None
- Lightning:
- lightning: 2.2.1
- lightning-utilities: 0.11.2
- pytorch-lightning: 2.2.1
- pytorch-triton-rocm: 2.2.0
- torch: 2.2.0+rocm5.6
- torchaudio: 2.2.0+rocm5.6
- torchmetrics: 1.3.2
- torchvision: 0.17.0+rocm5.6
- Packages:
- absl-py: 2.1.0
- aiohttp: 3.9.3
- aiosignal: 1.3.1
- annotated-types: 0.6.0
- async-timeout: 4.0.3
- attrs: 23.2.0
- certifi: 2022.12.7
- charset-normalizer: 2.1.1
- deepspeed: 0.14.0
- filelock: 3.9.0
- frozenlist: 1.4.1
- fsspec: 2023.4.0
- future: 1.0.0
- grpcio: 1.62.1
- hjson: 3.1.0
- idna: 3.4
- imageio: 2.34.0
- jinja2: 3.1.2
- lightning: 2.2.1
- lightning-utilities: 0.11.2
- markdown: 3.6
- markupsafe: 2.1.3
- mpmath: 1.3.0
- multidict: 6.0.5
- networkx: 3.2.1
- ninja: 1.11.1.1
- numpy: 1.26.3
- packaging: 24.0
- pandas: 2.2.1
- pillow: 10.2.0
- pip: 23.3.1
- protobuf: 5.26.1
- psutil: 5.9.8
- py-cpuinfo: 9.0.0
- pydantic: 2.7.0
- pydantic-core: 2.18.1
- pynvml: 11.5.0
- python-dateutil: 2.9.0.post0
- pytorch-lightning: 2.2.1
- pytorch-triton-rocm: 2.2.0
- pytz: 2024.1
- pyyaml: 6.0.1
- requests: 2.28.1
- setuptools: 68.2.2
- six: 1.16.0
- sympy: 1.12
- tensorboard: 2.16.2
- tensorboard-data-server: 0.7.2
- test-tube: 0.7.5
- torch: 2.2.0+rocm5.6
- torchaudio: 2.2.0+rocm5.6
- torchmetrics: 1.3.2
- torchvision: 0.17.0+rocm5.6
- tqdm: 4.66.2
- typing-extensions: 4.8.0
- tzdata: 2024.1
- urllib3: 1.26.13
- werkzeug: 3.0.1
- wheel: 0.41.2
- yarl: 1.9.4
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.14
- release: 5.14.0-162.18.1.el9_1.x86_64
- version: SMP PREEMPT_DYNAMIC Wed Mar 1 22:02:24 UTC 2023
More info
No response
Try using "srun python3 train.py" (python → python3).
I tried python3, but the issue still remains.
I have the same issue. It works fine when launched directly with srun, but it hangs when submitted as a job with sbatch.
That is a real bottleneck, especially if your environment only allows sbatch submissions and not interactive srun.
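One thing worth checking in the sbatch-only case: Lightning's SLURM detection relies on the per-task variables that srun exports (SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS, ...), so running python train.py in the batch step without srun can leave Lightning expecting ranks that were never started, and the single process waits forever. A small probe, launched the same way as train.py, shows what each task actually sees; this is a sketch, and the SLURMEnvironment import path reflects the lightning 2.x layout.

# probe what each task sees; launch it exactly like train.py (with and without srun)
import os

from lightning.fabric.plugins.environments import SLURMEnvironment

keys = ["SLURM_JOB_ID", "SLURM_NNODES", "SLURM_NTASKS",
        "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NODEID"]
print({k: os.environ.get(k) for k in keys})
# True only if Lightning will select its SLURM cluster environment
print("SLURMEnvironment detected:", SLURMEnvironment.detect())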