pytorch-lightning
Multi-node Training with DDP stuck at "Initialize distributed..." on SLURM cluster
Bug description
I'm working on a SLURM cluster with 8 AMD MI100 GPUs spread across 2 nodes (4 GPUs per node). I followed the instructions (https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html) to submit a multi-node training job, but the job gets stuck at "Initializing distributed: ...". I checked all related issues and none of them solved the problem.
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Training Script:
import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=1, num_nodes=2, devices=4, strategy="ddp")
trainer.fit(model=autoencoder, train_dataloaders=train_loader)
SLURM batch script:
#!/bin/bash
#SBATCH -p mi1004x
#SBATCH --nodes=2 # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4 # This needs to match Trainer(devices=...)
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err
source ~/miniconda3/bin/activate pylight
# run script from above
srun python train.py
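For debugging the hang, a variant of the batch script that pins the rendezvous endpoint and turns on verbose collective-library logging can show where the initialization stops. This is only a sketch, not part of the original report: the explicit GPU request, the port number, and the commented-out network interface are assumptions that have to be adapted to the cluster.

#!/bin/bash
#SBATCH -p mi1004x
#SBATCH --nodes=2                 # must match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4       # must match Trainer(devices=...)
#SBATCH --gres=gpu:4              # assumption: request the 4 GPUs per node explicitly
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err

source ~/miniconda3/bin/activate pylight

# pin the rendezvous endpoint (Lightning can also derive these from SLURM itself)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500          # assumption: any free port on the first node

# verbose NCCL/RCCL logging shows whether the ranks ever reach each other
export NCCL_DEBUG=INFO
# export NCCL_SOCKET_IFNAME=<inter-node interface>  # assumption: set this if the nodes have multiple NICs

srun python train.py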
Error messages and logs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('AMD Instinct MI100') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
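Because the job stops exactly at this step, a minimal torch.distributed script launched with the same srun command can tell whether the rendezvous itself completes outside Lightning. This is a sketch: it assumes MASTER_ADDR and MASTER_PORT are exported in the batch script (as in the variant above), and the 120-second timeout is arbitrary.

# rendezvous sanity check, independent of Lightning; run with: srun python check_dist.py
import datetime
import os

import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank assigned by srun
world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks (2 nodes x 4)
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within the node

torch.cuda.set_device(local_rank)              # torch.cuda maps to HIP on ROCm builds
dist.init_process_group(
    backend="nccl",                            # routed to RCCL on ROCm
    init_method="env://",                      # reads MASTER_ADDR / MASTER_PORT
    rank=rank,
    world_size=world_size,
    timeout=datetime.timedelta(seconds=120),   # bound the rendezvous wait instead of hanging forever
)
dist.barrier()
print(f"rank {rank}/{world_size} rendezvous OK on {os.uname().nodename}")
dist.destroy_process_group()

If this already hangs, the problem is in the SLURM/network setup rather than in Lightning.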
Environment
Current environment
- CUDA:
- GPU:
- AMD Instinct MI100
- AMD Instinct MI100
- AMD Instinct MI100
- AMD Instinct MI100
- available: True
- version: None
- Lightning:
- lightning: 2.2.1
- lightning-utilities: 0.11.2
- pytorch-lightning: 2.2.1
- pytorch-triton-rocm: 2.2.0
- torch: 2.2.0+rocm5.6
- torchaudio: 2.2.0+rocm5.6
- torchmetrics: 1.3.2
- torchvision: 0.17.0+rocm5.6
- Packages:
- absl-py: 2.1.0
- aiohttp: 3.9.3
- aiosignal: 1.3.1
- annotated-types: 0.6.0
- async-timeout: 4.0.3
- attrs: 23.2.0
- certifi: 2022.12.7
- charset-normalizer: 2.1.1
- deepspeed: 0.14.0
- filelock: 3.9.0
- frozenlist: 1.4.1
- fsspec: 2023.4.0
- future: 1.0.0
- grpcio: 1.62.1
- hjson: 3.1.0
- idna: 3.4
- imageio: 2.34.0
- jinja2: 3.1.2
- lightning: 2.2.1
- lightning-utilities: 0.11.2
- markdown: 3.6
- markupsafe: 2.1.3
- mpmath: 1.3.0
- multidict: 6.0.5
- networkx: 3.2.1
- ninja: 1.11.1.1
- numpy: 1.26.3
- packaging: 24.0
- pandas: 2.2.1
- pillow: 10.2.0
- pip: 23.3.1
- protobuf: 5.26.1
- psutil: 5.9.8
- py-cpuinfo: 9.0.0
- pydantic: 2.7.0
- pydantic-core: 2.18.1
- pynvml: 11.5.0
- python-dateutil: 2.9.0.post0
- pytorch-lightning: 2.2.1
- pytorch-triton-rocm: 2.2.0
- pytz: 2024.1
- pyyaml: 6.0.1
- requests: 2.28.1
- setuptools: 68.2.2
- six: 1.16.0
- sympy: 1.12
- tensorboard: 2.16.2
- tensorboard-data-server: 0.7.2
- test-tube: 0.7.5
- torch: 2.2.0+rocm5.6
- torchaudio: 2.2.0+rocm5.6
- torchmetrics: 1.3.2
- torchvision: 0.17.0+rocm5.6
- tqdm: 4.66.2
- typing-extensions: 4.8.0
- tzdata: 2024.1
- urllib3: 1.26.13
- werkzeug: 3.0.1
- wheel: 0.41.2
- yarl: 1.9.4
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.14
- release: 5.14.0-162.18.1.el9_1.x86_64
- version: SMP PREEMPT_DYNAMIC Wed Mar 1 22:02:24 UTC 2023
More info
No response
Try using "srun python3 train.py" (python → python3).
I tried python3, but the issue still remains.
I have the same issue. It works fine when launched directly with srun, but it hangs when submitted as a job with sbatch.
That is a real bottleneck, especially if your environment only allows sbatch submissions and not interactive srun.
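One thing worth checking in the sbatch-only case: Lightning's SLURM detection relies on the per-task variables that srun exports (SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS, ...), so running python train.py in the batch step without srun can leave Lightning expecting ranks that were never started, and the single process waits forever. A small probe, launched the same way as train.py, shows what each task actually sees; this is a sketch, and the SLURMEnvironment import path reflects the lightning 2.x layout.

# probe what each task sees; launch it exactly like train.py (with and without srun)
import os

from lightning.fabric.plugins.environments import SLURMEnvironment

keys = ["SLURM_JOB_ID", "SLURM_NNODES", "SLURM_NTASKS",
        "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NODEID"]
print({k: os.environ.get(k) for k in keys})
# True only if Lightning will select its SLURM cluster environment
print("SLURMEnvironment detected:", SLURMEnvironment.detect())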