stuck at "Initializing distributed.." when using ddp with multiple gpus
First check
- [x] I'm sure this is a bug.
- [x] I've added a descriptive title to this bug.
- [x] I've provided clear instructions on how to reproduce the bug.
- [x] I've added a code sample.
- [x] I've provided any other important info that is required.
Bug description
Dear community,
I'm desperately trying to get multi-GPU training working on our scientific SLURM cluster. It has one GPU (Tesla T4) per node, so specifically I want to achieve multi-node multi-GPU training. For testing I just use a minimal example from the PL documentation (https://pytorch-lightning.readthedocs.io/en/latest/starter/introduction.html) to make sure the error does not come from my own model. Even when allocating only 1 GPU the code hangs. When using "dp" instead of "ddp" it runs (though not really faster).
How to reproduce the bug
MODEL:
import os

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=4)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=4)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=4)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        accelerator="gpu",
        devices=1,
        strategy="ddp",
        num_nodes=4,
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()
SLURM SUBMISSION SCRIPT:
#!/bin/bash
#SBATCH -J test
#SBATCH -e test.err
#SBATCH -o test.log
#SBATCH --tasks-per-node 8
#SBATCH --gres gpu:1
#SBATCH --nodes 4
SECONDS=0
python train.py
echo "$SECONDS seconds passed"
Error messages and logs
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1)` was configured so 1 batch per epoch will be used.
`Trainer(limit_val_batches=1)` was configured so 1 batch will be used.
`Trainer(limit_test_batches=1)` was configured so 1 batch will be used.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Important info
cuda 11.2
torch 1.12.1
pytorch-lightning 1.7.6
More info
I tried out various things from many threads but still can't even get this minimal example to work :(
This is my first bug report ever, please don't hate and thanks a lot in advance!
Can you try launching this script instead? It's simpler and it's the one we use for bug reports: https://github.com/Lightning-AI/lightning/blob/master/examples/pl_bug_report/bug_report_model.py
Also, os.environ['CUDA_LAUNCH_BLOCKING'] = '1' is for debugging and will make your run considerably slower
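For context, that refers to a line of roughly this form in the training script (a sketch; it is only a debugging aid and should be dropped, or set to "0", for real runs):

import os

# Debugging aid only: forces every CUDA kernel launch to run synchronously so
# errors surface at the exact call site. Remove it (or set it to "0") for
# normal training runs, since it slows everything down considerably.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"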
Dear carmocca, thank you for your reply! I tried your bug_report_model, but I still run into the same issue. I only added the lines accelerator="gpu", devices=1, strategy="ddp", num_nodes=4. Thank you in advance!
Hey @FlorianWieser1, our SLURM docs here have a template for the SLURM submission script. One very important thing is that the number of nodes and processes configured there needs to match what is in the Trainer! This mismatch is the most likely cause of it getting stuck. Try changing it to:
#!/bin/bash
#SBATCH -J test
#SBATCH -e test.err
#SBATCH -o test.log
#SBATCH --nodes=4 <--- MUST MATCH num_nodes IN TRAINER
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1 <--- MUST MATCH devices IN TRAINER
Trainer(accelerator="gpu", devices=1, strategy="ddp", num_nodes=4)
Also, please pay close attention to how you invoke the script; it should be done with srun. You have
python train.py
But it should be
srun python train.py
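Putting both fixes together, the full submission script would look roughly like this (a sketch based on the directives above; cluster-specific options such as partition or time limits are left out):

#!/bin/bash
#SBATCH -J test
#SBATCH -e test.err
#SBATCH -o test.log
#SBATCH --nodes=4              # must match num_nodes in the Trainer
#SBATCH --ntasks-per-node=1    # must match devices in the Trainer
#SBATCH --gres=gpu:1

SECONDS=0
# srun starts one task per node as configured above; each task becomes one
# DDP process, so the Trainer sees num_nodes=4 with devices=1 on each node.
srun python train.py
echo "$SECONDS seconds passed"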
I will try to make this clearer in the docs. I also have a related proposal open: #10150
Dear awaelchli, the missing "srun" command fixed it for me! :) Thanks a lot!
Great to see you were able to solve it.
If only we could provide a warning if the user forgot that. I thought about this but it seems impossible, because we can only auto-detect SLURM in a reliable way if the corresponding environment variables are set. But they only get set when you run with srun 😄
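For illustration, these are roughly the per-task variables involved (a sketch using the standard SLURM variable names, not Lightning's actual detection code, which may differ between versions):

import os

# srun exports these per task; with a plain `python train.py` only one process
# exists, so rank 0 registers as member 1/4 and waits forever for the rest.
def slurm_rank_info():
    return {
        "global_rank": os.environ.get("SLURM_PROCID"),  # rank of this task
        "local_rank": os.environ.get("SLURM_LOCALID"),  # rank within the node
        "node_rank": os.environ.get("SLURM_NODEID"),    # index of this node
        "world_size": os.environ.get("SLURM_NTASKS"),   # total number of tasks
    }

if __name__ == "__main__":
    print(slurm_rank_info())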
So this is a pickle. Writing it in bold in the docs will help some people, but would it have helped you @FlorianWieser1? Where were you looking first before coming to us on Slack/GitHub?
Haha, I see, that's a problem :joy:
I don't know, to be honest. I've been on the documentation page you linked multiple times, but I still overlooked or forgot about the "srun". I'm not new to SLURM, so I knew about "srun". I guess the documentation is fine and this was really my fault :) Before I came here I randomly searched the internet for things like "slurm pytorch lightning ddp multi node" etc.
Best, Florian