clearml does not support pytorch-lightning with multi-gpus
Hi, I am trying to run ClearML with pytorch-lightning on multiple GPUs, but the agent does not capture anything that happens inside the fit function (progress bar, TensorBoard scalars/plots, etc.). With plain PyTorch on multi-GPU, or with pytorch-lightning on CPU / a single GPU, everything works fine. To be sure, I also ran your example code on multi-GPU and it didn't work either (see the attached file for the corresponding adjustments).
Specs:
- Ubuntu 20.04 DGX (8x A100)
- Python 3.8.12
- CUDA 11.4
- torch 1.10.0
- pytorch-lightning 1.6.0
- clearml 1.3.0
- clearml-agent 1.1.2
Hi @manelabinyamin,
I'll take a look at it. I (unfortunately) still don't have a DGX, but I'll hunt down a machine with multiple GPUs :)
Thanks :) Please keep me updated
Hi @manelabinyamin ,
I was able to reproduce this issue. I just want to make sure we're seeing the same thing. The only difference I see between multi-GPU and single-GPU is that some of the metrics are not reported, namely "epoch", "test_loss" and "valid_loss". On multi-GPU I do see "hp_metric", and I also see the progress bar (but with fewer reports, I guess because of the extra processing power?). I don't see any plots on either of them, so I could not compare those.
Let me know if this is what you also see and I can move forward with fixing this :)
Hi @erezalg, I know the example script doesn't use plots, but from my experience they won't work either. In general, the agent won't capture anything reported from within the 'fit' function (the training loop), which is why you can still see the hp metric, etc. A simple test you can run is to log an empty image inside the training step by replacing training_step with the following code.
def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    # requires `import numpy as np`; logs a blank 10x10 image via the underlying TensorBoard logger
    self.logger.experiment.add_image('check_plot', np.zeros((10, 10, 1)), self.global_step, dataformats='HWC')
    return loss
My best guess is that there is some problem with the device ranks (a quick way to check this is sketched below)...
thanks a lot!
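For reference, here is a minimal, illustrative sketch (not code from this thread) of how one could check the rank hypothesis in the example's LitClassifier, by printing which DDP rank each training step runs on:

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    # global_rank / local_rank are LightningModule properties populated by the Trainer;
    # this shows whether reports from non-zero ranks (or even rank 0) are the ones being dropped
    print(f"global_rank={self.global_rank}, local_rank={self.local_rank}")
    return loss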
Hi @manelabinyamin ,
Yeah that makes sense :) We're looking into it and hopefully will come up with a solution soon!
Hi @manelabinyamin ,
Sorry for the late reply. We've looked into it, and the good news is that we've found a solution: you always have to initialize the Task before writing any model code.
I've edited the code you attached in clearml_example.txt, and it works now:
from argparse import ArgumentParser

import torch
import pytorch_lightning as pl
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from clearml import Task
from torchvision.datasets.mnist import MNIST
from torchvision import transforms

# Connecting ClearML with the current process,
# from here on everything is logged automatically
task = Task.init(project_name="examples", task_name="PyTorch lightning MNIST example")


class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.l1 = torch.nn.Linear(28 * 28, self.hparams.hidden_dim)
        self.l2 = torch.nn.Linear(self.hparams.hidden_dim, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.relu(self.l1(x))
        x = torch.relu(self.l2(x))
        return x

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('valid_loss', loss)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('test_loss', loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

    @staticmethod
    def add_model_specific_args(parent_parser):
        parser = ArgumentParser(parents=[parent_parser], add_help=False)
        parser.add_argument('--hidden_dim', type=int, default=128)
        parser.add_argument('--learning_rate', type=float, default=0.0001)
        return parser


if __name__ == '__main__':
    pl.seed_everything(0)

    parser = ArgumentParser()
    parser.add_argument('--batch_size', default=32, type=int)
    parser = pl.Trainer.add_argparse_args(parser)
    parser.set_defaults(max_epochs=3, gpus=8)
    parser = LitClassifier.add_model_specific_args(parser)
    args = parser.parse_args()

    # ------------
    # data
    # ------------
    dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
    mnist_test = MNIST('', train=False, download=True, transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])
    train_loader = DataLoader(mnist_train, batch_size=args.batch_size)
    val_loader = DataLoader(mnist_val, batch_size=args.batch_size)
    test_loader = DataLoader(mnist_test, batch_size=args.batch_size)

    # ------------
    # model
    # ------------
    model = LitClassifier(args.hidden_dim, args.learning_rate)

    # ------------
    # training
    # ------------
    trainer = pl.Trainer.from_argparse_args(args)
    trainer.fit(model, train_loader, val_loader)

    # ------------
    # testing
    # ------------
    trainer.test(test_dataloaders=test_loader)
Hi @manelabinyamin ,
Are you still facing this issue? Have you applied our solution? Please let us know.
Hi @Rizwan-Hasan ,
I am using clearml-agent version 1.4.1 and clearml version 1.8.0, and this is not working for multiple GPUs. I am using the example script at https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch-lightning/pytorch_lightning_example.py
I made a small modification so it can also be tested on CPU-only machines:
Replaced parser.set_defaults(max_epochs=3)
with
if torch.cuda.is_available():
    parser.set_defaults(max_epochs=3, accelerator="gpu", devices=-1)
else:
    parser.set_defaults(max_epochs=3)
Here are the results:
- Doesn't work with devices = -1 on an 8-GPU machine
- Works with devices = -1 on a single-GPU machine
- Works with devices = 1 on an 8-GPU machine

So I am only able to use one GPU at a time. The tail of the execution log is below:
Environment setup completed successfully
Starting Task Execution:
2022-11-22 13:20:31
ClearML results page: https://app.clearml.dev.xxx.net/projects/1711e7e1538f454186422bc88362ad4b/experiments/9103459c7f70447b9ce08eaef21f4659/output/log
Global seed set to 0
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to MNIST/raw/train-images-idx3-ubyte.gz
100% 9912422/9912422 [00:00<00:00, 55682849.90it/s]
Extracting MNIST/raw/train-images-idx3-ubyte.gz to MNIST/raw
2022-11-22 13:20:36
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to MNIST/raw/train-labels-idx1-ubyte.gz
100% 28881/28881 [00:00<00:00, 5670084.90it/s]
Extracting MNIST/raw/train-labels-idx1-ubyte.gz to MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to MNIST/raw/t10k-images-idx3-ubyte.gz
100% 1648877/1648877 [00:00<00:00, 13600305.59it/s]
Extracting MNIST/raw/t10k-images-idx3-ubyte.gz to MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to MNIST/raw/t10k-labels-idx1-ubyte.gz
100% 4542/4542 [00:00<00:00, 18006170.86it/s]
Extracting MNIST/raw/t10k-labels-idx1-ubyte.gz to MNIST/raw
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
2022-11-22 13:20:41
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
2022-11-22 13:20:47
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
2022-11-22 13:20:52
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
2022-11-22 13:20:57
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
2022-11-22 13:21:02
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
2022-11-22 13:21:07
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
2022-11-22 13:23:32
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
The task execution is not logged after this point; there seems to be no progress even after a long time, the CPU usage is stuck at around 25%, and the GPU usage is 0%.
I am using the following command to start the task:
clearml-task --project ClearMLpractice --name hello_ptl --repo [email protected]:xx/xx.git --branch master --script pytorch_lightning/ptl_mnist.py --args batch_size=64 max_epochs=30 --docker pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime --docker_args "-v /home/xxx/.ssh:/root/.ssh:ro" --queue default
Running the same example directly on an 8-GPU machine leads to the following issue:
Traceback (most recent call last):
  File "test.py", line 93, in <module>
    trainer.fit(model, train_loader, val_loader)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 582, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGBUS
Hi @ssetu,
I'll take a look at your issue and update you soon.
@Rizwan-Hasan I've found a solution. Multi-GPU training requires inter-process communication, so either the --ipc=host flag should be used, or a larger shared memory needs to be allocated using the --shm-size flag.
@ssetu That's good to hear. Can you please post the solution code here?
One solution is to add this line to the agent's clearml.conf file:
extra_docker_arguments: ["--ipc=host", ]
Alternatively, we can allocate a larger shared memory by specifying
extra_docker_arguments: ["--shm-size=8g", ]
Be careful not to exceed your RAM size when using the latter.
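For completeness, the same docker arguments can also be attached to a single task from code instead of globally in clearml.conf. A minimal sketch, assuming the SDK's Task.set_base_docker accepts docker_image / docker_arguments keyword arguments (the image name is the one from the clearml-task command above):

from clearml import Task

# set the container and the IPC / shared-memory docker argument on this task only,
# as a per-task alternative to extra_docker_arguments in the agent's clearml.conf
task = Task.init(project_name="examples", task_name="PyTorch lightning MNIST example")
task.set_base_docker(
    docker_image="pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime",
    docker_arguments="--ipc=host",  # or "--shm-size=8g"
)

When launching with clearml-task as shown earlier, the same flag can presumably also be passed per run via --docker_args "--ipc=host".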
For me, the only solution I found to make ClearML log scalars when using multiple GPUs is to make the Task part of the LightningModule, i.e.:
class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        self.task = Task.init(...)
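To make the workaround concrete, here is a minimal sketch (not the commenter's exact code) of how it would fit into the LitClassifier from the example earlier in this thread; Task.init is moved from module level into __init__, so it also runs inside the processes that DDP spawns for each GPU, and the project/task names are the ones used above:

import torch
import pytorch_lightning as pl
from torch.nn import functional as F
from clearml import Task


class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        # initialize ClearML from inside the LightningModule (the workaround described above)
        self.task = Task.init(
            project_name="examples",
            task_name="PyTorch lightning MNIST example",
        )
        self.save_hyperparameters()
        self.l1 = torch.nn.Linear(28 * 28, hidden_dim)
        self.l2 = torch.nn.Linear(hidden_dim, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        return torch.relu(self.l2(torch.relu(self.l1(x))))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log('train_loss', loss)  # scalar that should now appear in ClearML on multi-GPU
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

The rest of the script (argument parsing, data loaders, Trainer) stays the same as in the example above, just without the module-level Task.init call.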