clearml does not support pytorch-lightning with multi-gpus
Hi, I am trying to run ClearML with pytorch-lightning on multiple GPUs, but the agent does not capture anything that happens inside the fit function (progress bar, TensorBoard scalars/plots, etc.). With plain PyTorch on multi-GPU, or with pytorch-lightning on CPU / a single GPU, everything works fine. To be sure, I also ran your example code on multi-GPU and it didn't work either (see the attached file for the corresponding adjustments).
Specs:
- Ubuntu 20.04 DGX (8x A100)
- Python 3.8.12
- CUDA 11.4
- torch 1.10.0
- pytorch-lightning 1.6.0
- clearml 1.3.0
- clearml-agent 1.1.2
Hi @manelabinyamin,
I'll take a look at it. I (unfortunately) still don't have a DGX, but I'll hunt down a machine with multiple GPUs :)
Thanks :) Please keep me updated
Hi @manelabinyamin ,
I was able to reproduce this issue. I just want to make sure we're seeing the same thing. The only difference I see between multi-GPU and single-GPU is that some of the metrics are not reported, namely "epoch", "test_loss" and "valid_loss". On multi-GPU I do see "hp_metric", and I also see the progress bar (but with fewer reports, I guess because of the extra processing power?). I don't see any plots on either of them, so I could not compare those.
Let me know if this is what you also see and I can move forward with fixing this :)
Hi @erezalg, I know the example script doesn't use plots, but from my experience they won't work either. In general, the agent won't capture anything reported from within the 'fit' function (the training loop), which is why you can still see the hp metric, etc. A simple test you can run is to log an empty image inside the training step by replacing training_step with the following code.
def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    # requires `import numpy as np`; logs a blank 10x10 image via the underlying TensorBoard logger
    self.logger.experiment.add_image('check_plot', np.zeros((10, 10, 1)), self.global_step, dataformats='HWC')
    return loss
My best guess is that there is some problem with the device ranks (a quick way to check this is sketched below)...
thanks a lot!
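For reference, here is a minimal, illustrative sketch (not code from this thread) of how one could check the rank hypothesis in the example's LitClassifier, by printing which DDP rank each training step runs on:

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    # global_rank / local_rank are LightningModule properties populated by the Trainer;
    # this shows whether reports from non-zero ranks (or even rank 0) are the ones being dropped
    print(f"global_rank={self.global_rank}, local_rank={self.local_rank}")
    return loss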
Hi @manelabinyamin ,
Yeah that makes sense :) We're looking into it and hopefully will come up with a solution soon!
Hi @manelabinyamin ,
Sorry for the late reply. We've looked into it, and the good news is that we've found a solution: you always have to initialize the Task before writing any model code.
I've edited the code you attached in clearml_example.txt, and it works now:
from argparse import ArgumentParser

import torch
import pytorch_lightning as pl
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from clearml import Task
from torchvision.datasets.mnist import MNIST
from torchvision import transforms

# Connecting ClearML with the current process,
# from here on everything is logged automatically
task = Task.init(project_name="examples", task_name="PyTorch lightning MNIST example")


class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.l1 = torch.nn.Linear(28 * 28, self.hparams.hidden_dim)
        self.l2 = torch.nn.Linear(self.hparams.hidden_dim, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.relu(self.l1(x))
        x = torch.relu(self.l2(x))
        return x

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('valid_loss', loss)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('test_loss', loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

    @staticmethod
    def add_model_specific_args(parent_parser):
        parser = ArgumentParser(parents=[parent_parser], add_help=False)
        parser.add_argument('--hidden_dim', type=int, default=128)
        parser.add_argument('--learning_rate', type=float, default=0.0001)
        return parser


if __name__ == '__main__':
    pl.seed_everything(0)

    parser = ArgumentParser()
    parser.add_argument('--batch_size', default=32, type=int)
    parser = pl.Trainer.add_argparse_args(parser)
    parser.set_defaults(max_epochs=3, gpus=8)
    parser = LitClassifier.add_model_specific_args(parser)
    args = parser.parse_args()

    # ------------
    # data
    # ------------
    dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
    mnist_test = MNIST('', train=False, download=True, transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])
    train_loader = DataLoader(mnist_train, batch_size=args.batch_size)
    val_loader = DataLoader(mnist_val, batch_size=args.batch_size)
    test_loader = DataLoader(mnist_test, batch_size=args.batch_size)

    # ------------
    # model
    # ------------
    model = LitClassifier(args.hidden_dim, args.learning_rate)

    # ------------
    # training
    # ------------
    trainer = pl.Trainer.from_argparse_args(args)
    trainer.fit(model, train_loader, val_loader)

    # ------------
    # testing
    # ------------
    trainer.test(test_dataloaders=test_loader)
Hi @manelabinyamin ,
Are you still facing this issue? Have you applied our solution? Please let us know.
Hi @Rizwan-Hasan ,
I am using clearml-agent version 1.4.1 and clearml version 1.8.0, and this is not working for multiple GPUs. I am using the example script at https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch-lightning/pytorch_lightning_example.py
I made a small modification so it can also be tested on CPU-only machines:
Replaced parser.set_defaults(max_epochs=3)
with
if torch.cuda.is_available():
    parser.set_defaults(max_epochs=3, accelerator="gpu", devices=-1)
else:
    parser.set_defaults(max_epochs=3)
Here are the results:
- Doesn't work with devices = -1 on an 8-GPU machine
- Works with devices = -1 on a single-GPU machine
- Works with devices = 1 on an 8-GPU machine

So I am only able to use one GPU at a time. The tail of the execution log is below:
Environment setup completed successfully
Starting Task Execution:
2022-11-22 13:20:31
ClearML results page: https://app.clearml.dev.xxx.net/projects/1711e7e1538f454186422bc88362ad4b/experiments/9103459c7f70447b9ce08eaef21f4659/output/log
Global seed set to 0
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to MNIST/raw/train-images-idx3-ubyte.gz
100% 9912422/9912422 [00:00<00:00, 55682849.90it/s]
Extracting MNIST/raw/train-images-idx3-ubyte.gz to MNIST/raw
2022-11-22 13:20:36
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to MNIST/raw/train-labels-idx1-ubyte.gz
100% 28881/28881 [00:00<00:00, 5670084.90it/s]
Extracting MNIST/raw/train-labels-idx1-ubyte.gz to MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to MNIST/raw/t10k-images-idx3-ubyte.gz
100% 1648877/1648877 [00:00<00:00, 13600305.59it/s]
Extracting MNIST/raw/t10k-images-idx3-ubyte.gz to MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to MNIST/raw/t10k-labels-idx1-ubyte.gz
100% 4542/4542 [00:00<00:00, 18006170.86it/s]
Extracting MNIST/raw/t10k-labels-idx1-ubyte.gz to MNIST/raw
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
2022-11-22 13:20:41
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
2022-11-22 13:20:47
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
2022-11-22 13:20:52
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
2022-11-22 13:20:57
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
2022-11-22 13:21:02
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
2022-11-22 13:21:07
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
2022-11-22 13:23:32
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
The task execution is not logged after this point; there seems to be no progress even after a long time, the CPU usage is stuck at around 25%, and the GPU usage is 0%.
I am using the following command to start the task:
clearml-task --project ClearMLpractice --name hello_ptl --repo [email protected]:xx/xx.git --branch master --script pytorch_lightning/ptl_mnist.py --args batch_size=64 max_epochs=30 --docker pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime --docker_args "-v /home/xxx/.ssh:/root/.ssh:ro" --queue default
Running the same example directly on an 8-GPU machine leads to the following issue:
Traceback (most recent call last):
  File "test.py", line 93, in <module>
    trainer.fit(model, train_loader, val_loader)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 582, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGBUS
Hi @ssetu,
I'll take a look at your issue and update you soon.
@Rizwan-Hasan I've found a solution. Multi-GPU training requires inter-process communication, so either the --ipc=host flag should be used, or a larger shared memory needs to be allocated using the --shm-size flag.
@ssetu That's good to hear. Can you please post the solution code here?
One solution is to add this line to the agent's clearml.conf file:
extra_docker_arguments: ["--ipc=host", ]
Alternatively, we can allocate a larger shared memory by specifying
extra_docker_arguments: ["--shm-size=8g", ]
Be careful not to exceed your RAM size when using the latter.
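For completeness, the same docker arguments can also be attached to a single task from code instead of globally in clearml.conf. A minimal sketch, assuming the SDK's Task.set_base_docker accepts docker_image / docker_arguments keyword arguments (the image name is the one from the clearml-task command above):

from clearml import Task

# set the container and the IPC / shared-memory docker argument on this task only,
# as a per-task alternative to extra_docker_arguments in the agent's clearml.conf
task = Task.init(project_name="examples", task_name="PyTorch lightning MNIST example")
task.set_base_docker(
    docker_image="pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime",
    docker_arguments="--ipc=host",  # or "--shm-size=8g"
)

When launching with clearml-task as shown earlier, the same flag can presumably also be passed per run via --docker_args "--ipc=host".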
For me, the only solution I found to make ClearML log scalars when using multiple GPUs is to make the Task part of the LightningModule, i.e.:
class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        self.task = Task.init(...)
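To make the workaround concrete, here is a minimal sketch (not the commenter's exact code) of how it would fit into the LitClassifier from the example earlier in this thread; Task.init is moved from module level into __init__, so it also runs inside the processes that DDP spawns for each GPU, and the project/task names are the ones used above:

import torch
import pytorch_lightning as pl
from torch.nn import functional as F
from clearml import Task


class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        # initialize ClearML from inside the LightningModule (the workaround described above)
        self.task = Task.init(
            project_name="examples",
            task_name="PyTorch lightning MNIST example",
        )
        self.save_hyperparameters()
        self.l1 = torch.nn.Linear(28 * 28, hidden_dim)
        self.l2 = torch.nn.Linear(hidden_dim, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        return torch.relu(self.l2(torch.relu(self.l1(x))))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log('train_loss', loss)  # scalar that should now appear in ClearML on multi-GPU
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

The rest of the script (argument parsing, data loaders, Trainer) stays the same as in the example above, just without the module-level Task.init call.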