anomalib
✨ Add Multi-GPU Support to v1.1
What is the motivation for this task?
I'm going to train the EfficientAd model on a custom dataset. How do I train or test using multiple GPUs? Please tell me which command to use.
Describe the solution you'd like
Currently, I'm training on a single device only.
$ python3 tools/train.py --model efficient_ad
Additional context
No response
@samet-akcay Is there any code implementation for using multiple GPUs?
@lemonbuilder, this has now been added to the roadmap.
This task would close the following issues: #930 #1110 #930 #1398
@samet-akcay, sorry, I got an error when training with multiple GPUs on v1. How can I use only one GPU for training, for example the one with ID 3? This is the code I'm currently using for training:
# Import the required modules
from anomalib.data import MVTec
from anomalib.models import EfficientAd
from anomalib.engine import Engine
# Initialize the datamodule, model and engine
datamodule = MVTec()
model = EfficientAd()
engine = Engine()
# Train the model
engine.fit(datamodule=datamodule, model=model)
@nguyenanhtuan1008, you could refer to this link. https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html#choosing-gpu-devices
In this case, you could initialize the Engine class as follows:
# A list selects GPUs by index; the string "3" would be read as a count (the first 3 GPUs).
engine = Engine(accelerator="gpu", devices=[3])
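For reference, the Lightning page linked above documents that the devices argument is read differently depending on its type: an integer or a plain numeric string is a count of GPUs, while a list or a comma-separated string is a set of GPU indices. The sketch below mimics those documented rules in plain Python so the difference is easy to see. The function resolve_gpu_indices is a hypothetical helper written for this illustration; it is not part of Lightning or anomalib.

```python
def resolve_gpu_indices(devices, available):
    """Illustrative re-statement of Lightning's documented `devices` rules.

    `resolve_gpu_indices` is a hypothetical helper, not a Lightning API.
    `available` is the number of GPUs on the machine.
    """
    if isinstance(devices, str):
        text = devices.strip()
        if "," in text:
            # "1, 3" -> explicit indices [1, 3]
            devices = [int(part) for part in text.split(",") if part.strip()]
        else:
            # "3" -> the integer 3, which is a count, not an index
            devices = int(text)
    if isinstance(devices, int):
        if devices == -1:
            # -1 -> every available GPU
            return list(range(available))
        # k -> the first k GPUs
        return list(range(devices))
    # A list is taken as explicit GPU indices
    return list(devices)


print(resolve_gpu_indices("3", available=8))     # count: first three GPUs
print(resolve_gpu_indices([3], available=8))     # index: only GPU 3
print(resolve_gpu_indices("1, 3", available=8))  # indices: GPUs 1 and 3
```

So to pin training to the GPU with ID 3, a list such as [3] is the unambiguous form.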
@samet-akcay Thank you so much. I got training to work, but it still failed with an error after 1 epoch, so I gave up and am using a single GPU for now.
Hello, I wish to take this issue. Thank you @samet-akcay, and the good work.
Hi @samet-akcay I would like to work on this issue. Can I take this issue?
@RitikaxShakya, thanks for your interest. I totally missed this one, but it looks like @Bepitic has already shown interest in it. If he doesn't want to work on it, it could be all yours. How does that sound?
@Bepitic, are you still interested in this issue? If not @RitikaxShakya can take it?
Yes, for sure. Since no one confirmed it with me, I also forgot about the multi-GPU one 😅
Sorry about that.
@RitikaxShakya, all yours then
.take
@blaz-r @samet-akcay Hello! I need help with the parts of the code that deal with GPU initialization, data parallelization, and GPU-specific operations, as these are the areas I think I'll need to modify to add multi-GPU support.
I am not that familiar with these topics within the Anomalib. @ashwinvaidya17 could you provide some insight here?
@ashwinvaidya17 Hello! Please help me with the parts of the code that deal with GPU initialization, data parallelization, and GPU-specific operations, as these are the areas I think I'll need to modify to add multi-GPU support.
@RitikaxShakya currently we override the number of devices to 1 in Engine and the CLI.
To start with, we should remove these lines. https://github.com/openvinotoolkit/anomalib/blob/debdae70bc6e089958eaefa066b4bcd79711bb23/src/anomalib/engine/engine.py#L305 https://github.com/openvinotoolkit/anomalib/blob/debdae70bc6e089958eaefa066b4bcd79711bb23/src/anomalib/utils/config.py#L130
Doing this will break a bunch of stuff across the repo.
- For example, all the trainer.model calls will break. https://github.com/openvinotoolkit/anomalib/blob/debdae70bc6e089958eaefa066b4bcd79711bb23/src/anomalib/callbacks/checkpoint.py#L38 These should be replaced with trainer.lightning_module.
- You will also need to test each model to replace all .cpu() operations, as we move large tensors out of CUDA memory to mitigate OOM issues. In the case of Padim, the following line will break https://github.com/openvinotoolkit/anomalib/blob/debdae70bc6e089958eaefa066b4bcd79711bb23/src/anomalib/models/image/padim/lightning_model.py#L86 as the embeddings are on the CPU. These should be moved to cuda before calling model.fit. Something as simple as to(self.device) should fix it. From my initial experiments this isn't sufficient to make the model work, but it's a good start.
- I am not sure if this is affected by distributed training, but you might also need to look at thresholding and metrics computation. https://github.com/openvinotoolkit/anomalib/blob/debdae70bc6e089958eaefa066b4bcd79711bb23/src/anomalib/callbacks/thresholding.py#L182
I might have missed something so feel free to report any difficulties you run into.
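To illustrate the first point: under DDP, trainer.model returns the DistributedDataParallel wrapper, which keeps the real module in its .module attribute and does not forward custom attributes such as learning_type, whereas trainer.lightning_module always returns the unwrapped module. The sketch below reproduces that behaviour with plain-Python stand-ins so it runs without a GPU. AnomalyModuleStandIn, DistributedWrapperStandIn and TrainerStandIn are hypothetical classes made up for this illustration; they are not anomalib, torch or Lightning classes.

```python
class AnomalyModuleStandIn:
    """Hypothetical stand-in for an anomalib Lightning module."""

    learning_type = "one_class"


class DistributedWrapperStandIn:
    """Hypothetical stand-in for DistributedDataParallel.

    Like the real wrapper, it stores the wrapped model in `.module`
    and does not expose the wrapped model's custom attributes.
    """

    def __init__(self, module):
        self.module = module


class TrainerStandIn:
    """Hypothetical stand-in exposing `model` and `lightning_module`."""

    def __init__(self, module, distributed):
        self._module = module
        self._distributed = distributed

    @property
    def model(self):
        # Under a distributed strategy this is the wrapper, not the module.
        if self._distributed:
            return DistributedWrapperStandIn(self._module)
        return self._module

    @property
    def lightning_module(self):
        # Always the unwrapped module, regardless of strategy.
        return self._module


trainer = TrainerStandIn(AnomalyModuleStandIn(), distributed=True)

# Attribute lookup on the wrapper fails, as in the checkpoint callback.
wrapper_has_attr = hasattr(trainer.model, "learning_type")

# The unwrapped module is reachable either way.
via_lightning_module = trainer.lightning_module.learning_type
via_wrapper = trainer.model.module.learning_type
```

Going through lightning_module is the simpler change, since it behaves the same for single-device and distributed runs.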
Using latest anomalib 1.1.0 from pip I create the Engine like so:
engine = Engine(
max_epochs=100,
task=task_type,
accelerator="gpu",
devices=-1,
)
By passing devices=-1 I thought training would utilize all my available GPUs.
PyTorch can see them; I get this output from training:
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
However, when looking at the output from nvidia-smi I can see that only one GPU is used.
I also tried to pass strategy="ddp" when creating the engine; however, then I get this error:
Traceback (most recent call last):
File "/data/scratch/mkw-anomalib/anomalib-test.py", line 41, in <module>
train()
File "/data/scratch/mkw-anomalib/anomalib-test.py", line 36, in train
engine.fit(datamodule=datamodule, model=model)
File "/home/sinntelligence/.local/lib/python3.10/site-packages/anomalib/engine/engine.py", line 540, in fit
self.trainer.fit(model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
results = self._run_stage()
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
self.fit_loop.run()
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
self.advance()
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
self.advance(data_fetcher)
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 269, in advance
call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 208, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 295, in on_train_batch_end
if self._should_skip_saving_checkpoint(trainer):
File "/home/sinntelligence/.local/lib/python3.10/site-packages/anomalib/callbacks/checkpoint.py", line 38, in _should_skip_saving_checkpoint
is_zero_or_few_shot = trainer.model.learning_type in [LearningType.ZERO_SHOT, LearningType.FEW_SHOT]
File "/home/sinntelligence/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'learning_type'
Is this a bug, or how can I use all my GPUs for training with anomalib 1.1.0?
@haimat, we aim to enable multi-gpu support in v1.2
@samet-akcay Thanks for your quick reply. So around end of July?
yeah, that's the plan hopefully :)
Hello guys, any news on this? Do you have an ETA for 1.2 and multi-GPU training?
@samet-akcay Hello Samet, can you estimate when multi-GPU training will be available?
@haimat unfortunately we don't have an exact timeline for this. Currently, we are busy with some other high-priority tasks.
@samet-akcay I used the following parameters for multi-GPU training, but only a single GPU is used. How do I set the number of epochs? I am very anxious.
from anomalib.models import Patchcore
from anomalib.engine import Engine
# Create the model and engine
model = Patchcore()
engine = Engine(max_epochs=30, task="classification", accelerator="gpu", devices=3)
# Train a Patchcore model on the given datamodule
engine.train(datamodule=datamodule, model=model)
What is the default epoch?
@goldwater668, as mentioned above, multi-GPU is not currently supported. The devices parameter is overwritten here to avoid any errors caused by multi-GPU issues.
https://github.com/openvinotoolkit/anomalib/blob/2bd2842ec33c6eedb351d53cf1a1082069ff69dc/src/anomalib/engine/engine.py#L327-L328
@samet-akcay Can I specify the GPU ID?
engine = Engine(max_epochs=10, task="classification", accelerator="gpu", devices=[1,2])
I specified the GPU IDs as in the code above. Why does the following output still appear? Can't I specify them?
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Also, when training Patchcore I run out of CUDA memory. How should I adjust the parameters?
Yes, you could specify the GPU ID.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
This is an automated log generated by Lightning. You could ignore this.
If you are experiencing out-of-memory issues with Patchcore, your dataset is probably too large to fit into a PatchCore memory bank. You could configure the Patchcore arguments to make it more memory efficient, for example by switching to a more efficient backbone or changing the layers to extract features from. https://anomalib.readthedocs.io/en/v1.1.0/markdown/guides/reference/models/image/patchcore.html
https://github.com/openvinotoolkit/anomalib/blob/2bd2842ec33c6eedb351d53cf1a1082069ff69dc/src/anomalib/models/image/patchcore/lightning_model.py#L25-L48
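A rough back-of-the-envelope estimate shows why the backbone and layer choice matter. The sketch below is only arithmetic, not anomalib code, and rests on several assumptions: 224x224 inputs, features taken from layer2 and layer3 with layer3 upsampled to the 28x28 layer2 grid, float32 values, and all patch embeddings held in memory before coreset subsampling. The channel counts are the standard torchvision ones (512 + 1024 for wide_resnet50_2, 128 + 256 for resnet18). memory_bank_gib is a hypothetical helper name.

```python
def memory_bank_gib(num_images, patches_per_image, embedding_dim, bytes_per_value=4):
    """Approximate size in GiB of the raw patch embeddings.

    Hypothetical helper for illustration: one embedding vector per patch,
    stored as float32 unless `bytes_per_value` says otherwise.
    """
    total_bytes = num_images * patches_per_image * embedding_dim * bytes_per_value
    return total_bytes / 2**30


# Assumption: 224x224 input and a stride-8 layer2 feature map -> 28 * 28 patches.
PATCHES_PER_IMAGE = 28 * 28

# Assumption: 1000 training images.
wide_resnet50_gib = memory_bank_gib(1000, PATCHES_PER_IMAGE, 512 + 1024)
resnet18_gib = memory_bank_gib(1000, PATCHES_PER_IMAGE, 128 + 256)

print(f"wide_resnet50_2, layer2+layer3: {wide_resnet50_gib:.2f} GiB")
print(f"resnet18, layer2+layer3:        {resnet18_gib:.2f} GiB")
```

Under these assumptions the smaller backbone needs a quarter of the memory, and the total grows linearly with the number of training images, so reducing the image count or the input resolution helps in the same way.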
@samet-akcay I specified GPUs 1 and 2 for training, but training still ran on GPU 0. Is there anything wrong with how the GPUs are specified in the setting below?
engine = Engine(max_epochs=10, task="classification", accelerator="gpu", devices=[1,2])
@goldwater668, you currently cannot set multiple GPUs as it will be mapped back to a single GPU.
With that being said, I noticed that Engine always configures the device to run on the default GPU even when the user explicitly chooses a specific GPU. I've created a PR to fix this https://github.com/openvinotoolkit/anomalib/pull/2256
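On versions without that fix, one general workaround that does not depend on anomalib is the CUDA_VISIBLE_DEVICES environment variable: CUDA only enumerates the GPUs listed there, so the chosen physical GPU becomes device 0 inside the process. The variable has to be set before CUDA is initialized, which in practice means before torch or anomalib run any GPU code. The sketch below is a minimal example of this; restrict_visible_gpus is a hypothetical helper name, and whether this is sufficient for a given setup is an assumption.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given physical GPU ids to CUDA in this process.

    Hypothetical helper: it only sets the standard CUDA_VISIBLE_DEVICES
    environment variable and returns the value it wrote.
    """
    value = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this at the very top of the training script, before any CUDA work.
visible = restrict_visible_gpus([1])

# After this point, physical GPU 1 is the only visible device and is seen
# as device 0, so a default single-device Engine would run on it.
```

The same effect can be had from the shell by prefixing the training command with CUDA_VISIBLE_DEVICES=1.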