
✨ Add Multi-GPU Support to v1.1

Open lemonbuilder opened this issue 1 year ago • 21 comments

What is the motivation for this task?

I'm going to train a custom dataset using the EfficientAd model. How do I train or test using multiple GPUs? Please tell me which command to use.

Describe the solution you'd like

Currently, I'm training on only a single device.
$ python3 tools/train.py --model efficient_ad

Additional context

No response

lemonbuilder avatar Oct 30 '23 08:10 lemonbuilder

@samet-akcay Is there any code implementation for using multiple GPUs?

lemonbuilder avatar Nov 05 '23 06:11 lemonbuilder

@lemonbuilder, this has now been added to the roadmap.

This task would close the following issues: #930 #1110 #1398

samet-akcay avatar Feb 09 '24 15:02 samet-akcay

@samet-akcay, sorry, I got an error when training with multi-GPU on v1. How can I use only one GPU, for example ID 3, for training? I'm currently using this code for training:

# Import the required modules
from anomalib.data import MVTec
from anomalib.models import EfficientAd
from anomalib.engine import Engine

# Initialize the datamodule, model and engine
datamodule = MVTec()
model = EfficientAd()

engine = Engine()

# Train the model
engine.fit(datamodule=datamodule, model=model)

nguyenanhtuan1008 avatar Feb 12 '24 06:02 nguyenanhtuan1008

@nguyenanhtuan1008, you could refer to this link: https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html#choosing-gpu-devices

In this case, you could initialize the Engine class as follows:

engine = Engine(accelerator="gpu", devices=[3])
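
Note that, per the linked Lightning docs, an integer (or a plain numeric string) is interpreted as the number of devices to use, while a list selects specific device indices, so a list is what you want for picking GPU 3:

    engine = Engine(accelerator="gpu", devices=[3])  # run on GPU index 3 only
    engine = Engine(accelerator="gpu", devices=3)    # run on the first 3 GPUs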

samet-akcay avatar Feb 12 '24 10:02 samet-akcay

@samet-akcay Thank you so much. I got the training to work, but I still got an error after one epoch, so I gave up and am using a single GPU right now.

nguyenanhtuan1008 avatar Feb 15 '24 02:02 nguyenanhtuan1008

Hello, I wish to take this issue. Thank you @samet-akcay, and keep up the good work.

Bepitic avatar Mar 13 '24 22:03 Bepitic

Hi @samet-akcay I would like to work on this issue. Can I take this issue?

RitikaxShakya avatar Mar 28 '24 12:03 RitikaxShakya

@RitikaxShakya, thanks for your interest. I totally missed this one, but it looks like @Bepitic has already shown interest in it. If he doesn't want to work on it, it could be all yours. How does that sound?

samet-akcay avatar Mar 28 '24 12:03 samet-akcay

@Bepitic, are you still interested in this issue? If not, @RitikaxShakya can take it.

samet-akcay avatar Mar 28 '24 12:03 samet-akcay

Yes, for sure. Since no one confirmed it with me, I also forgot about the multi-GPU one 😅

Bepitic avatar Mar 28 '24 13:03 Bepitic

sorry about that

samet-akcay avatar Mar 28 '24 13:03 samet-akcay

@RitikaxShakya, all yours then

samet-akcay avatar Mar 28 '24 13:03 samet-akcay

.take

RitikaxShakya avatar Mar 28 '24 13:03 RitikaxShakya

@blaz-r @samet-akcay Hello! I need help with the parts of the code that deal with GPU initialization, data parallelization, and GPU-specific operations, as these are the areas I think I'll need to modify to add multi-GPU support.

RitikaxShakya avatar Apr 06 '24 21:04 RitikaxShakya

I am not that familiar with these topics within Anomalib. @ashwinvaidya17, could you provide some insight here?

blaz-r avatar Apr 10 '24 10:04 blaz-r

@ashwinvaidya17 Hello! Please help me with the parts of the code that deal with GPU initialization, data parallelization, and GPU-specific operations, as these are the areas I think I'll need to modify to add multi-GPU support.

RitikaxShakya avatar Apr 12 '24 06:04 RitikaxShakya

@RitikaxShakya, currently we override the number of devices to 1 in the Engine and the CLI.

To start with, we should remove these lines. https://github.com/openvinotoolkit/anomalib/blob/debdae70bc6e089958eaefa066b4bcd79711bb23/src/anomalib/engine/engine.py#L305 https://github.com/openvinotoolkit/anomalib/blob/debdae70bc6e089958eaefa066b4bcd79711bb23/src/anomalib/utils/config.py#L130
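
Conceptually, the override amounts to something like the following (an illustrative paraphrase, not the exact source; see the two links above for the actual code):

    # Illustrative paraphrase only: anomalib v1 pins training to a single
    # device regardless of what the user requested.
    trainer_kwargs["devices"] = 1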

Doing this will break a bunch of stuff across the repo.

  1. For example, all the trainer.model calls will break. https://github.com/openvinotoolkit/anomalib/blob/debdae70bc6e089958eaefa066b4bcd79711bb23/src/anomalib/callbacks/checkpoint.py#L38 These should be replaced with trainer.lightning_module (see the first sketch after this list).
  2. You will also need to test each model and replace the .cpu() operations we use to move large tensors out of CUDA memory to mitigate OOM issues. In the case of Padim, the following line will break https://github.com/openvinotoolkit/anomalib/blob/debdae70bc6e089958eaefa066b4bcd79711bb23/src/anomalib/models/image/padim/lightning_model.py#L86 because the embeddings are on the CPU. They should be moved back to CUDA before calling model.fit; something as simple as to(self.device) should fix it (see the second sketch after this list). From my initial experiments this isn't sufficient to make the model work, but it's a good start.
  3. I am not sure whether this is affected by distributed training, but you might also need to look at thresholding and metrics computation. https://github.com/openvinotoolkit/anomalib/blob/debdae70bc6e089958eaefa066b4bcd79711bb23/src/anomalib/callbacks/thresholding.py#L182
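
For item 1, a minimal sketch of the kind of change needed (paraphrased from the checkpoint callback linked above, not the exact source). Under DDP, trainer.model returns the DistributedDataParallel wrapper, which does not expose custom attributes such as learning_type, while trainer.lightning_module always returns the underlying LightningModule:

    # Before: breaks under DDP, because trainer.model is the DDP wrapper
    is_zero_or_few_shot = trainer.model.learning_type in [LearningType.ZERO_SHOT, LearningType.FEW_SHOT]

    # After: trainer.lightning_module unwraps to the LightningModule
    is_zero_or_few_shot = trainer.lightning_module.learning_type in [LearningType.ZERO_SHOT, LearningType.FEW_SHOT]

For item 2, a hypothetical sketch of the Padim fix (variable names are illustrative): the embeddings collected on the CPU during training need to be moved back to the module's device before fitting the Gaussian:

    # Embeddings were moved to the CPU during training to avoid OOM;
    # bring them back to the module's device before fitting.
    embeddings = torch.vstack(self.embeddings).to(self.device)
    self.model.fit(embeddings)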

I might have missed something, so feel free to report any difficulties you run into.

ashwinvaidya17 avatar Apr 12 '24 08:04 ashwinvaidya17

Using the latest anomalib 1.1.0 from pip, I create the Engine like so:

    engine = Engine(
        max_epochs=100,
        task=task_type,
        accelerator="gpu",
        devices=-1,
    )

By passing devices=-1, I thought training would utilize all my available GPUs. PyTorch can see them; I get this output from training:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

However, looking at the output of nvidia-smi, I can see that only one GPU is used. I also tried passing strategy="ddp" when creating the engine, but then I get this error:

Traceback (most recent call last):
  File "/data/scratch/mkw-anomalib/anomalib-test.py", line 41, in <module>
    train()
  File "/data/scratch/mkw-anomalib/anomalib-test.py", line 36, in train
    engine.fit(datamodule=datamodule, model=model)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/anomalib/engine/engine.py", line 540, in fit
    self.trainer.fit(model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 269, in advance
    call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 295, in on_train_batch_end
    if self._should_skip_saving_checkpoint(trainer):
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/anomalib/callbacks/checkpoint.py", line 38, in _should_skip_saving_checkpoint
    is_zero_or_few_shot = trainer.model.learning_type in [LearningType.ZERO_SHOT, LearningType.FEW_SHOT]
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'learning_type'

Is this a bug, or how can I use all my GPUs for training with anomalib 1.1.0?

haimat avatar Jun 05 '24 18:06 haimat

@haimat, we aim to enable multi-GPU support in v1.2.

samet-akcay avatar Jun 05 '24 19:06 samet-akcay

@samet-akcay Thanks for your quick reply. So around the end of July?

haimat avatar Jun 05 '24 19:06 haimat

yeah, that's the plan hopefully :)

samet-akcay avatar Jun 05 '24 19:06 samet-akcay

Hello guys, any news on this? Do you have an ETA for 1.2 and multi-GPU training?

haimat avatar Aug 02 '24 08:08 haimat

@samet-akcay Hello Samet, can you estimate when multi-GPU training will be available?

haimat avatar Aug 08 '24 07:08 haimat

@haimat unfortunately we don't have an exact timeline for this. Currently, we are busy with some other high-priority tasks.

ashwinvaidya17 avatar Aug 09 '24 07:08 ashwinvaidya17

@samet-akcay I use the following parameters to perform multi-GPU training, but the result is a single GPU. Also, how do I set the number of epochs, and what is the default? I am in a hurry.

    # Create the model and engine
    from anomalib.models import Patchcore
    from anomalib.engine import Engine

    model = Patchcore()
    engine = Engine(max_epochs=30, task="classification", accelerator="gpu", devices=3)

    # Train a Patchcore model on the given datamodule
    engine.train(datamodule=datamodule, model=model)

watertianyi avatar Aug 19 '24 01:08 watertianyi

@goldwater668, as mentioned above, multi-GPU is not currently supported. The devices parameter is overwritten here to avoid any errors caused by multi-GPU issues. https://github.com/openvinotoolkit/anomalib/blob/2bd2842ec33c6eedb351d53cf1a1082069ff69dc/src/anomalib/engine/engine.py#L327-L328

samet-akcay avatar Aug 19 '24 07:08 samet-akcay

@samet-akcay Can I specify the GPU ID? I specify the GPU IDs according to the code below, so why does the following output still appear? Can't I specify them?

    engine = Engine(max_epochs=10, task="classification", accelerator='gpu', devices=[1,2])

    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

Also, when I am training Patchcore, I run out of CUDA memory. How should I adjust the parameters?

watertianyi avatar Aug 19 '24 07:08 watertianyi

Yes, you could specify the GPU ID.

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7] is an automated log generated by Lightning. You can ignore it.

If you are experiencing out-of-memory issues with Patchcore, your dataset is probably too large to fit in a PatchCore memory bank. You could configure the Patchcore arguments to make it more memory efficient, for example by changing the backbone to a more efficient one or changing the layers to extract. https://anomalib.readthedocs.io/en/v1.1.0/markdown/guides/reference/models/image/patchcore.html

https://github.com/openvinotoolkit/anomalib/blob/2bd2842ec33c6eedb351d53cf1a1082069ff69dc/src/anomalib/models/image/patchcore/lightning_model.py#L25-L48
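
For illustration, a hypothetical, more memory-efficient configuration (the specific values below are assumptions, not tuned recommendations):

    # Illustrative example of a lighter Patchcore setup.
    from anomalib.models import Patchcore

    model = Patchcore(
        backbone="resnet18",          # lighter than the default wide_resnet50_2
        layers=["layer2", "layer3"],  # which feature layers to extract
        coreset_sampling_ratio=0.01,  # keep a smaller fraction in the memory bank
    )

Reducing the input image size in the datamodule also shrinks the memory bank.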

samet-akcay avatar Aug 19 '24 08:08 samet-akcay

@samet-akcay I specified GPU cards 1 and 2 for training, but training still ran on card 0. Is there anything wrong with the GPUs specified in the settings above? engine = Engine(max_epochs=10, task="classification", accelerator='gpu', devices=[1,2])

watertianyi avatar Aug 19 '24 08:08 watertianyi

@goldwater668, you currently cannot set multiple GPUs, as the setting will be mapped back to a single GPU.

With that being said, I noticed that the Engine always configures the device to run on the default GPU even when the user explicitly chooses a specific one. I've created a PR to fix this: https://github.com/openvinotoolkit/anomalib/pull/2256

samet-akcay avatar Aug 19 '24 09:08 samet-akcay