
[Bug]: Warnings when training on multiple GPUs with 2.0.0

Open · haimat opened this issue 9 months ago · 0 comments

Describe the bug

When I train a Reverse Distillation (RD) model with anomalib 2.0.0 on a machine with multiple GPUs, I get this warning message right before the first epoch starts:

/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

Then after the first epoch completes, before the second one starts, I get this additional warning:

/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:434: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.

Is this a bug or something I should address on my end?
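
For context, both messages refer to the `self.log('train_loss', ...)` call made during training. Below is a hedged sketch of what Lightning is asking for, written as a generic `LightningModule` rather than anomalib's actual Reverse Distillation module (the module, its single layer, and the dict-style batch are placeholders for illustration only):

```python
import torch
from lightning.pytorch import LightningModule
from torch import nn


class SketchModule(LightningModule):
    """Illustrative module only; not anomalib's implementation."""

    def __init__(self) -> None:
        super().__init__()
        self.layer = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def training_step(self, batch: dict, batch_idx: int) -> torch.Tensor:
        images = batch["image"]           # assumes a dict-style batch
        loss = self.layer(images).mean()  # placeholder loss for the sketch

        # Pass the batch size explicitly so Lightning does not have to infer
        # it from an ambiguous (dict) collection, and sync the epoch-level
        # value across devices when running DDP.
        self.log(
            "train_loss",
            loss,
            batch_size=images.shape[0],
            sync_dist=True,
            on_epoch=True,
        )
        return loss

    def configure_optimizers(self) -> torch.optim.Optimizer:
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

This only illustrates the Lightning-side API the warnings point at; the actual `train_loss` logging happens inside anomalib's lightning model, not in my code.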

Dataset

Folder

Model

Reverse Distillation

Steps to reproduce the behavior

Train any model (Reverse Distillation in my case) on a multi-GPU machine with anomalib 2.0.0 and a recent lightning release (2.5.1), as in the sketch below.
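
A minimal reproduction sketch for reference; the dataset `name`, paths, and folder names are placeholders, and it assumes `Engine` forwards `accelerator`/`devices` to the Lightning `Trainer`:

```python
from anomalib.data import Folder
from anomalib.engine import Engine
from anomalib.models import ReverseDistillation

# Placeholder Folder datamodule; point it at your own normal/abnormal images.
datamodule = Folder(
    name="my_dataset",
    root="./datasets/my_dataset",
    normal_dir="normal",
    abnormal_dir="abnormal",
    train_batch_size=3,
)

model = ReverseDistillation()

# Extra keyword arguments are passed through to the Lightning Trainer,
# so this launches DDP training across all four GPUs.
engine = Engine(accelerator="gpu", devices=4)
engine.fit(model=model, datamodule=datamodule)
```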

OS information

OS information:

  • OS: Ubuntu 22.04
  • Python version: 3.10
  • Anomalib version: 2.0.0
  • PyTorch version: 2.6.0
  • CUDA/cuDNN version: 12.6
  • GPU models and configuration: 4x NVIDIA RTX A6000
  • Any other relevant information: I'm using a custom dataset

Expected behavior

I would expect anomalib to pass the correct batch size to Lightning when logging, so that these warnings do not appear.

Screenshots

[Screenshot of the console output showing the warnings above]

Pip/GitHub

pip

What version/branch did you use?

No response

Configuration YAML

none

Logs

Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
INFO:lightning_fabric.utilities.rank_zero:----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name           | Type                     | Params | Mode
--------------------------------------------------------------------
0 | pre_processor  | PreProcessor             | 0      | train
1 | post_processor | PostProcessor            | 0      | train
2 | evaluator      | Evaluator                | 0      | train
3 | model          | ReverseDistillationModel | 89.0 M | train
4 | loss           | ReverseDistillationLoss  | 0      | train
--------------------------------------------------------------------
89.0 M    Trainable params
0         Non-trainable params
89.0 M    Total params
356.009   Total estimated model params size (MB)
347       Modules in train mode
0         Modules in eval mode
Epoch 0:   0%|                                                                                                                 | 0/93 [00:00<?, ?it/s]
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/core/module.py:512: You called `self.log('train_loss', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████| 93/93 [12:22<00:00,  0.13it/s, train_loss_step=0.147]
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:434: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
Epoch 0: 100%|█████████████████████████████████████████████████████████| 93/93 [12:22<00:00,  0.13it/s, train_loss_step=0.147, train_loss_epoch=0.474]
INFO:lightning_fabric.utilities.rank_zero:Epoch 0, global step 93: 'train_loss' reached 0.47396 (best 0.47396), saving model to '/data/scratch/anomalib-2/results/ReverseDistillation/anomalib/latest/checkpoints/epoch=0-step=93.ckpt' as top 1
Epoch 1:  19%|███████████                                              | 18/93 [02:17<09:31,  0.13it/s, train_loss_step=0.131, train_loss_epoch=0.474]

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

haimat · Mar 26, 2025