anomalib training hangs with multiple GPUs

Describe the bug

Hello, author.

I am training a CFA model. It works well on a single GPU, but training hangs when I use multiple GPUs.

Thanks for your reply.

Dataset

MVTec

Model

Other (please specify in the field below)

Steps to reproduce the behavior

Complete YAML:

dataset:
  name: mvtec #options: [mvtec, btech, folder]
  format: mvtec
  path: ./datasets/MVTec
  category: P264
  task: segmentation
  train_batch_size: 4
  eval_batch_size: 4
  inference_batch_size: 4
  num_workers: 8
  image_size: 512 # dimensions to which images are resized (mandatory)
  center_crop: null # dimensions to which images are center-cropped after resizing (optional)
  normalization: imagenet # data distribution to which the images will be normalized: [none, imagenet]
  transform_config:
    train: null
    eval: null
  test_split_mode: from_dir # options: [from_dir, synthetic]
  test_split_ratio: 0.2 # fraction of train images held out testing (usage depends on test_split_mode)
  val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
  val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)

model:
  name: cfa
  backbone: wide_resnet50_2
  gamma_c: 1
  gamma_d: 1
  num_nearest_neighbors: 3
  num_hard_negative_features: 3
  radius: 1e-5
  lr: 1e-3
  weight_decay: 5e-4
  amsgrad: true
  early_stopping:
    patience: 500
    metric: pixel_AUROC
    mode: max
  normalization_method: min_max # options: [null, min_max, cdf]

metrics:
  image:
    - AUROC
  pixel:
    - AUROC
  threshold:
    adaptive: true
    image_default: null
    pixel_default: null

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: True # log images to the available loggers (if any)
  image_save_path: null # path to which images will be saved
  mode: full # options: ["full", "simple"]

project:
  seed: 0
  path: ./results

logging:
  logger: [] # options: [comet, tensorboard, wandb, csv] or combinations.
  log_graph: false # Logs the model graph to respective logger.

optimization:
  export_mode: null # options: torch, onnx, openvino

# PL Trainer Args. Don't add extra parameter here.
trainer:
  enable_checkpointing: true
  default_root_dir: null
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  num_nodes: 2
  devices: 0,1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1 # Don't validate before extracting features.
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 500
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  val_check_interval: 1.0 # Don't validate before extracting features.
  log_every_n_steps: 50
  accelerator: gpu # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  strategy: null
  sync_batchnorm: false
  precision: 32
  enable_model_summary: true
  num_sanity_val_steps: 0
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_n_epochs: 0
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  auto_scale_batch_size: false
  plugins: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle

OS information

OS: [e.g. Ubuntu 20.04]
Python version: [e.g. 3.8.10]
Anomalib version: [e.g. 0.3.6]
PyTorch version: [e.g. 1.9.0]
CUDA/cuDNN version: [e.g. 11.1]
GPU models and configuration: [e.g. 2x GeForce RTX 3090]
Any other relevant information: [e.g. I'm using a custom dataset]

Expected behavior

Training should run on multiple GPUs the same way it does on a single GPU.

Screenshots

No response

Pip/GitHub

pip

What version/branch did you use?

No response

Configuration YAML

Same as the complete YAML under "Steps to reproduce the behavior" above.

Logs

2023-05-30 08:54:07,906 - anomalib.utils.loggers - INFO - Loading the experiment logger(s)
2023-05-30 08:54:07,907 - anomalib.utils.callbacks - INFO - Loading the callbacks
/home/user/conda/lib/python3.10/site-packages/anomalib/utils/callbacks/__init__.py:142: UserWarning: Export option: None not found. Defaulting to no model export
  warnings.warn(f"Export option: {config.optimization.export_mode} not found. Defaulting to no model export")
2023-05-30 08:54:07,917 - pytorch_lightning.utilities.rank_zero - INFO - GPU available: True (cuda), used: True
2023-05-30 08:54:07,917 - pytorch_lightning.utilities.rank_zero - INFO - TPU available: False, using: 0 TPU cores
2023-05-30 08:54:07,917 - pytorch_lightning.utilities.rank_zero - INFO - IPU available: False, using: 0 IPUs
2023-05-30 08:54:07,917 - pytorch_lightning.utilities.rank_zero - INFO - HPU available: False, using: 0 HPUs
2023-05-30 08:54:07,918 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
2023-05-30 08:54:07,918 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
2023-05-30 08:54:07,918 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_test_batches=1.0)` was configured so 100% of the batches will be used..
2023-05-30 08:54:07,918 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_predict_batches=1.0)` was configured so 100% of the batches will be used..
2023-05-30 08:54:07,918 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
2023-05-30 08:54:07,918 - anomalib - INFO - Training the model.
2023-05-30 08:54:08,082 - anomalib.data.mvtec - INFO - Found the dataset.
[rank: 0] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
/home/user/conda/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (5.1.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
To use wandb logger install it using `pip install wandb`
/home/user/conda/lib/python3.10/site-packages/anomalib/config/config.py:238: UserWarning: The seed value is now fixed to 0. Up to v0.3.7, the seed was not fixed when the seed value was set to 0. If you want to use the random seed, please select `None` for the seed value (`null` in the YAML file) or remove the `seed` key from the YAML file.
  warn(
/home/user/conda/lib/python3.10/site-packages/anomalib/config/config.py:275: UserWarning: config.project.unique_dir is set to False. This does not ensure that your results will be written in an empty directory and you may overwrite files.
  warn(
/home/user/conda/lib/python3.10/site-packages/anomalib/config/config.py:299: DeprecationWarning: adaptive will be deprecated in favor of method in config.metrics.threshold in a future release
  warn(
/home/user/conda/lib/python3.10/site-packages/anomalib/config/config.py:306: DeprecationWarning: image_default will be deprecated in favor of manual_image in config.metrics.threshold in a future release.
  warn(
/home/user/conda/lib/python3.10/site-packages/anomalib/config/config.py:316: DeprecationWarning: pixel_default will be deprecated in favor of manual_pixel in config.metrics.threshold in a future release.
  warn(
[rank: 1] Global seed set to 0
2023-05-30 08:54:11,179 - anomalib.data - INFO - Loading the datamodule
2023-05-30 08:54:11,180 - anomalib.data.utils.transform - INFO - No config file has been provided. Using default transforms.
2023-05-30 08:54:11,180 - anomalib.data.utils.transform - INFO - No config file has been provided. Using default transforms.
2023-05-30 08:54:11,180 - anomalib.models - INFO - Loading the model.
2023-05-30 08:54:11,180 - anomalib.models.components.base.anomaly_module - INFO - Initializing CfaLightning model.
/home/user/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/user/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=Wide_ResNet50_2_Weights.IMAGENET1K_V1`. You can also use `weights=Wide_ResNet50_2_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
2023-05-30 08:54:12,518 - anomalib.utils.loggers - INFO - Loading the experiment logger(s)
2023-05-30 08:54:12,518 - anomalib.utils.callbacks - INFO - Loading the callbacks
/home/user/conda/lib/python3.10/site-packages/anomalib/utils/callbacks/__init__.py:142: UserWarning: Export option: None not found. Defaulting to no model export
  warnings.warn(f"Export option: {config.optimization.export_mode} not found. Defaulting to no model export")
2023-05-30 08:54:12,529 - anomalib - INFO - Training the model.
[rank: 1] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
2023-05-30 08:54:12,818 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2023-05-30 08:54:22,821 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:54:32,823 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:54:42,826 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:54:52,834 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:02,843 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:12,852 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:22,855 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:32,858 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:42,860 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:52,862 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:56:02,869 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:56:12,870 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:56:22,880 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
^C2023-05-30 08:56:23,549 - anomalib - INFO - Loading the best model weights.
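
One detail in the log above may be worth checking. This is a hedged reading, not a confirmed diagnosis: rank 0 reports `MEMBER: 1/4` and the barrier waits with `world_size=4`, while the trainer config sets `num_nodes: 2` and `devices: 0,1`. Assuming PyTorch Lightning computes the world size as the number of nodes times the devices per node, this config asks for four processes, and on a single machine at most two can ever join the barrier. The helper names below (`parse_devices`, `expected_world_size`) are hypothetical and exist only for this sketch; they are not part of anomalib or Lightning.

```python
def parse_devices(devices):
    """Return the number of devices per node for a Lightning-style `devices` value.

    Hypothetical helper for illustration. Assumed conventions:
    - int: a device count
    - list/tuple: explicit device indices
    - str with commas (e.g. "0,1"): explicit device indices
    - str without commas (e.g. "2"): a device count
    """
    if isinstance(devices, int):
        return devices
    if isinstance(devices, str):
        if "," in devices:
            return len([d for d in devices.split(",") if d.strip()])
        return int(devices)
    return len(devices)


def expected_world_size(num_nodes, devices):
    # Assumption: world size = number of nodes x processes (devices) per node.
    return num_nodes * parse_devices(devices)


# Values taken from the trainer section of the YAML in this issue.
print(expected_world_size(num_nodes=2, devices="0,1"))  # 4, the world_size shown in the log

# The same two GPUs on a single machine, with num_nodes set to 1.
print(expected_world_size(num_nodes=1, devices="0,1"))  # 2
```

If this reading is right, setting `num_nodes: 1` would be the first thing to try on a single machine with two GPUs. That would also be consistent with the hang appearing before any model-specific code runs.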

Code of Conduct

[X] I agree to follow this project's Code of Conduct

May 30 '23 09:05 Ace-blue

@Ace-blue, the first thing I can think of is that CFA does not yet have a full torch implementation. It uses KMeans from sklearn, which might block multi-GPU training.

https://github.com/openvinotoolkit/anomalib/blob/5eff4e67cc97da0cd058c10e66efd87ab2c3dc85/src/anomalib/models/cfa/torch_model.py#L17

May 30 '23 09:05 samet-akcay

Thanks, but I tried some other models, such as cflow and draem. None of them can be trained with multiple GPUs, and the bug is the same.

May 30 '23 10:05 Ace-blue

Thanks for letting me know. This is a different issue, then. We'll investigate it.

May 30 '23 10:05 samet-akcay

I encountered the same problem. Did you solve it?

Jun 14 '23 09:06 wsj-create

I encountered the same problem. Did you solve it? I tried EfficientAD and hit the same problem. Do the number of images in the dataset and the number of GPUs have to match the batch size?
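
On the batch-size question, a hedged note: assuming Lightning injects torch's `DistributedSampler` when `replace_sampler_ddp: true` is set (as in the config in this issue), the dataset size does not need to divide evenly by the number of GPUs or by the batch size. The sketch below mirrors the sampler's per-rank count arithmetic in plain Python. The function name `per_rank_sample_count` and the dataset size of 209 images are hypothetical, chosen only for illustration.

```python
import math


def per_rank_sample_count(dataset_len, num_replicas, drop_last=False):
    """Number of samples each rank receives under DistributedSampler-style sharding."""
    if drop_last and dataset_len % num_replicas != 0:
        # Trailing samples are dropped so that every rank gets the same count.
        return math.ceil((dataset_len - num_replicas) / num_replicas)
    # Otherwise indices are padded by repeating samples up to an even split.
    return math.ceil(dataset_len / num_replicas)


dataset_len = 209  # hypothetical number of training images
num_gpus = 2
batch_size = 4

samples = per_rank_sample_count(dataset_len, num_gpus)
batches = math.ceil(samples / batch_size)
print(samples, batches)  # 105 samples and 27 batches per rank
```

Under these assumptions every rank sees the same number of batches, so a mismatch between image count, GPU count, and batch size would not by itself explain a hang at process-group initialization.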

Jul 13 '23 03:07 StarShang

Hello author, has this issue been resolved?

Sep 25 '23 05:09 archyin

We are not working on it at the moment. We first need to release anomalib v1.0.

Sep 26 '23 13:09 samet-akcay

Duplicate of #1449

Mar 22 '24 09:03 samet-akcay