Training with multiple GPUs gets stuck
Describe the bug
Hello, author.
I am training a CFA model. It works well on a single GPU, but it gets stuck when I use multiple GPUs.
Thanks for your reply.
Dataset
MVTec
Model
Other (please specify in the field below)
Steps to reproduce the behavior
Complete YAML:
dataset:
  name: mvtec #options: [mvtec, btech, folder]
  format: mvtec
  path: ./datasets/MVTec
  category: P264
  task: segmentation
  train_batch_size: 4
  eval_batch_size: 4
  inference_batch_size: 4
  num_workers: 8
  image_size: 512 # dimensions to which images are resized (mandatory)
  center_crop: null # dimensions to which images are center-cropped after resizing (optional)
  normalization: imagenet # data distribution to which the images will be normalized: [none, imagenet]
  transform_config:
    train: null
    eval: null
  test_split_mode: from_dir # options: [from_dir, synthetic]
  test_split_ratio: 0.2 # fraction of train images held out testing (usage depends on test_split_mode)
  val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
  val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)

model:
  name: cfa
  backbone: wide_resnet50_2
  gamma_c: 1
  gamma_d: 1
  num_nearest_neighbors: 3
  num_hard_negative_features: 3
  radius: 1e-5
  lr: 1e-3
  weight_decay: 5e-4
  amsgrad: true
  early_stopping:
    patience: 500
    metric: pixel_AUROC
    mode: max
  normalization_method: min_max # options: [null, min_max, cdf]

metrics:
  image:
    - AUROC
  pixel:
    - AUROC
  threshold:
    adaptive: true
    image_default: null
    pixel_default: null

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: True # log images to the available loggers (if any)
  image_save_path: null # path to which images will be saved
  mode: full # options: ["full", "simple"]

project:
  seed: 0
  path: ./results

logging:
  logger: [] # options: [comet, tensorboard, wandb, csv] or combinations.
  log_graph: false # Logs the model graph to respective logger.

optimization:
  export_mode: null # options: torch, onnx, openvino

# PL Trainer Args. Don't add extra parameter here.
trainer:
  enable_checkpointing: true
  default_root_dir: null
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  num_nodes: 2
  devices: 0,1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1 # Don't validate before extracting features.
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 500
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  val_check_interval: 1.0 # Don't validate before extracting features.
  log_every_n_steps: 50
  accelerator: gpu # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  strategy: null
  sync_batchnorm: false
  precision: 32
  enable_model_summary: true
  num_sanity_val_steps: 0
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_n_epochs: 0
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  auto_scale_batch_size: false
  plugins: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle
OS information
- OS: [e.g. Ubuntu 20.04]
- Python version: [e.g. 3.8.10]
- Anomalib version: [e.g. 0.3.6]
- PyTorch version: [e.g. 1.9.0]
- CUDA/cuDNN version: [e.g. 11.1]
- GPU models and configuration: [e.g. 2x GeForce RTX 3090]
- Any other relevant information: [e.g. I'm using a custom dataset]
Expected behavior
Training should simply run on multiple GPUs, as it does on a single GPU.
Screenshots
No response
Pip/GitHub
pip
What version/branch did you use?
No response
Configuration YAML
Same as the YAML under "Steps to reproduce the behavior" above.
Logs
2023-05-30 08:54:07,906 - anomalib.utils.loggers - INFO - Loading the experiment logger(s)
2023-05-30 08:54:07,907 - anomalib.utils.callbacks - INFO - Loading the callbacks
/home/user/conda/lib/python3.10/site-packages/anomalib/utils/callbacks/__init__.py:142: UserWarning: Export option: None not found. Defaulting to no model export
warnings.warn(f"Export option: {config.optimization.export_mode} not found. Defaulting to no model export")
2023-05-30 08:54:07,917 - pytorch_lightning.utilities.rank_zero - INFO - GPU available: True (cuda), used: True
2023-05-30 08:54:07,917 - pytorch_lightning.utilities.rank_zero - INFO - TPU available: False, using: 0 TPU cores
2023-05-30 08:54:07,917 - pytorch_lightning.utilities.rank_zero - INFO - IPU available: False, using: 0 IPUs
2023-05-30 08:54:07,917 - pytorch_lightning.utilities.rank_zero - INFO - HPU available: False, using: 0 HPUs
2023-05-30 08:54:07,918 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
2023-05-30 08:54:07,918 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
2023-05-30 08:54:07,918 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_test_batches=1.0)` was configured so 100% of the batches will be used..
2023-05-30 08:54:07,918 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_predict_batches=1.0)` was configured so 100% of the batches will be used..
2023-05-30 08:54:07,918 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
2023-05-30 08:54:07,918 - anomalib - INFO - Training the model.
2023-05-30 08:54:08,082 - anomalib.data.mvtec - INFO - Found the dataset.
[rank: 0] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
/home/user/conda/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (5.1.0)/charset_normalizer (2.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
To use wandb logger install it using `pip install wandb`
/home/user/conda/lib/python3.10/site-packages/anomalib/config/config.py:238: UserWarning: The seed value is now fixed to 0. Up to v0.3.7, the seed was not fixed when the seed value was set to 0. If you want to use the random seed, please select `None` for the seed value (`null` in the YAML file) or remove the `seed` key from the YAML file.
warn(
/home/user/conda/lib/python3.10/site-packages/anomalib/config/config.py:275: UserWarning: config.project.unique_dir is set to False. This does not ensure that your results will be written in an empty directory and you may overwrite files.
warn(
/home/user/conda/lib/python3.10/site-packages/anomalib/config/config.py:299: DeprecationWarning: adaptive will be deprecated in favor of method in config.metrics.threshold in a future release
warn(
/home/user/conda/lib/python3.10/site-packages/anomalib/config/config.py:306: DeprecationWarning: image_default will be deprecated in favor of manual_image in config.metrics.threshold in a future release.
warn(
/home/user/conda/lib/python3.10/site-packages/anomalib/config/config.py:316: DeprecationWarning: pixel_default will be deprecated in favor of manual_pixel in config.metrics.threshold in a future release.
warn(
[rank: 1] Global seed set to 0
2023-05-30 08:54:11,179 - anomalib.data - INFO - Loading the datamodule
2023-05-30 08:54:11,180 - anomalib.data.utils.transform - INFO - No config file has been provided. Using default transforms.
2023-05-30 08:54:11,180 - anomalib.data.utils.transform - INFO - No config file has been provided. Using default transforms.
2023-05-30 08:54:11,180 - anomalib.models - INFO - Loading the model.
2023-05-30 08:54:11,180 - anomalib.models.components.base.anomaly_module - INFO - Initializing CfaLightning model.
/home/user/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/home/user/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=Wide_ResNet50_2_Weights.IMAGENET1K_V1`. You can also use `weights=Wide_ResNet50_2_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
2023-05-30 08:54:12,518 - anomalib.utils.loggers - INFO - Loading the experiment logger(s)
2023-05-30 08:54:12,518 - anomalib.utils.callbacks - INFO - Loading the callbacks
/home/user/conda/lib/python3.10/site-packages/anomalib/utils/callbacks/__init__.py:142: UserWarning: Export option: None not found. Defaulting to no model export
warnings.warn(f"Export option: {config.optimization.export_mode} not found. Defaulting to no model export")
2023-05-30 08:54:12,529 - anomalib - INFO - Training the model.
[rank: 1] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
2023-05-30 08:54:12,818 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2023-05-30 08:54:22,821 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:54:32,823 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:54:42,826 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:54:52,834 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:02,843 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:12,852 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:22,855 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:32,858 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:42,860 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:55:52,862 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:56:02,869 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:56:12,870 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
2023-05-30 08:56:22,880 - torch.distributed.distributed_c10d - INFO - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
^C2023-05-30 08:56:23,549 - anomalib - INFO - Loading the best model weights.
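A note on reading the log above: the barrier lines report `world_size=4`, while only ranks 0 and 1 ever announce themselves (`MEMBER: 1/4`, `MEMBER: 2/4`). In PyTorch Lightning the DDP world size is the number of nodes times the number of devices per node, so `num_nodes: 2` together with `devices: 0,1` asks for four processes. If this run was launched on a single machine with two GPUs (an assumption, since the OS/GPU fields above still hold the template placeholders), the barrier would keep waiting for two ranks that never start, and `num_nodes: 1` may be what was intended. The sketch below only reproduces that arithmetic in plain Python; `count_devices`, `expected_world_size` and `parse_barrier` are hypothetical helper names, not anomalib or Lightning APIs.

```python
import re


def count_devices(devices):
    """Count devices from a Lightning-style value such as "0,1", [0, 1] or 2.

    Simplified on purpose: special values such as -1 or "auto" are not handled.
    """
    if isinstance(devices, int):
        return devices
    if isinstance(devices, str):
        if "," in devices:
            # Comma-separated string: a list of device indices.
            return len([d for d in devices.split(",") if d.strip()])
        # Plain numeric string: a device count.
        return int(devices)
    return len(devices)


def expected_world_size(num_nodes, devices):
    # World size = number of nodes x processes (devices) per node.
    return num_nodes * count_devices(devices)


BARRIER = re.compile(r"world_size=(\d+), worker_count=(\d+)")


def parse_barrier(line):
    """Return (world_size, worker_count) from a store-based barrier log line."""
    match = BARRIER.search(line)
    return (int(match.group(1)), int(match.group(2))) if match else None


line = (
    "Waiting in store based barrier to initialize process group for rank: 1, "
    "key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)"
)
print(expected_world_size(num_nodes=2, devices="0,1"))  # value from the posted config
print(expected_world_size(num_nodes=1, devices="0,1"))  # single-node alternative
print(parse_barrier(line))
```

If the world size computed from the config is larger than the number of processes that can actually start on the available hardware, the process group cannot finish initializing until the timeout shown in the log.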
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
@Ace-blue, the first thing I can think of is that CFA does not yet have a full torch implementation. It uses KMeans from sklearn, which might block multi-GPU training.
https://github.com/openvinotoolkit/anomalib/blob/5eff4e67cc97da0cd058c10e66efd87ab2c3dc85/src/anomalib/models/cfa/torch_model.py#L17
Thanks, but I also tried other models such as cflow and draem. None of them can be trained with multiple GPUs, and the bug is the same.
Thanks for letting me know. This is a separate issue, then. We'll investigate it.
I encountered the same problem, did you solve it?
I encountered the same problem, did you solve it? I have also tried EfficientAD, and it shows the same problem. Do the number of images in the dataset and the number of GPUs have to match the batch size?
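On the batch-size question: as far as I know, the image count does not have to be a multiple of the GPU count or of the batch size. With `replace_sampler_ddp: true` (as in the posted config), Lightning wraps the training data in PyTorch's `DistributedSampler`, which by default (`drop_last=False`) pads the index list with repeated samples so that every rank receives the same number. A minimal sketch of that arithmetic follows; the helper names and dataset sizes are made up for illustration.

```python
import math


def samples_per_rank(dataset_size: int, world_size: int) -> int:
    # DistributedSampler with drop_last=False pads by repeating samples,
    # so each rank gets ceil(dataset_size / world_size) indices.
    return math.ceil(dataset_size / world_size)


def batches_per_rank(dataset_size: int, world_size: int, batch_size: int) -> int:
    # A DataLoader with drop_last=False keeps the final, smaller batch.
    return math.ceil(samples_per_rank(dataset_size, world_size) / batch_size)


# Hypothetical dataset sizes, 2 GPUs, train_batch_size 4 as in the config.
for n in (10, 11, 209):
    print(n, samples_per_rank(n, 2), batches_per_rank(n, 2, 4))
```

Every rank ends up with the same number of batches per epoch, so an uneven image count alone would not be expected to leave one rank waiting on another.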
Hello author, has this issue been resolved?
We are not working on it at the moment. We first need to release anomalib v1.0.
Duplicate of #1449