[Bug]: EfficientAD memory growth during validation
Describe the bug
I am working with the EfficientAD model and have been training it in AWS SageMaker. I have noticed that GPU memory usage explodes during validation. I was wondering whether this is related to this issue involving the mean and standard deviation calculations.
I am using the current version of anomalib (not the release version, which does not include the above fix). I have attached screenshots showing epoch progress (captured via SageMaker metrics/regex) and the GPU memory usage.
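For context, the epoch numbers in the screenshots come from SageMaker metric definitions that scrape the training logs. Something along these lines; the entry point, role, S3 path, and regexes below are placeholders rather than my exact setup:

```python
from sagemaker.pytorch import PyTorch

# Placeholder estimator setup; entry_point, role, and regexes must match
# whatever the training script actually prints to stdout.
estimator = PyTorch(
    entry_point="train.py",
    role="SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.g4dn.2xlarge",
    framework_version="2.0.1",
    py_version="py310",
    metric_definitions=[
        {"Name": "epoch", "Regex": r"Epoch (\d+)"},
        {"Name": "train_loss", "Regex": r"train_loss=([0-9.]+)"},
    ],
)
estimator.fit({"training": "s3://my-bucket/thermal/"})
```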
I am using a custom dataloader that subclasses the Folder one, since my data needs special decoding; aside from that it should be quite similar in terms of operations.
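To give an idea of what the custom decoding does, here is a rough sketch. It assumes 16-bit thermal PNGs and uses the min_val/max_val values from the config below; the function name and exact scaling are only illustrative, not my actual implementation:

```python
import cv2
import numpy as np
import torch


def decode_thermal_png(path: str, min_val: float = 400.0, max_val: float = 1000.0) -> torch.Tensor:
    """Hypothetical decoder: read a 16-bit thermal PNG and scale it to [0, 1]."""
    raw = cv2.imread(path, cv2.IMREAD_UNCHANGED).astype(np.float32)  # HxW raw sensor values
    scaled = np.clip((raw - min_val) / (max_val - min_val), 0.0, 1.0)
    # Replicate to 3 channels so the input matches what the ImageNet-pretrained
    # teacher/student backbones expect.
    image = np.repeat(scaled[..., None], 3, axis=-1)
    return torch.from_numpy(image).permute(2, 0, 1)  # CxHxW float32 tensor
```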
Dataset
Folder
Model
Other (please specify in the field below)
Steps to reproduce the behavior
Train the EfficientAD model in the cloud (AWS SageMaker) and monitor GPU memory usage across epochs.
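In more detail, the job essentially runs the standard anomalib training flow with the config below. Roughly equivalent Python, assuming the 0.x API (get_configurable_parameters, get_datamodule, get_model, get_callbacks); the config path is a placeholder:

```python
from pytorch_lightning import Trainer

from anomalib.config import get_configurable_parameters
from anomalib.data import get_datamodule
from anomalib.models import get_model
from anomalib.utils.callbacks import get_callbacks

# Load the YAML shown in the "Configuration YAML" section (path is a placeholder).
config = get_configurable_parameters(model_name="efficient_ad", config_path="config.yaml")

datamodule = get_datamodule(config)  # my custom Folder subclass is swapped in here
model = get_model(config)
callbacks = get_callbacks(config)

trainer = Trainer(**config.trainer, callbacks=callbacks)
trainer.fit(model=model, datamodule=datamodule)
```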
OS information
OS information:
- OS: Ubuntu 20.04
- Python version: 3.7
- Anomalib version: current commit 1f50c95
- PyTorch version: 2.0
- CUDA/cuDNN version:
- GPU models and configuration: Sagemaker ml.g4dn.2xlarge
- Any other relevant information: I am using a custom dataset
Expected behavior
Memory usage remains stable from the start of training.
Screenshots
Pip/GitHub
GitHub
What version/branch did you use?
commit 1f50c95
Configuration YAML
dataset:
  name: thermal
  format: folder
  root: ./root/
  normal_dir: raw_unlabelled # name of the folder containing normal images.
  abnormal_dir: null # name of the folder containing abnormal images.
  task: classification # classification or segmentation
  mask_dir: null # optional
  extensions: .png
  normal_test_dir: null # optional
  train_batch_size: 8
  eval_batch_size: 8
  num_workers: 8
  image_size: 256 # dimensions to which images are resized (mandatory)
  # image_size: [386, 516]
  center_crop: null # dimensions to which images are center-cropped after resizing (optional)
  normalization: none # data distribution to which the images will be normalized: [none, imagenet]
  transform_config:
    train: null
    eval: null
  test_split_mode: from_dir # options: [from_dir, synthetic]
  test_split_ratio: 0.05 # fraction of train images held out for testing (usage depends on test_split_mode)
  val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
  val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)
  # image normalization params
  min_val: 400.0
  max_val: 1000.0
model:
  name: efficient_ad
  teacher_out_channels: 384
  model_size: small # options: [small, medium]
  lr: 0.0001
  weight_decay: 0.00001
  padding: false
  pad_maps: true # relevant for "padding: false", see EfficientAd in lightning_model.py
  # generic params
  normalization_method: min_max # options: [null, min_max, cdf]
  early_stopping:
    patience: 5
    metric: train_loss
    mode: min
metrics:
  threshold:
    method: adaptive # options: [adaptive, manual]
    manual_image: null
    manual_pixel: null
visualization:
  show_images: False # show images on the screen
  save_images: False # save images to the file system
  log_images: False # log images to the available loggers (if any)
  image_save_path: null # path to which images will be saved
  mode: full # options: ["full", "simple"]
project:
  seed: 42
  path: /opt/ml/model
logging:
  logger: [csv] # options: [comet, tensorboard, wandb, csv] or combinations.
  log_graph: true # Logs the model graph to respective logger.
optimization:
  export_mode: openvino # options: torch, onnx, openvino
# PL Trainer Args. Don't add extra parameter here.
trainer:
  enable_checkpointing: true
  default_root_dir: null
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  num_nodes: 1
  devices: 1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 50
  min_epochs: 10
  max_steps: 500000
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  val_check_interval: 1.0
  log_every_n_steps: 50
  accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  strategy: null
  sync_batchnorm: false
  precision: 32
  enable_model_summary: true
  num_sanity_val_steps: 0
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_n_epochs: 0
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  auto_scale_batch_size: false
  plugins: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle
Logs
See screenshots
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
~~Hello. I believe the problem is related to the issue you linked, but https://github.com/openvinotoolkit/anomalib/pull/1340 describes the main reason. If I understand correctly, you have the latest code, so the memory issue shouldn't be that apparent, but I believe the increase still happens because the implementation loads the training and validation dataloaders at the same time.~~ Upon further inspection, the growth happens slowly over the epochs, so I don't think it's due to the PR linked above. I'm not even sure it happens during validation; it could be at the start of each epoch.
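To narrow this down, a small callback that logs CUDA memory at epoch and validation boundaries might help. This is just a sketch (the class name is made up), but the Lightning hooks and torch.cuda calls are standard:

```python
import logging

import torch
from pytorch_lightning import Callback, LightningModule, Trainer

logger = logging.getLogger(__name__)


class CudaMemoryLogger(Callback):
    """Hypothetical helper: log CUDA memory at epoch/validation boundaries."""

    @staticmethod
    def _log(tag: str) -> None:
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        logger.info("%s: allocated=%.0f MiB reserved=%.0f MiB", tag, allocated, reserved)

    def on_train_epoch_start(self, trainer: Trainer, pl_module: LightningModule) -> None:
        self._log(f"train epoch {trainer.current_epoch} start")

    def on_validation_epoch_start(self, trainer: Trainer, pl_module: LightningModule) -> None:
        self._log(f"validation epoch {trainer.current_epoch} start")

    def on_validation_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        self._log(f"validation epoch {trainer.current_epoch} end")
```

Adding an instance of this callback to the trainer's callbacks list and comparing the logged numbers across epochs should show whether the growth lines up with validation or with the start of each training epoch.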