
[BUG] DeepSpeed CUDA OOM on SwinUNETR from MONAI

Open majercakdavid opened this issue 1 year ago • 7 comments

Describe the bug I'm trying to run training of the SwinUNETR model on a multi-GPU node (4xV100, 16GB VRAM each) with an effective batch size of 1 per GPU and a sample size of 96x96x96. However, even after many tweaks to the DS config I'm still getting a CUDA OOM error.

To Reproduce Steps to reproduce the behavior:

  1. Clone 'MONAI SwinUNETR'
  2. Use deepspeed.initialize with the following configuration (a minimal wiring sketch is shown after the steps below):
{
    "train_micro_batch_size_per_gpu": 1,
    "steps_per_print": 1,
    "fp16": {
        "enabled": true
    },
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [
                0.8,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 100
        }
    },
    "wall_clock_breakdown": false,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        },
        "contiguous_gradients": true,
        "overlap_comm": false,
        "allgather_bucket_size": 5e5,
        "reduce_bucket_size": 5e5
    },
    "zero_allow_untested_optimizer": false,
    "activation_checkpointing": {
        "partition_activations": true,
        "cpu_checkpointing": false,
        "contiguous_memory_optimization": false,
        "number_checkpoints": null,
        "synchronize_checkpoint_boundary": false,
        "profile": false
    }
}
  3. Get OOM error
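
For completeness, a minimal sketch of how the model and the config above might be wired together (assuming the config is saved as ds_config.json; in_channels, out_channels, and feature_size are placeholders rather than the exact values from the training script):

# Minimal sketch: wrap MONAI's SwinUNETR with DeepSpeed using the config above.
import deepspeed
import torch
from monai.networks.nets import SwinUNETR

model = SwinUNETR(
    img_size=(96, 96, 96),
    in_channels=1,      # placeholder
    out_channels=2,     # placeholder
    feature_size=48,    # placeholder
)

# DeepSpeed builds the Adam optimizer and WarmupLR scheduler from the config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

# One illustrative step; fp16 is enabled in the config, so cast the input to half.
x = torch.randn(1, 1, 96, 96, 96, dtype=torch.half, device=model_engine.device)
loss = model_engine(x).mean()   # placeholder loss
model_engine.backward(loss)
model_engine.step()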

Expected behavior Training proceeds without OOM error

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU: 4xV100 - 16GB VRAM
  • Python version: 3.8

Launcher context AML pipeline with PyTorch distribution:

distribution:
  type: pytorch

Docker context mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.6-cudnn8-ubuntu20.04

Additional context

  • Am I missing any further optimizations I can do?
  • Is it possible to make train_batch_size smaller than the number of GPUs so that the GPUs can then share memory? (See the sketch below.)
  • How can model parallelism be enabled effectively in DeepSpeed, if that is even possible?
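
For reference on the second question, DeepSpeed derives the effective batch size as train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs, so it cannot drop below the GPU count; a quick illustrative check:

# Illustrative check of DeepSpeed's batch-size relation (values are examples).
train_micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 1
num_gpus = 4
train_batch_size = (
    train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
)
assert train_batch_size >= num_gpus  # smallest possible value equals the GPU count
print(train_batch_size)  # 4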

majercakdavid avatar Mar 02 '23 21:03 majercakdavid

@majercakdavid, can you please share log/stack trace?

tjruwase avatar Mar 02 '23 21:03 tjruwase

@tjruwase sure, here is the log for the 0-th process: std_log_process_0.txt

majercakdavid avatar Mar 02 '23 21:03 majercakdavid

Based on your log, it looks like the OOM is caused by activation memory consumption. The screenshot below shows that deepspeed.initialize() offloaded the model states so that GPU memory is almost empty.

ZeRO helps with the memory consumption of model states, but not of activations. You will need to use gradient checkpointing to fit these activations. The link you provided shows some examples of gradient checkpointing usage. Have you tried those? Also, can you share your actual command line? Thanks!
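
For example, MONAI's SwinUNETR exposes a use_checkpoint flag that enables activation checkpointing inside the transformer blocks (a minimal sketch; the channel counts are placeholders):

# Sketch: enable gradient (activation) checkpointing in MONAI's SwinUNETR.
# use_checkpoint=True recomputes block activations during the backward pass
# instead of storing them, trading extra compute for lower GPU memory.
from monai.networks.nets import SwinUNETR

model = SwinUNETR(
    img_size=(96, 96, 96),
    in_channels=1,        # placeholder
    out_channels=2,       # placeholder
    feature_size=48,      # placeholder
    use_checkpoint=True,  # gradient checkpointing
)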

tjruwase avatar Mar 04 '23 11:03 tjruwase

@majercakdavid, do you still need this opened?

tjruwase avatar Mar 13 '23 14:03 tjruwase

@tjruwase unfortunately yes. After I enabled checkpointing for the forward pass I still get an OOM error during the backward pass. Let me attach the logs: std_log_process_0 (2).txt

majercakdavid avatar Mar 13 '23 20:03 majercakdavid

@tjruwase if I use fp16 I can use 96x96x96, but I get NaN for the loss. If I use bfloat16 I get proper loss values and can use a 64x64x64 tensor as input, but as soon as I use 96x96x96 I get the following error: std_log_process_0 (3).txt
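
For context, the precision toggle in the DeepSpeed config looks roughly like this (an illustrative sketch as a Python dict; only one of the two sections should be enabled at a time, and initial_scale_power is just one knob that can help with fp16 loss-scale NaNs):

# Sketch: fp16 vs bf16 sections of a DeepSpeed config, as a dict that could be
# passed to deepspeed.initialize(config=...). Enable only one precision mode.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {
        "enabled": False,           # fp16 fit 96x96x96 here but the loss went NaN
        "initial_scale_power": 16,  # starting loss scale = 2**16
    },
    "bf16": {
        "enabled": True,            # bf16 gave finite loss but OOMed at 96x96x96
    },
}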

majercakdavid avatar Mar 15 '23 18:03 majercakdavid

It seems you are running out of GPU memory. Can you share logs for 64x64x64 with bfloat16?

tjruwase avatar Mar 15 '23 19:03 tjruwase

@tjruwase sorry for the late response: std_log_process_0 (4).txt

majercakdavid avatar Mar 30 '23 21:03 majercakdavid