
[BUG] DeepSpeed CUDA OOM on SwinUNETR from MONAI

Open majercakdavid opened this issue 1 year ago • 7 comments

Describe the bug I'm trying to run training of the SwinUNETR model on a multi-GPU node (4xV100, 16GB VRAM each) with an effective batch size of 1 per GPU and a sample size of 96x96x96. However, even after many tweaks to the DS config I'm still getting a CUDA OOM error.

To Reproduce Steps to reproduce the behavior:

  1. Clone 'MONAI SwinUNETR'
  2. Use deepspeed.initialize with the following configuration (a minimal wiring sketch is shown after the steps below):
{
    "train_micro_batch_size_per_gpu": 1,
    "steps_per_print": 1,
    "fp16": {
        "enabled": true
    },
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [
                0.8,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 100
        }
    },
    "wall_clock_breakdown": false,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        },
        "contiguous_gradients": true,
        "overlap_comm": false,
        "allgather_bucket_size": 5e5,
        "reduce_bucket_size": 5e5
    },
    "zero_allow_untested_optimizer": false,
    "activation_checkpointing": {
        "partition_activations": true,
        "cpu_checkpointing": false,
        "contiguous_memory_optimization": false,
        "number_checkpoints": null,
        "synchronize_checkpoint_boundary": false,
        "profile": false
    }
}
  3. Get OOM error
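
For completeness, a minimal sketch of how the model and the config above might be wired together (assuming the config is saved as ds_config.json; in_channels, out_channels, and feature_size are placeholders rather than the exact values from the training script):

# Minimal sketch: wrap MONAI's SwinUNETR with DeepSpeed using the config above.
import deepspeed
import torch
from monai.networks.nets import SwinUNETR

model = SwinUNETR(
    img_size=(96, 96, 96),
    in_channels=1,      # placeholder
    out_channels=2,     # placeholder
    feature_size=48,    # placeholder
)

# DeepSpeed builds the Adam optimizer and WarmupLR scheduler from the config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

# One illustrative step; fp16 is enabled in the config, so cast the input to half.
x = torch.randn(1, 1, 96, 96, 96, dtype=torch.half, device=model_engine.device)
loss = model_engine(x).mean()   # placeholder loss
model_engine.backward(loss)
model_engine.step()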

Expected behavior Training proceeds without OOM error

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU: 4xV100 - 16GB VRAM
  • Python version: 3.8

Launcher context AML pipeline with PyTorch distribution:

distribution:
  type: pytorch

Docker context mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.6-cudnn8-ubuntu20.04

Additional context

  • Am I missing any further optimizations I can do?
  • Is it possible to make train_batch_size smaller than the number of GPUs so that the GPUs can then share memory? (See the sketch below.)
  • How can model parallelism be enabled effectively in DeepSpeed, if that is even possible?
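
For reference on the second question, DeepSpeed derives the effective batch size as train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs, so it cannot drop below the GPU count; a quick illustrative check:

# Illustrative check of DeepSpeed's batch-size relation (values are examples).
train_micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 1
num_gpus = 4
train_batch_size = (
    train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
)
assert train_batch_size >= num_gpus  # smallest possible value equals the GPU count
print(train_batch_size)  # 4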

majercakdavid avatar Mar 02 '23 21:03 majercakdavid

@majercakdavid, can you please share log/stack trace?

tjruwase avatar Mar 02 '23 21:03 tjruwase

@tjruwase sure, here is the log for the 0-th process: std_log_process_0.txt

majercakdavid avatar Mar 02 '23 21:03 majercakdavid

Based on your log, it looks like the OOM is caused by activation memory consumption. The screenshot below shows that deepspeed.initialize() offloaded the model states so that GPU memory is almost empty.

ZeRO helps with the memory consumption of model states, but not of activations. You will need to use gradient checkpointing to fit these activations. The link you provided shows some examples of gradient checkpointing usage. Have you tried those? Also, can you share your actual command line? Thanks!
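
For example, MONAI's SwinUNETR exposes a use_checkpoint flag that enables activation checkpointing inside the transformer blocks (a minimal sketch; the channel counts are placeholders):

# Sketch: enable gradient (activation) checkpointing in MONAI's SwinUNETR.
# use_checkpoint=True recomputes block activations during the backward pass
# instead of storing them, trading extra compute for lower GPU memory.
from monai.networks.nets import SwinUNETR

model = SwinUNETR(
    img_size=(96, 96, 96),
    in_channels=1,        # placeholder
    out_channels=2,       # placeholder
    feature_size=48,      # placeholder
    use_checkpoint=True,  # gradient checkpointing
)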

tjruwase avatar Mar 04 '23 11:03 tjruwase

@majercakdavid, do you still need this opened?

tjruwase avatar Mar 13 '23 14:03 tjruwase

@tjruwase unfortunately yes. After I enabled checkpointing for the forward pass I still get an OOM error during the backward pass. Let me attach the logs: std_log_process_0 (2).txt

majercakdavid avatar Mar 13 '23 20:03 majercakdavid

@tjruwase if I use fp16 I can use 96x96x96, but I get NaN for the loss. If I use bfloat16 I get proper loss values and can use a 64x64x64 tensor as input, but as soon as I use 96x96x96 I get the following error: std_log_process_0 (3).txt
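
For context, the precision toggle in the DeepSpeed config looks roughly like this (an illustrative sketch as a Python dict; only one of the two sections should be enabled at a time, and initial_scale_power is just one knob that can help with fp16 loss-scale NaNs):

# Sketch: fp16 vs bf16 sections of a DeepSpeed config, as a dict that could be
# passed to deepspeed.initialize(config=...). Enable only one precision mode.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {
        "enabled": False,           # fp16 fit 96x96x96 here but the loss went NaN
        "initial_scale_power": 16,  # starting loss scale = 2**16
    },
    "bf16": {
        "enabled": True,            # bf16 gave finite loss but OOMed at 96x96x96
    },
}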

majercakdavid avatar Mar 15 '23 18:03 majercakdavid

It seems you are running out of GPU memory. Can you share logs for 64x64x64 with bfloat16?

tjruwase avatar Mar 15 '23 19:03 tjruwase

@tjruwase sorry for the late response: std_log_process_0 (4).txt

majercakdavid avatar Mar 30 '23 21:03 majercakdavid