DeepSpeed
[BUG] High GPU memory use when fine-tuning Flan-T5-xxl (11B) using stage 3
Hi folks,
I've recently been trying to fine-tune Flan-T5 on a node with 8 NVIDIA RTX A6000 GPUs (48GB VRAM each) and ~500 GB RAM. However, I have been unable to get per-GPU VRAM usage to decrease even when increasing the number of GPUs. Optimizer offloading does work and lets me fit the whole model with up to 4 samples per GPU (context size: 256 input + 256 target = 512 total), but I had expected batch sizes similar to the ones in this tutorial: https://www.philschmid.de/fine-tune-flan-t5-deepspeed, where Phil manages to fine-tune the model on 4 GPUs with only 24GB VRAM each, using similar context lengths.
For instance, when running my training loop on a single GPU with batch size 1 and optimizer offloading, memory usage stabilizes at 35678 MiB of GPU VRAM and roughly 250 GB of CPU RAM. When doing the same with batch size 1 per GPU on either 4 or 8 GPUs, each GPU uses 33300-33700 MiB of VRAM, which I take to mean that DeepSpeed is not partitioning many model parameters across GPUs and is mostly relying on optimizer offloading. Indeed, without optimizer offloading I get OOM errors.
Notably, I'm using a custom training loop (no HF Trainer), but I'm following most of the recommendations in this documentation, including turning gradient checkpointing on, which is crucial for being able to train with batch size 1 at all.
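For reference, the loop is structured roughly as in the following sketch (not the exact pastebin script; the model name, dataloader, and device handling are assumptions, and ds_config is the dict shown further down):

import deepspeed
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
model.gradient_checkpointing_enable()  # needed to fit even batch size 1

engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,  # the dict shown below
)

for batch in train_dataloader:  # hypothetical dataloader yielding tokenized batches with labels
    batch = {k: v.to(engine.device) for k, v in batch.items()}
    loss = engine(**batch).loss
    engine.backward(loss)  # DeepSpeed handles gradient partitioning/reduction
    engine.step()          # optimizer step + WarmupLR scheduler step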
Is there anything I'm missing that could explain why per-GPU VRAM usage doesn't drop when I use more devices? Is there something special in the HF Trainer integration that lets DeepSpeed scale so well with HF models in Phil's example?
Any help would be greatly appreciated!
My ds_config is as follows:
ds_config = {
    "train_micro_batch_size_per_gpu": batch_size,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7,
        },
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500,
        },
    },
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "sub_group_size": 1e9,
        "reduce_bucket_size": hidden_size**2,  # hidden_size: 4096
        "stage3_prefetch_bucket_size": 0.9 * hidden_size**2,
        "stage3_param_persistence_threshold": 10 * hidden_size,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": True,
        "round_robin_gradients": True,
    },
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": False,
}
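The hidden_size in the three ZeRO-3 bucket fields above is 4096 for Flan-T5-xxl; if one prefers to derive it from the checkpoint instead of hard-coding it, something like this works (a small sketch; T5 exposes its hidden size as d_model):

from transformers import AutoConfig

hidden_size = AutoConfig.from_pretrained("google/flan-t5-xxl").d_model  # 4096 for Flan-T5-xxl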
To Reproduce
Steps to reproduce the behavior:
- A minimal example can be found here: https://pastebin.com/x3XTqgJx
- It can be run using the following command:
deepspeed --num_gpus 1 script.py --deepspeed
Expected behavior
I expect DeepSpeed to partition more of the model parameters across GPUs. Ideally, I'd love to be able to train the whole model on 8 A6000 GPUs without any optimizer offloading.
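As a sanity check on what ZeRO-3 should need per device, DeepSpeed ships a memory estimator that can be run on the loaded model (a sketch; the exact figures it prints depend on the DeepSpeed version):

from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Loads the model once on CPU, then prints per-GPU/per-CPU estimates for
# ZeRO-3 with and without optimizer/parameter offloading.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)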
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ceballosarroyo.a/.conda/envs/text_cr/lib/python3.9/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/ceballosarroyo.a/.conda/envs/text_cr/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
System info:
- OS: CentOS Linux 7
- 1 node with 8 A6000 GPUs, 48GB VRAM each
- Python 3.9
Having almost identical issues.
So far, we have only managed to fine-tune with this script using 8x NVIDIA A100 GPUs. Changing gradient_accumulation_steps could likely solve the problem.
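If that route is tried, the accumulation factor goes directly into the DeepSpeed config (a sketch; the value 8 is only an example):

ds_config["gradient_accumulation_steps"] = 8  # effective batch = 8 * micro_batch * num_gpus, without extra activation memory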
@alceballosa, @baptistejamin, @jiudingsun01 could you please provide some log snippets that could help us understand what is happening? In particular, could you share the memory profiling messages such as the following: https://github.com/microsoft/DeepSpeed/blob/b303fa8b5b27bb7f27d8d9f2ceab8da3f41ba8e6/deepspeed/runtime/zero/stage3.py#L338
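Those messages are emitted by see_memory_usage, which can also be called manually from a custom training loop to force a snapshot at any point (a sketch; the message string is arbitrary):

from deepspeed.runtime.utils import see_memory_usage

see_memory_usage("after forward", force=True)  # prints allocated/cached GPU memory and CPU virtual memory on rank 0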
Hi @alceballosa, thanks for providing the repro script. To get the model to load properly, you should try moving the lines
ds_config = get_ds_confg()
dschf = HfDeepSpeedConfig(ds_config)
to before the pretrained model checkpoints are downloaded (around line 198). This way the model is partitioned across the GPUs as it is being read by the CPU. Here is some documentation on HfDeepSpeedConfig: https://huggingface.co/docs/transformers/main_classes/deepspeed
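Put concretely, the suggested ordering looks roughly like this (a sketch around the pastebin script; the import path varies with the transformers version and the model name is assumed):

from transformers import AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = get_ds_confg()            # the script's existing config helper
dschf = HfDeepSpeedConfig(ds_config)  # must be created (and kept alive) before from_pretrained

# With the ZeRO-3 config in scope, from_pretrained goes through zero.Init and the
# 11B parameters are partitioned across GPUs as they are loaded, instead of being
# fully materialized on every rank.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")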