
Default configuration running with V100-32G causes OOM

Open binderwang opened this issue 2 years ago • 5 comments

When running the official default configuration on a single V100-32G, I hit OOM across the whole pipeline. Following other issues that mention this, I changed the batch_size from 16 to 8, and then it works. I just wonder whether this is expected, since I did not see any instruction about it.

binderwang avatar Apr 21 '23 09:04 binderwang

@binderwang, can you please share the command line and stack trace of the OOM? Thanks!

tjruwase avatar Apr 21 '23 17:04 tjruwase

Assuming you are using the default settings for the single-gpu deployment type, I think this is expected with only 32GB of memory. Lowering the batch size will decrease memory requirements.

mrwyattii avatar Apr 21 '23 20:04 mrwyattii

In the bash file, I just modified the command line to this:

deepspeed --num_gpus 1 main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --gradient_accumulation_steps 2 \
   --lora_dim 128 \
   --zero_stage $ZERO_STAGE \
   --only_optimize_lora \
   --deepspeed \
   --output_dir $OUTPUT &> $OUTPUT/training.log

And I set zero_stage to 2. With the command above I can run Step 1 successfully on a single V100-32G. But if I change nothing in the default command line, it causes an OOM error.

binderwang avatar Apr 22 '23 23:04 binderwang

And the error is as follows:

Beginning of Epoch 1/1, Total Micro Batches 2860
Traceback (most recent call last):
  File "main.py", line 341, in <module>
    main()
  File "main.py", line 310, in main
    outputs = model(**batch, use_cache=False)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11.0/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11.0/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11.0/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1695, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11.0/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11.0/lib/python3.7/site-packages/transformers/models/opt/modeling_opt.py", line 947, in forward
    return_dict=return_dict,
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11.0/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11.0/lib/python3.7/site-packages/transformers/models/opt/modeling_opt.py", line 710, in forward
    use_cache=use_cache,
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11.0/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11.0/lib/python3.7/site-packages/transformers/models/opt/modeling_opt.py", line 334, in forward
    output_attentions=output_attentions,
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11.0/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11.0/lib/python3.7/site-packages/transformers/models/opt/modeling_opt.py", line 211, in forward
    attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 31.75 GiB total capacity; 30.33 GiB already allocated; 231.50 MiB free; 30.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

binderwang avatar Apr 22 '23 23:04 binderwang
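For anyone hitting the same wall, the allocator hint at the end of that traceback can also be tried on its own, before or alongside the batch-size change. A minimal sketch, assuming the same single-GPU launch shown above; the 128 MiB split size is an illustrative value, not a maintainer recommendation:

```bash
# Sketch only: cap the CUDA caching allocator's split size to reduce
# fragmentation, as suggested by the OOM message itself.
# The value 128 is an example, not a tuned recommendation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Then re-run the same training command from the bash script, unchanged.
```

Note that this only mitigates fragmentation; it will not help if the 30+ GiB already allocated is genuinely needed, which is why lowering the batch size remains the primary fix here.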

@binderwang this is expected behavior. The scripts we released are intended for A6000/A100 GPUs. If you are running on a GPU with less memory, you will be required to modify the scripts (e.g., lower the batch size) so that the model can train with less available memory.

mrwyattii avatar Apr 24 '23 17:04 mrwyattii

It is weird that when I only add --gradient_checkpointing to the default script, I can successfully run these steps on a V100-32G, using only about half of the GPU memory.

binderwang avatar Apr 28 '23 03:04 binderwang
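As a side note, the roughly-half-memory observation is easy to verify while a step is running; this is a standard nvidia-smi query, nothing specific to DeepSpeed:

```bash
# Poll GPU memory every 2 seconds while training runs,
# to see how much of the 32 GiB the job actually uses.
watch -n 2 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```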

@binderwang the options that will have the greatest impact on memory requirements right now are --gradient_checkpointing, --only_optimize_lora, --zero_stage {0|1|2|3}, and --per_device_*_batch_size.

The examples we provide were tuned for specific systems (primarily A6000 / A100 GPUs) and we encourage users to modify the examples to fit their systems. We are also working on an autotuning feature which will automatically detect optimal settings for the parameters listed above. Stay tuned for those updates!

mrwyattii avatar May 01 '23 18:05 mrwyattii
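Putting the options from the last comment together, a reduced-memory launch for a 32 GB card might look like the sketch below. It is only an illustration built from flags already shown in this thread; whether --gradient_checkpointing and --only_optimize_lora can be combined depends on the script version, so the LoRA flags from the earlier command are left out here.

```bash
# Sketch only: memory-reduction flags discussed in this thread,
# applied to the step-1 single-GPU launch. Values are illustrative.
ZERO_STAGE=2
OUTPUT=./output
mkdir -p $OUTPUT

deepspeed --num_gpus 1 main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --gradient_accumulation_steps 2 \
   --gradient_checkpointing \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT &> $OUTPUT/training.log
```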