DeepSpeedExamples
single gpu 6.7b lora CUDA OOM with A6000 48G
I am trying to run the DeepSpeed-Chat example on a single GPU, an Nvidia A6000 48G.
I could run all 3 steps fine with the 1.3b example.
But when I run single_gpu/run_6.7b_lora.sh, I get a CUDA Out Of Memory error at step 3. Steps 1 and 2 ran fine.
Even after I minimized the configuration, I still get OOM.
Here is my run_6.7b_lora.sh config:
ACTOR_ZERO_STAGE="--actor_zero_stage 0"
CRITIC_ZERO_STAGE="--critic_zero_stage 0"
ACTOR_MODEL_PATH=../step1_supervised_finetuning/output
CRITIC_MODEL_PATH=../step2_reward_model_finetuning/output
OUTPUT="./output"
Num_Padding_at_Beginning=1 # this is model related
Actor_Lr=5e-4
Critic_Lr=5e-6
mkdir -p $OUTPUT
deepspeed --num_gpus 1 main.py \
--data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP \
--data_split 2,4,4 \
--actor_model_name_or_path $ACTOR_MODEL_PATH \
--critic_model_name_or_path $CRITIC_MODEL_PATH \
--num_padding_at_beginning 1 \
--per_device_train_batch_size 1 \
--per_device_mini_train_batch_size 1 \
--generation_batch_numbers 1 \
--ppo_epochs 1 \
--max_answer_seq_len 128 \
--max_prompt_seq_len 128 \
--ppo_epochs 1 \
--actor_learning_rate ${Actor_Lr} \
--critic_learning_rate ${Critic_Lr} \
--actor_weight_decay 0.1 \
--critic_weight_decay 0.1 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 8 \
--num_warmup_steps 100 \
--deepspeed --seed 1234 \
${ACTOR_ZERO_STAGE} \
${CRITIC_ZERO_STAGE} ${OFFLOAD} \
--actor_lora_dim 128 \
--actor_gradient_checkpointing \
--critic_gradient_checkpointing \
--enable_hybrid_engine \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
And I get the OOM at the point where gradients are accumulated/reduced during the backward pass.
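For context, a rough back-of-envelope (assuming fp16 weights at 2 bytes per parameter): the 6.7b actor alone is about 6.7e9 * 2 bytes ≈ 13.4 GB, and step 3 also keeps a frozen reference copy of the actor (another ~13.4 GB) plus the critic and reward models resident on the same GPU. With --actor_zero_stage 0 and --critic_zero_stage 0 nothing is partitioned or offloaded, so gradients, optimizer state, activations and the generation workspace all come on top of those weights, which lines up with the ~45.8 GiB already allocated in the traceback below.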
Here is my error point:
...
******************[end] Initialized Reward Model [end] (duration: 2.67s)******************
***** Running training *****
Beginning of Epoch 1/1, Total Generation Batches 264292
------------------------------------------------------
Free memory : 7.042725 (GigaBytes)
Total memory: 47.544312 (GigaBytes)
Requested memory: 0.304688 (GigaBytes)
Setting maximum total tokens (input + output) to 512
WorkSpace: 0x7f8f62000000
------------------------------------------------------
/home/ados/anaconda3/envs/DeepSpeed/lib/python3.9/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
epoch: 0|step: 0|ppo_ep: 1|act_loss: 0.2626953125|cri_loss: 0.13720703125|unsuper_loss: 0.0
average reward score: 2.8828125
-------------------------------------------------------------------------------------
|E2E latency=3.87s |Gather latency=0.00s (0.00%) |Generate time=3.40s (87.89%) |Training time=0.25s (6.47%) |Others=0.22 (5.65%)|CurSamplesPerSec=0.26 |AvgSamplesPerSec=0.26
epoch: 0|step: 1|ppo_ep: 1|act_loss: 0.72265625|cri_loss: 0.4296875|unsuper_loss: 0.0
average reward score: 3.65625
-------------------------------------------------------------------------------------
|E2E latency=3.31s |Gather latency=0.00s (0.00%) |Generate time=2.87s (86.86%) |Training time=0.26s (7.71%) |Others=0.18 (5.43%)|CurSamplesPerSec=0.30 |AvgSamplesPerSec=0.28
epoch: 0|step: 2|ppo_ep: 1|act_loss: -0.394287109375|cri_loss: -0.1787109375|unsuper_loss: 0.0
average reward score: 4.98046875
-------------------------------------------------------------------------------------
|E2E latency=3.30s |Gather latency=0.00s (0.00%) |Generate time=2.87s (87.01%) |Training time=0.25s (7.64%) |Others=0.18 (5.35%)|CurSamplesPerSec=0.30 |AvgSamplesPerSec=0.29
epoch: 0|step: 3|ppo_ep: 1|act_loss: -0.053619384765625|cri_loss: -0.0135498046875|unsuper_loss: 0.0
average reward score: 5.6015625
-------------------------------------------------------------------------------------
|E2E latency=3.29s |Gather latency=0.00s (0.00%) |Generate time=2.86s (86.82%) |Training time=0.25s (7.71%) |Others=0.18 (5.47%)|CurSamplesPerSec=0.30 |AvgSamplesPerSec=0.29
epoch: 0|step: 4|ppo_ep: 1|act_loss: 0.49560546875|cri_loss: 0.264404296875|unsuper_loss: 0.0
average reward score: 1.3955078125
-------------------------------------------------------------------------------------
|E2E latency=3.29s |Gather latency=0.00s (0.00%) |Generate time=2.86s (86.89%) |Training time=0.25s (7.71%) |Others=0.18 (5.40%)|CurSamplesPerSec=0.30 |AvgSamplesPerSec=0.29
epoch: 0|step: 5|ppo_ep: 1|act_loss: -0.26171875|cri_loss: -0.1119384765625|unsuper_loss: 0.0
average reward score: 4.09765625
-------------------------------------------------------------------------------------
|E2E latency=3.29s |Gather latency=0.00s (0.00%) |Generate time=2.86s (86.94%) |Training time=0.25s (7.67%) |Others=0.18 (5.38%)|CurSamplesPerSec=0.30 |AvgSamplesPerSec=0.29
epoch: 0|step: 6|ppo_ep: 1|act_loss: -0.13427734375|cri_loss: -0.05322265625|unsuper_loss: 0.0
average reward score: 3.005859375
-------------------------------------------------------------------------------------
|E2E latency=3.29s |Gather latency=0.00s (0.00%) |Generate time=2.86s (86.86%) |Training time=0.26s (7.75%) |Others=0.18 (5.39%)|CurSamplesPerSec=0.30 |AvgSamplesPerSec=0.30
Traceback (most recent call last):
File "/home/workspaces/mohomin/git_temp/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 516, in <module>
main()
File "/home/workspaces/mohomin/git_temp/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 439, in main
actor_loss, critic_loss = trainer.train_rlhf(exp_data)
File "/home/workspaces/mohomin/git_temp/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 172, in train_rlhf
self.actor_model.backward(actor_loss)
File "/home/ados/anaconda3/envs/DeepSpeed/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ados/anaconda3/envs/DeepSpeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1842, in backward
self.allreduce_gradients()
File "/home/ados/anaconda3/envs/DeepSpeed/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ados/anaconda3/envs/DeepSpeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1772, in allreduce_gradients
self.buffered_allreduce_fallback(elements_per_buffer=bucket_size)
File "/home/ados/anaconda3/envs/DeepSpeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2273, in buffered_allreduce_fallback
non_expert_grads, expert_grads = self._get_gradients_for_reduction()
File "/home/ados/anaconda3/envs/DeepSpeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2229, in _get_gradients_for_reduction
param.grad = torch.zeros(param.size(), dtype=param.dtype, device=param.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 47.54 GiB total capacity; 45.81 GiB already allocated; 85.75 MiB free; 45.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-17 16:44:36,269] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 5403
[2023-04-17 16:44:36,270] [ERROR] [launch.py:434:sigkill_handler] ['/home/ados/anaconda3/envs/DeepSpeed/bin/python', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--actor_model_name_or_path', '../step1_supervised_finetuning/output', '--critic_model_name_or_path', '../step2_reward_model_finetuning/output', '--num_padding_at_beginning', '1', '--per_device_train_batch_size', '1', '--per_device_mini_train_batch_size', '1', '--generation_batch_numbers', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '128', '--max_prompt_seq_len', '128', '--ppo_epochs', '1', '--actor_learning_rate', '5e-4', '--critic_learning_rate', '5e-6', '--actor_weight_decay', '0.1', '--critic_weight_decay', '0.1', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--gradient_accumulation_steps', '8', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--actor_zero_stage', '0', '--critic_zero_stage', '0', '--actor_lora_dim', '128', '--actor_gradient_checkpointing', '--critic_gradient_checkpointing', '--enable_hybrid_engine', '--output_dir', './output'] exits with return code = 1
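The allocator message at the end suggests setting max_split_size_mb to avoid fragmentation. For reference, that would look like the following (a sketch only, with 128 as an example value; it generally only helps when the failure comes from fragmentation rather than raw capacity, and here almost the whole 47.5 GiB is genuinely allocated):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
bash run_6.7b_lora.sh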
Environments
- deepspeed-0.9.1+cc67f22f
- CUDA 11.7
- torch 2.0.0
- python 3.9.16
I also tried with --only_optimize_lora but got the same error.
Is there any way to run the 6.7b_lora model on a single 48G GPU?
Thank you for any help.
I used 8x V100 32G, which worked for the 6.7b_lora config, but it still failed for the 6.7b config in step 3. It seems a node of 8x A100 40G is the minimum requirement; otherwise only the 1.3b config actually works.
@M1n9X Thank you for your answer.
But I still wonder why run_6.7b_lora.sh is in the single_gpu training scripts. If its real minimum requirement turns out to be a single node rather than a single GPU, it would be better to move it.
Please try adding --offload_reference_model to the command line.
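For example, appended to the deepspeed invocation in run_6.7b_lora.sh, leaving everything else unchanged (just a sketch of where the flag goes):

deepspeed --num_gpus 1 main.py \
   ... (all existing arguments as above) ... \
   --offload_reference_model \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log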
@tjruwase I tried with --offload_reference_model but got the same error.