DeepSpeedExamples
Step 3: OOM
Steps 1 and 2 run normally, but when running step 3 I hit an OOM (out of memory) error again. Even with the batch size set to 1 it still fails. Does anyone know what's going on?
4 * v100-40G
Num_Padding_at_Beginning=1 # this is model related
Actor_Lr=5e-4 Critic_Lr=5e-6
deepspeed --master_port 12346 main.py \
   --data_path Hello-SimpleAI/HC3-Chinese \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_mini_train_batch_size 1 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 128 \
   --max_prompt_seq_len 128 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 2 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --critic_gradient_checkpointing \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
Hey @MAJIN123, what are the actor / critic model architectures?
Hi @EikeKohl, actor model: LLaMA 7B, critic model: facebook/opt-350m.
Try ZeRO stage 2 or 3.
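For reference, in the command above the ZeRO stage is controlled by the two `*_zero_stage` flags via shell variables. A minimal sketch of switching both models to stage 3 (variable names taken from the command above) would be:

```shell
# Sketch: set both models to ZeRO stage 3 so that optimizer state,
# gradients, and parameters are sharded across the 4 GPUs instead of
# being replicated on each one. These variables feed the
# --actor_zero_stage / --critic_zero_stage flags in the command above.
ACTOR_ZERO_STAGE=3
CRITIC_ZERO_STAGE=3
```

Note also that, if I remember correctly, DeepSpeed-Chat requires the actor to use ZeRO stage 3 when `--enable_hybrid_engine` is combined with `--inference_tp_size` greater than 1, so stage 3 may be needed here regardless; worth double-checking against the script's argument validation.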
@AltenLi Still no luck, bro, still running out of GPU memory. Very strange.
Hi, you can try to offload the reference model. Please take a look at the
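For anyone else hitting this: the step 3 `main.py` exposes an `--offload_reference_model` flag (flag name as I recall it from the repo; please verify against the argument parser) that moves the frozen reference model to CPU memory. A hedged sketch of adding it to the launch command:

```shell
# Sketch: append reference-model offloading to the existing command.
# --offload_reference_model keeps the frozen reference model on CPU,
# freeing GPU memory for the actor/critic; flag name assumed from the
# DeepSpeed-Chat step 3 argument parser.
deepspeed --master_port 12346 main.py \
   --offload_reference_model \
   --output_dir $OUTPUT
   # (keep all the other flags from the full command above)
```

Since the reference model is only used for forward passes to compute KL penalties, offloading it trades some PCIe transfer time for a sizable chunk of GPU memory.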
Thanks, bro @yaozhewei 😯
@MAJIN123 Hi, I also ran into OOM when running step 3 on V100s. How did you solve it in the end? I've already turned every tunable setting down to the minimum as well.