DeepSpeedExamples
Step 3: OOM
Steps 1 and 2 run normally, but when running step 3 I hit an OOM (out of memory) error again. Even with the batch size set to 1 it still fails. Does anyone know what's going on?
4 * v100-40G
Num_Padding_at_Beginning=1 # this is model related
Actor_Lr=5e-4 Critic_Lr=5e-6
deepspeed --master_port 12346 main.py \
   --data_path Hello-SimpleAI/HC3-Chinese \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_mini_train_batch_size 1 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 128 \
   --max_prompt_seq_len 128 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 2 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --critic_gradient_checkpointing \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
Hey @MAJIN123, what are the actor / critic model architectures?
Hi @EikeKohl, actor model: LLaMA 7B, critic model: facebook/opt-350m.
Try ZeRO stage 2 or 3.
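For reference, in the command above the ZeRO stage is controlled by the two `*_zero_stage` flags via shell variables. A minimal sketch of switching both models to stage 3 (variable names taken from the command above) would be:

```shell
# Sketch: set both models to ZeRO stage 3 so that optimizer state,
# gradients, and parameters are sharded across the 4 GPUs instead of
# being replicated on each one. These variables feed the
# --actor_zero_stage / --critic_zero_stage flags in the command above.
ACTOR_ZERO_STAGE=3
CRITIC_ZERO_STAGE=3
```

Note also that, if I remember correctly, DeepSpeed-Chat requires the actor to use ZeRO stage 3 when `--enable_hybrid_engine` is combined with `--inference_tp_size` greater than 1, so stage 3 may be needed here regardless; worth double-checking against the script's argument validation.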
@AltenLi Still no luck, bro, still running out of GPU memory. Very strange.
Hi, you can try to offload the reference model. Please take a look at the
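For anyone else hitting this: the step 3 `main.py` exposes an `--offload_reference_model` flag (flag name as I recall it from the repo; please verify against the argument parser) that moves the frozen reference model to CPU memory. A hedged sketch of adding it to the launch command:

```shell
# Sketch: append reference-model offloading to the existing command.
# --offload_reference_model keeps the frozen reference model on CPU,
# freeing GPU memory for the actor/critic; flag name assumed from the
# DeepSpeed-Chat step 3 argument parser.
deepspeed --master_port 12346 main.py \
   --offload_reference_model \
   --output_dir $OUTPUT
   # (keep all the other flags from the full command above)
```

Since the reference model is only used for forward passes to compute KL penalties, offloading it trades some PCIe transfer time for a sizable chunk of GPU memory.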
Thanks, bro @yaozhewei 😯
@MAJIN123 Hi, I also ran into OOM when running step 3 on V100s. How did you solve it in the end? I've already turned every tunable setting down to the minimum as well.