
Reproduction failure: OOM when running opt-13b stage3 RLHF on 8*A100 40G

leo5856 opened this issue 1 year ago · 2 comments

  • I use offload, gradient_checkpointing, and zero_stage 3, and still get an OOM.
  • I tested it on 8*A100 80G and saw about 55 GB of GPU memory consumption via "nvidia-smi" (see the monitoring snippet after the script).
  • My script:
deepspeed --master_port 12346 main.py \
    --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise \
        yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP \
    --data_split 2,4,4 \
    --actor_model_name_or_path facebook/opt-13b \
    --critic_model_name_or_path facebook/opt-350m \
    --num_padding_at_beginning 1 \
    --per_device_train_batch_size 1 \
    --per_device_mini_train_batch_size 1 \
    --generation_batch_numbers 1 \
    --ppo_epochs 1 \
    --max_answer_seq_len 256 \
    --max_prompt_seq_len 256 \
    --actor_learning_rate ${Actor_Lr} \
    --critic_learning_rate ${Critic_Lr} \
    --actor_weight_decay 0.1 \
    --critic_weight_decay 0.1 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --gradient_accumulation_steps 1 \
    --num_warmup_steps 100 \
    --deepspeed --seed 1234 \
    --enable_hybrid_engine \
    --actor_zero_stage 3 \
    --critic_zero_stage 3 \
    --output_dir $OUTPUT \
    --actor_gradient_checkpointing \
    --critic_gradient_checkpointing \
    --offload_reference_model
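
For reference, a minimal way to watch per-GPU memory while the run is in flight (assuming nvidia-smi is on PATH; the query fields below are standard nvidia-smi options):

    # Refresh per-GPU memory usage every second during training
    watch -n 1 "nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv"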

Could someone teach me how to run this on 8*A100 40G?

leo5856 · Apr 20 '23

I encountered the same problem.

MAJIN123 · Apr 21 '23

Do you encounter the OOM during the generation phase? If so, you can add --inference_tp_size 4 to avoid it. Please take a look at https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning#-how-to-train-rlhf for details.
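
For example, the relevant portion of the launch script above would become the following (a sketch, not a verified configuration; every other flag stays exactly as in the original command, and the tensor-parallel degree of 4 is chosen so it evenly divides the 8 GPUs):

    --deepspeed --seed 1234 \
    --enable_hybrid_engine \
    --inference_tp_size 4 \
    --actor_zero_stage 3 \
    --critic_zero_stage 3 \

The flag controls the tensor-parallelism degree the hybrid engine uses for the actor during the generation phase, which is where this kind of OOM tends to appear.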

yaozhewei · Apr 24 '23

Closed due to no follow-up.

yaozhewei · May 05 '23