Reproduction failure: OOM running opt-13b stage3 RLHF on 8*A100 40G
- I use offload, gradient checkpointing, and ZeRO stage 3, and still hit OOM.
- I tested on 8*A100 80G and saw about 55 GB of GPU memory consumption via "nvidia-smi".
- My script:
deepspeed --master_port 12346 main.py \
--data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP \
--data_split 2,4,4 \
--actor_model_name_or_path facebook/opt-13b \
--critic_model_name_or_path facebook/opt-350m \
--num_padding_at_beginning 1 \
--per_device_train_batch_size 1 \
--per_device_mini_train_batch_size 1 \
--generation_batch_numbers 1 \
--ppo_epochs 1 \
--max_answer_seq_len 256 \
--max_prompt_seq_len 256 \
--actor_learning_rate ${Actor_Lr} \
--critic_learning_rate ${Critic_Lr} \
--actor_weight_decay 0.1 \
--critic_weight_decay 0.1 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 1 \
--num_warmup_steps 100 \
--deepspeed --seed 1234 \
--enable_hybrid_engine \
--actor_zero_stage 3 \
--critic_zero_stage 3 \
--output_dir $OUTPUT \
--actor_gradient_checkpointing \
--critic_gradient_checkpointing \
--offload_reference_model
Could someone show me how to run this on 8*A100 40G?
I encountered the same problem as well.
Do you encounter the OOM during the generation phase? Please take a look at https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning#-how-to-train-rlhf. If so, you can add --inference_tp_size 4 to avoid it.
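For example, here is a minimal sketch of the launch command above with that flag added. It is abbreviated: the data, batch-size, and optimizer arguments from the original script would also be passed unchanged; --inference_tp_size 4 is the value suggested above and splits the actor across 4 of the 8 GPUs during the hybrid engine's generation phase.

deepspeed --master_port 12346 main.py \
   --actor_model_name_or_path facebook/opt-13b \
   --critic_model_name_or_path facebook/opt-350m \
   --actor_zero_stage 3 \
   --critic_zero_stage 3 \
   --actor_gradient_checkpointing \
   --critic_gradient_checkpointing \
   --offload_reference_model \
   --enable_hybrid_engine \
   --inference_tp_size 4 \
   --output_dir $OUTPUT
# plus the remaining --data_path / batch-size / learning-rate arguments
# exactly as in the original command

Note that --inference_tp_size only takes effect together with --enable_hybrid_engine, which the original command already sets.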
Closed due to no follow-up.