DeepSpeed
[REQUEST] I managed to make DeepSpeed Chat work on Google Colab running on A100/80GB
There are three steps in the entire pipeline, but the help messages when things go wrong can be hard to interpret.
For every step, the log files suggest using --gradient_checkpointing to resolve the out-of-memory error, but they do not say explicitly which scripts to change.
Also, that option does not apply to the last step: after reading the log file, I had to use "--actor_gradient_checkpointing --critic_gradient_checkpointing" instead.
Below are the changes I had to make to get it working on a single GPU (A100, 40 GB).
diff --git applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh
index 8d2865c..3cb36cd 100644
--- applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh
+++ applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh
@@ -16,5 +16,5 @@ fi
mkdir -p $OUTPUT
deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-1.3b \
- --gradient_accumulation_steps 8 --lora_dim 128 --zero_stage $ZERO_STAGE \
+ --gradient_accumulation_steps 8 --gradient_checkpointing --lora_dim 128 --zero_stage $ZERO_STAGE \
--deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log
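For reference, here is a minimal sketch of how the patched step 1 script can be launched, assuming it keeps the usual convention of taking the output directory and ZeRO stage as optional positional arguments (defaults are set near the top of the script):
# Run from applications/DeepSpeed-Chat/training/step1_supervised_finetuning/
bash training_scripts/single_gpu/run_1.3b.sh ./output 0
# The script redirects all of its output, so follow progress in the log it writes:
tail -f ./output/training.log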
diff --git applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh
index 435de2c..35ea226 100644
--- applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh
+++ applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh
@@ -14,5 +14,5 @@ fi
mkdir -p $OUTPUT
deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-350m \
- --num_padding_at_beginning 1 --weight_decay 0.1 --disable_dropout --gradient_accumulation_steps 4 --zero_stage $ZERO_STAGE \
+ --num_padding_at_beginning 1 --weight_decay 0.1 --disable_dropout --gradient_checkpointing --gradient_accumulation_steps 4 --zero_stage $ZERO_STAGE \
--deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log
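Step 2 can be launched the same way; the sketch below assumes the same positional-argument convention and is run from the step2_reward_model_finetuning directory:
bash training_scripts/single_gpu/run_350m.sh ./output 0
# Check the log the script writes for any remaining OOM messages:
grep -i "out of memory" ./output/training.log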
diff --git applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_gpu/run_1.3b.sh applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_gpu/run_1.3b.sh
index b33e3ad..d061e6a 100644
--- applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_gpu/run_1.3b.sh
+++ applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_gpu/run_1.3b.sh
@@ -22,6 +22,6 @@ mkdir -p $OUTPUT
deepspeed main.py \
--actor_model_name_or_path $ACTOR_MODEL_PATH --critic_model_name_or_path $CRITIC_MODEL_PATH \
--actor_zero_stage $ACTOR_ZERO_STAGE --critic_zero_stage $CRITIC_ZERO_STAGE \
- --num_padding_at_beginning 1 --gradient_accumulation_steps 2 \
+ --num_padding_at_beginning 1 --gradient_accumulation_steps 2 --actor_gradient_checkpointing --critic_gradient_checkpointing \
--deepspeed --actor_lora_dim 128 --enable_hybrid_engine --actor_gradient_checkpointing --disable_actor_dropout \
--output_dir $OUTPUT &> $OUTPUT/training.log
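Step 3 additionally needs the checkpoints produced by steps 1 and 2, which the script passes to main.py as $ACTOR_MODEL_PATH and $CRITIC_MODEL_PATH. A sketch, assuming those variables are filled from the script's first two positional arguments (check the assignments at the top of the file); the paths below are placeholders for wherever your step 1 and step 2 outputs were written:
# Run from applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/
ACTOR=../step1_supervised_finetuning/output     # placeholder path
CRITIC=../step2_reward_model_finetuning/output  # placeholder path
bash training_scripts/single_gpu/run_1.3b.sh "$ACTOR" "$CRITIC"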
The command lines I used can be found at https://gist.github.com/chunhualiao/0dec705a10814b3603f20bd6e4fe5a62
Thank you @chunhualiao, we’re working on adding an option for automatic feature selection that should help here. We realize it’s not always easy to determine which arguments should be used for various environment and model configs. Stay tuned! :)
Thanks for sharing. I trained the model on an RTX 3090 (24 GB) using the single_gpu scripts; steps 1 and 2 are the same as yours. In step 3 I added two parameters, "--per_device_train_batch_size 4 --per_device_mini_train_batch_size 4", but the resulting model does not perform well.
I will try your "--actor_gradient_checkpointing --critic_gradient_checkpointing" parameters in step 3. Thanks again.
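Combining both suggestions, the step 3 main.py invocation would look roughly like this (a sketch built only from the flags discussed in this thread; the batch-size values are the ones mentioned above, not tuned recommendations):
deepspeed main.py \
   --actor_model_name_or_path $ACTOR_MODEL_PATH --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --actor_zero_stage $ACTOR_ZERO_STAGE --critic_zero_stage $CRITIC_ZERO_STAGE \
   --num_padding_at_beginning 1 --gradient_accumulation_steps 2 \
   --per_device_train_batch_size 4 --per_device_mini_train_batch_size 4 \
   --actor_gradient_checkpointing --critic_gradient_checkpointing \
   --deepspeed --actor_lora_dim 128 --enable_hybrid_engine --disable_actor_dropout \
   --output_dir $OUTPUT &> $OUTPUT/training.log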