DeepSpeedExamples

step3 uses the same memory when I increase GPUs

Little-rookie-ee opened this issue · 1 comment

When I use 4 * A100 80G to run step3 with llama2-7b (actor_model) and tiny-llama-1.1B (ref_model), it uses 53848MB of memory during generation and 79610MB during training. When I use 8 * A100 80G, it uses 55834MB during generation and 78216MB during training. The memory usage is almost the same, and increasing to 16 * A100 80G gives the same result. Is using more GPUs useless?

ds config:

torchrun --nnodes ${tmp_nodes} --nproc_per_node ${tmp_nproc_per_node} \
    --master_addr ${tmp_master_addr} --node_rank ${tmp_node_rank} \
    --master_port ${tmp_master_port} ${PROJECT_PATH}/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py \
    --data_path ${PROJECT_PATH}/applications/DeepSpeed-Chat/data/Dahoas/rm-static \
    --data_split 2,4,4 \
    --actor_model_name_or_path $ACTOR_MODEL_PATH \
    --critic_model_name_or_path $CRITIC_MODEL_PATH \
    --num_padding_at_beginning 1 \
    --per_device_generation_batch_size 1 \
    --per_device_training_batch_size 1 \
    --generation_batches 1 \
    --ppo_epochs 1 \
    --max_answer_seq_len 2000 \
    --max_prompt_seq_len 16000 \
    --actor_learning_rate ${Actor_Lr} \
    --critic_learning_rate ${Critic_Lr} \
    --actor_weight_decay 0.1 \
    --critic_weight_decay 0.1 \
    --num_train_epochs 2 \
    --lr_scheduler_type cosine \
    --gradient_accumulation_steps 1 \
    --actor_gradient_checkpointing \
    --critic_gradient_checkpointing \
    --disable_actor_dropout \
    --num_warmup_steps 10 \
    --deepspeed --seed 1234 \
    --dtype bf16 \
    --offload \
    --offload_reference_model \
    --actor_zero_stage $ACTOR_ZERO_STAGE \
    --critic_zero_stage $CRITIC_ZERO_STAGE \
    --enable_hybrid_engine \
    --output_dir $OUTPUT \
    --kl_ctl 0.1 2>&1 | tee $tmp_log_file
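
For intuition, here is a rough back-of-the-envelope sketch (not code from the repo; it assumes bf16 weights and gradients, fp32 Adam states, and ignores activations, the hybrid engine, and CPU offload) of how per-GPU model-state memory scales with GPU count under each ZeRO stage:

```python
# Back-of-the-envelope estimate of per-GPU *model-state* memory for a 7B model.
# Assumptions (illustrative, not taken from the repo): bf16 weights and gradients,
# fp32 Adam states (master weights + momentum + variance = 12 bytes/param),
# no activations, no hybrid engine, no CPU offload.
def model_state_gb(num_params, world_size, zero_stage):
    params = 2 * num_params      # bf16 weights, replicated unless ZeRO-3
    grads = 2 * num_params       # bf16 gradients
    optim = 12 * num_params      # fp32 master weights + Adam momentum/variance
    if zero_stage >= 1:
        optim /= world_size      # stage 1+: optimizer states partitioned across GPUs
    if zero_stage >= 2:
        grads /= world_size      # stage 2+: gradients partitioned as well
    if zero_stage >= 3:
        params /= world_size     # stage 3: parameters partitioned too
    return (params + grads + optim) / 1024**3

for gpus in (4, 8, 16):
    print(f"{gpus:>2} GPUs  ZeRO-2: {model_state_gb(7e9, gpus, 2):5.1f} GB   "
          f"ZeRO-3: {model_state_gb(7e9, gpus, 3):5.1f} GB")
```

Under ZeRO-2 the ~14 GB of bf16 actor weights stay replicated on every GPU (and with --offload the optimizer states largely live in CPU memory anyway), so adding GPUs barely moves the per-GPU footprint; most of the remainder is activation memory, which depends on the per-device batch size and sequence length, not on the GPU count.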

Little-rookie-ee · Nov 24 '23 09:11

@Little-rookie-ee This is in line with the design. Since you are presumably using DDP rather than model parallelism, ZeRO-2 only partitions the optimizer states and gradients, and the batch size is specified per device (the per_device_xxx flags), so the global batch size grows automatically as the number of GPUs increases. The per-GPU memory therefore looks the same when you add GPUs, but the overall training time goes down.
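
A minimal sketch of the arithmetic behind that last point (the variable names mirror the CLI flags but are illustrative, not the script's internals):

```python
# Effective (global) batch size per optimizer step under pure data parallelism.
# Per-GPU memory is governed by the per-device settings, which stay fixed,
# while throughput scales with the number of GPUs.
per_device_training_batch_size = 1
gradient_accumulation_steps = 1

for world_size in (4, 8, 16):
    global_batch = per_device_training_batch_size * gradient_accumulation_steps * world_size
    print(f"{world_size:>2} GPUs -> {global_batch} sequences per optimizer step")
```

So 16 GPUs process four times as many prompts per step as 4 GPUs do, which is where the saving in wall-clock time comes from, even though each GPU's memory footprint is unchanged.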

EeyoreLee · Dec 20 '23 05:12