Step 3 uses the same memory when I increase the number of GPUs
When I use 4 * A100 80G to run step 3 with llama2-7b (actor_model) and tiny-llama-1.1B (ref_model), it uses 53848MB of memory during generation and 79610MB during training. When I use 8 * A100 80G, it uses 55834MB during generation and 78216MB during training, i.e. almost the same memory, and increasing to 16 * A100 80G gives the same result. Is using more GPUs useless?
ds config:
torchrun --nnodes ${tmp_nodes} --nproc_per_node ${tmp_nproc_per_node} \
    --master_addr ${tmp_master_addr} --node_rank ${tmp_node_rank} \
    --master_port ${tmp_master_port} ${PROJECT_PATH}/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py \
    --data_path ${PROJECT_PATH}/applications/DeepSpeed-Chat/data/Dahoas/rm-static \
    --data_split 2,4,4 \
    --actor_model_name_or_path $ACTOR_MODEL_PATH \
    --critic_model_name_or_path $CRITIC_MODEL_PATH \
    --num_padding_at_beginning 1 \
    --per_device_generation_batch_size 1 \
    --per_device_training_batch_size 1 \
    --generation_batches 1 \
    --ppo_epochs 1 \
    --max_answer_seq_len 2000 \
    --max_prompt_seq_len 16000 \
    --actor_learning_rate ${Actor_Lr} \
    --critic_learning_rate ${Critic_Lr} \
    --actor_weight_decay 0.1 \
    --critic_weight_decay 0.1 \
    --num_train_epochs 2 \
    --lr_scheduler_type cosine \
    --gradient_accumulation_steps 1 \
    --actor_gradient_checkpointing \
    --critic_gradient_checkpointing \
    --disable_actor_dropout \
    --num_warmup_steps 10 \
    --deepspeed --seed 1234 \
    --dtype bf16 \
    --offload \
    --offload_reference_model \
    --actor_zero_stage $ACTOR_ZERO_STAGE \
    --critic_zero_stage $CRITIC_ZERO_STAGE \
    --enable_hybrid_engine \
    --output_dir $OUTPUT \
    --kl_ctl 0.1 | tee $tmp_log_file 2>&1
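For context, the `per_device_*` flags above set the *per-GPU* batch size, so the effective global batch size implied by this command grows with the world size rather than staying fixed. A back-of-envelope sketch of that arithmetic (my own illustration, not output from DeepSpeed-Chat), using the values from the command and the 4/8/16-GPU setups from the question:

```python
# Effective global batch size implied by the flags above (rule-of-thumb arithmetic).
per_device_training_batch_size = 1   # --per_device_training_batch_size
gradient_accumulation_steps = 1      # --gradient_accumulation_steps

for world_size in (4, 8, 16):        # 4 / 8 / 16 x A100 80G
    global_batch = per_device_training_batch_size * world_size * gradient_accumulation_steps
    print(f"{world_size:>2} GPUs -> effective global batch size per optimizer step: {global_batch}")
```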
@Little-rookie-ee This is in line with the design. Since you are presumably using DDP rather than model parallelism, ZeRO-2 only partitions the optimizer states and gradients, and the batch size is set through the `per_device_xxx` arguments, so the global batch size automatically grows with the number of GPUs. That is why the per-GPU memory looks the same as you add GPUs; what you gain is a shorter overall training time.
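To make the memory side of that answer concrete, here is a minimal back-of-envelope sketch (an estimate I am adding, not measured numbers): under ZeRO-2 with bf16 the parameters are replicated on every rank, only the gradient and optimizer-state shards shrink with the world size, and with `--offload` the optimizer states live on the CPU anyway. Activations, the hybrid-engine generation buffers, and allocator fragmentation are ignored.

```python
# Rough per-GPU memory for a bf16 model under ZeRO stage 2 (rule-of-thumb byte counts:
# 2 bytes/param for bf16 weights and grads, ~12 bytes/param for Adam states, i.e.
# fp32 master weights + two moments). This is an illustration, not a profiler readout.
GB = 1024 ** 3

def zero2_per_gpu_gb(num_params: float, world_size: int, offload_optimizer: bool) -> float:
    params_bf16 = 2 * num_params                    # replicated on every rank under ZeRO-2
    grad_shard = 2 * num_params / world_size        # gradients are partitioned
    optim_shard = 0.0 if offload_optimizer else 12 * num_params / world_size  # on CPU with --offload
    return (params_bf16 + grad_shard + optim_shard) / GB

for n in (4, 8, 16):
    est = zero2_per_gpu_gb(7e9, n, offload_optimizer=True)   # llama2-7b actor
    print(f"{n:>2} GPUs: ~{est:.1f} GB for weights + gradient shard")
```

The ~13 GB of replicated bf16 weights dominate that estimate, which is why the per-GPU numbers barely move from 4 to 16 GPUs; the extra GPUs buy a larger global batch and fewer optimizer steps per epoch (shorter wall-clock time), and only parameter partitioning (ZeRO-3) would actually reduce the per-GPU weight footprint.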