verl

Why is multi-node GRPO training slower than single-node?

Open wy20907104 opened this issue 5 months ago • 3 comments

Hello, I am training GRPO with nnodes=1. The training config is as follows:
...
data.train_batch_size=256
actor_rollout_ref.actor.ppo_mini_batch_size=128
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8
...
The log is:
Training Progress: 0%| | 1/540 [09:09<82:13:02, 549.13s/it]
Training Progress: 0%| | 2/540 [17:52<79:50:22, 534.24s/it]

Then I trained GRPO with nnodes=2, using the same training config as for one node:
...
data.train_batch_size=256
actor_rollout_ref.actor.ppo_mini_batch_size=128
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8
...
The log is:
Training Progress: 0%| | 1/540 [11:50<106:20:42, 710.28s/it]

The training nodes communicate over an InfiniBand (IB) network with RDMA support. Why is the time per iteration on 2 nodes longer than on one node? I expected 2 nodes to take about half the time per iteration of one node. The estimated total training time is also longer than on one node.
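To put numbers on the gap, here is a small sketch that uses only the per-iteration times from the two logs above. Note that the 2-node run only shows its first iteration, so this compares first iterations, which may include warm-up cost. The variable names are just for illustration.

```python
# Per-iteration times copied from the progress-bar logs above.
one_node_s_per_it = 549.13   # first iteration with nnodes=1
two_node_s_per_it = 710.28   # first iteration with nnodes=2

# How much longer one iteration takes on 2 nodes than on 1 node.
slowdown = two_node_s_per_it / one_node_s_per_it

# Ideal case assumed in the question: 2 nodes take half the time.
ideal_two_node = one_node_s_per_it / 2

# How far the observed 2-node time is from that ideal.
gap_vs_ideal = two_node_s_per_it / ideal_two_node

print(f"2 nodes take {slowdown:.2f}x the time per iteration of 1 node")
print(f"ideal 2-node time: {ideal_two_node:.1f} s/it")
print(f"observed 2-node time is {gap_vs_ideal:.2f}x the ideal")
```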

wy20907104, Jul 16 '25 07:07

Same problem. Have you found a solution?

Lan13, Jul 21 '25 02:07

I still do not understand this behavior. I think most of the time is spent in rollout, so 2 nodes should be faster than 1 node.
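One way to sanity-check that expectation is an Amdahl-style estimate. This is only a rough sketch: the fraction of a step that scales with node count is not known from the logs, so the values of `scalable_fraction` below are hypothetical, and the function names are made up for illustration. It shows how much extra per-step cost would be needed to explain the observed 2-node time under each assumption.

```python
def expected_step_time(t_one_node, scalable_fraction, n_nodes, extra_overhead_s=0.0):
    """Amdahl-style estimate: only the scalable part shrinks with more nodes."""
    serial = t_one_node * (1.0 - scalable_fraction)
    parallel = t_one_node * scalable_fraction / n_nodes
    return serial + parallel + extra_overhead_s


def implied_overhead(t_one_node, t_multi_node, scalable_fraction, n_nodes):
    """Extra seconds per step needed to explain the observed multi-node time."""
    return t_multi_node - expected_step_time(t_one_node, scalable_fraction, n_nodes)


# Times from the logs in the issue; the fractions are hypothetical assumptions.
T_ONE_NODE = 549.13
T_TWO_NODES = 710.28

for frac in (0.5, 0.8, 1.0):
    overhead = implied_overhead(T_ONE_NODE, T_TWO_NODES, frac, n_nodes=2)
    print(f"scalable fraction {frac:.1f}: implied extra cost {overhead:.1f} s per step")
```

Whatever fraction is assumed, the implied extra cost is positive, so something in the 2-node run adds time per step rather than merely failing to scale.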

wy20907104, Jul 22 '25 08:07

Same problem. Please help.

yjch00, Dec 04 '25 05:12