Why is multi-node GRPO training slower than single-node training?
Hello, I trained GRPO with nnodes=1. The training config is as follows:
...
data.train_batch_size=256
actor_rollout_ref.actor.ppo_mini_batch_size=128
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8
...
The log is:
Training Progress: 0%| | 1/540 [09:09<82:13:02, 549.13s/it]
Training Progress: 0%| | 2/540 [17:52<79:50:22, 534.24s/it]
Then I trained GRPO with nnodes=2, using the same training config as the single-node run:
...
data.train_batch_size=256
actor_rollout_ref.actor.ppo_mini_batch_size=128
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8
...
The log is:
Training Progress: 0%| | 1/540 [11:50<106:20:42, 710.28s/it]
The training nodes communicate over an InfiniBand (IB) network with RDMA support. Why is the time per iteration with 2 nodes longer than with 1 node? I expected 2 nodes to take about half the time of 1 node, but the estimated total time is also longer than with one node.
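To make the gap concrete, here is a small arithmetic sketch using only the first-step times from the two logs above. The "ideal" figure assumes perfect linear scaling (half the step time on twice the nodes), which is the expectation stated in the question, not something any framework guarantees. The variable names are just for illustration.

```python
# Step times (seconds per iteration) copied from the two progress logs above.
one_node_s_per_it = 549.13   # nnodes=1, first step
two_node_s_per_it = 710.28   # nnodes=2, first step

# Assumption: under ideal linear scaling, doubling the nodes halves the step time.
ideal_two_node_s_per_it = one_node_s_per_it / 2

# Observed slowdown of the 2-node run relative to the 1-node run.
slowdown_vs_one_node = two_node_s_per_it / one_node_s_per_it

# How far the observed 2-node time is from the ideal 2-node time.
gap_vs_ideal = two_node_s_per_it / ideal_two_node_s_per_it

print(f"ideal 2-node step time: {ideal_two_node_s_per_it:.1f} s")
print(f"slowdown vs 1 node:     {slowdown_vs_one_node:.2f}x")
print(f"gap vs ideal scaling:   {gap_vs_ideal:.2f}x")
```

Note that this compares a single step from each run, so it is only a rough indication; later steps may differ.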
Same problem. Have you found any solutions?
I don't understand this behavior. I think most of the time is spent in rollout, so 2 nodes should be faster than 1 node.
Same problem. Please help.
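One way to check the rollout hypothesis would be to time each phase of a step separately on both setups and compare the shares. The following is a generic Python sketch, not part of the training framework's API: the helper `timed`, the function `phase_shares`, and the phase names "rollout" and "actor_update" are hypothetical, and the `time.sleep` calls are stand-ins for the real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase name (hypothetical names).
phase_seconds = defaultdict(float)

@contextmanager
def timed(phase):
    """Add the wall-clock time spent inside the block to phase_seconds[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start

def phase_shares(totals):
    """Return each phase's fraction of the total measured time."""
    total = sum(totals.values())
    if total <= 0:
        return {}
    return {name: secs / total for name, secs in totals.items()}

# Stand-ins for the real phases of one training step (assumption).
with timed("rollout"):
    time.sleep(0.02)
with timed("actor_update"):
    time.sleep(0.01)

shares = phase_shares(phase_seconds)
for name, share in shares.items():
    print(f"{name}: {phase_seconds[name]:.3f} s ({share:.0%} of measured time)")
```

If the rollout share stays roughly the same on 2 nodes while another phase grows, the slowdown is probably not coming from rollout.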