
policy.train slow at >=32 nodes b/c workers start at different times

Open guyueh1 opened this issue 3 months ago • 2 comments

When the number of nodes is >=32 and policy.train is called, some ranks take a long time to perform the initial synchronization (the AllReduce NCCL kernel), while on other ranks the synchronization is fast. This indicates that different ranks start their jobs at different times (the gap is ~800 ms), and we suspect the cause is Ray job submission overhead.
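Not part of the issue, but here is a minimal sketch of how the start-time skew could be measured: each rank records a wall-clock timestamp just before its first collective, and rank 0 reports the spread. It assumes torch.distributed is already initialized (as it would be inside policy.train); the helper name and placement are illustrative only, and cross-node wall clocks are only approximately synchronized.

```python
import time
import torch.distributed as dist

def log_start_skew():
    """Report how far apart ranks arrive at the first collective."""
    arrival = time.time()  # wall-clock time when this rank is ready to sync
    arrivals = [None] * dist.get_world_size()
    # all_gather_object is itself a collective, so it also blocks on the slowest rank
    dist.all_gather_object(arrivals, arrival)
    if dist.get_rank() == 0:
        skew_ms = (max(arrivals) - min(arrivals)) * 1000
        print(f"rank start skew: {skew_ms:.0f} ms")  # ~800 ms observed in this issue
```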

A potential solution is to use Ray Compiled Graphs to reduce the job submission overhead.
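For reference, a rough sketch of the Compiled Graphs approach using Ray's `ray.dag` API (`InputNode`, `MultiOutputNode`, `experimental_compile`); the `TrainWorker` actor and its `train_step` method are placeholders, not the actual classes in this repo.

```python
import ray
from ray.dag import InputNode, MultiOutputNode

@ray.remote(num_gpus=1)
class TrainWorker:
    def train_step(self, batch):
        # placeholder for the real per-rank training step
        return {"loss": 0.0}

workers = [TrainWorker.remote() for _ in range(4)]

# Build the DAG once; compiling it moves scheduling off the per-call
# task-submission path, which is the overhead suspected above.
with InputNode() as batch:
    dag = MultiOutputNode([w.train_step.bind(batch) for w in workers])
compiled_dag = dag.experimental_compile()

for step in range(10):
    # execute() reuses the compiled graph instead of submitting fresh tasks
    results = ray.get(compiled_dag.execute({"step": step}))
```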

guyueh1 avatar Sep 02 '25 15:09 guyueh1

@katec846 please update with the latest status

euronymous-aithal avatar Oct 31 '25 07:10 euronymous-aithal

@euronymous-aithal I've implemented Ray compiled graphs and tested them on the SFT algorithm. The original overhead was ~4 s for 32 nodes (seqlen 48k, TP4, CP4, Qwen2.5-14B). With the compiled graph, the overhead dropped to <1 s. However, the step time becomes extremely high after 9-10 steps: it went from 19 -> 22 -> 49 -> 73 s, while the computation time stays the same. The overhead comes from the Python side. Still investigating the root cause of this issue.
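One way to confirm that the growing step time is Python/driver-side rather than kernel time is to compare wall time against CUDA event time for each step. This is only an illustrative sketch, not code from the repo; `step_fn` is a placeholder for the real per-step callable.

```python
import time
import torch

def timed_step(step_fn, *args):
    """Run one step and report wall time vs. GPU kernel time.

    If wall time grows step over step while GPU time stays flat
    (as reported above: 19 -> 22 -> 49 -> 73), the extra cost is
    accumulating on the Python/driver side.
    """
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    wall_start = time.perf_counter()
    start_evt.record()
    out = step_fn(*args)
    end_evt.record()
    torch.cuda.synchronize()
    wall = time.perf_counter() - wall_start
    gpu = start_evt.elapsed_time(end_evt) / 1000  # ms -> s
    print(f"wall {wall:.1f}s  gpu {gpu:.1f}s  python-side ~{wall - gpu:.1f}s")
    return out
```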

katec846 avatar Oct 31 '25 22:10 katec846

Renaming this issue to clarify that the purpose is to minimize Ray-related overhead in GRPO, including

  • returning samples to driver after generation
  • dispatching training functions
  • sending samples to train workers as arguments (see the sketch below for avoiding the driver round-trip on the first and last items)

A similar issue focused on SFT is tracked in a subissue.
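A rough sketch of the pattern in question, not the repo's actual API (the `GenerationWorker`/`TrainWorker` actors are hypothetical): passing the `ObjectRef` from generation directly to the train worker lets the samples move worker-to-worker through Ray's object store instead of being materialized on the driver and re-shipped.

```python
import ray

@ray.remote(num_gpus=1)
class GenerationWorker:
    def generate(self, prompts):
        # placeholder: returns rollout samples for GRPO
        return [{"prompt": p, "response": "..."} for p in prompts]

@ray.remote(num_gpus=1)
class TrainWorker:
    def train_on(self, samples):
        # placeholder for the real training step
        return {"loss": 0.0}

gen = GenerationWorker.remote()
train = TrainWorker.remote()

# Overhead-heavy pattern: samples are pulled to the driver, then re-shipped.
samples = ray.get(gen.generate.remote(["p0", "p1"]))
ray.get(train.train_on.remote(samples))

# Lighter pattern: pass the ObjectRef directly; Ray dereferences it in the
# train worker and the data never touches the driver.
samples_ref = gen.generate.remote(["p0", "p1"])
ray.get(train.train_on.remote(samples_ref))
```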

guyueh1 avatar Dec 10 '25 21:12 guyueh1