Usage of gradient_accumulation_steps in GRPO
I'm training on eight 64GB NPUs with num_generations=8. I compared setting A: per_device_train_batch_size=8 with gradient_accumulation_steps=1, against setting B: per_device_train_batch_size=4 with gradient_accumulation_steps=2, and found no significant difference in memory usage. Shouldn't setting B generally use significantly less memory than setting A?
For the training, yes, but not for the generation. The generation is done once on the full effective batch.
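A minimal sketch of why the two settings behave the same at generation time, assuming the effective batch is computed as the product of device count, per-device batch size, and accumulation steps (the helper function below is hypothetical, for illustration only):

```python
def effective_batch(num_devices: int,
                    per_device_train_batch_size: int,
                    gradient_accumulation_steps: int) -> int:
    # Illustrative assumption: generation runs once over the full
    # effective batch, i.e. devices x per-device batch x accumulation steps.
    return num_devices * per_device_train_batch_size * gradient_accumulation_steps

# Setting A: per_device_train_batch_size=8, gradient_accumulation_steps=1
a = effective_batch(8, 8, 1)  # -> 64 prompts generated at once
# Setting B: per_device_train_batch_size=4, gradient_accumulation_steps=2
b = effective_batch(8, 4, 2)  # -> 64 prompts generated at once
assert a == b  # same generation workload, hence similar peak memory
```

Under this assumption, both settings generate for 64 prompts in one pass, so the generation-side memory footprint is essentially identical; only the training-side micro-batch differs.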
Thank you. I noticed that, following the improvement in #3283, gradient_accumulation_steps takes effect as steps_per_generations. Is that the reason this happens during GRPO training?
Without the traceback and the system info, it's hard to say for sure.