Usage of gradient_accumulation_steps in GRPO
I'm training on eight 64GB NPUs with num_generations=8. I compared setting A: per_device_train_batch_size=8 with gradient_accumulation_steps=1, against setting B: per_device_train_batch_size=4 with gradient_accumulation_steps=2, and found no significant difference in memory usage. Shouldn't setting B generally use significantly less memory than setting A?
For the training, yes, but not for the generation. The generation is done once on the full effective batch.
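A minimal sketch of why the two settings behave the same at generation time, assuming the effective batch is computed as the product of device count, per-device batch size, and accumulation steps (the helper function below is hypothetical, for illustration only):

```python
def effective_batch(num_devices: int,
                    per_device_train_batch_size: int,
                    gradient_accumulation_steps: int) -> int:
    # Illustrative assumption: generation runs once over the full
    # effective batch, i.e. devices x per-device batch x accumulation steps.
    return num_devices * per_device_train_batch_size * gradient_accumulation_steps

# Setting A: per_device_train_batch_size=8, gradient_accumulation_steps=1
a = effective_batch(8, 8, 1)  # -> 64 prompts generated at once
# Setting B: per_device_train_batch_size=4, gradient_accumulation_steps=2
b = effective_batch(8, 4, 2)  # -> 64 prompts generated at once
assert a == b  # same generation workload, hence similar peak memory
```

Under this assumption, both settings generate for 64 prompts in one pass, so the generation-side memory footprint is essentially identical; only the training-side micro-batch differs.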
Thank you. I noticed that, following the improvement in #3283, gradient_accumulation_steps takes effect as steps_per_generations. Is that the reason this happens during GRPO training?
Without the traceback and the system info, it's hard to say for sure.