tuning_playbook
Do all of the benefits of using a larger batch size assume that training throughput increases?
- All benefits of using a larger batch size assume the training throughput increases. If it doesn't, fix the bottleneck or use the smaller batch size.
- Gradient accumulation simulates a larger batch size than the hardware can support and therefore does not provide any throughput benefits. It should generally be avoided in applied work (see the sketch after this list).
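For reference, here is a minimal sketch of what gradient accumulation looks like in a training loop: gradients from several micro-batches are summed before a single optimizer step, so the update matches a larger effective batch size, but each micro-batch still runs through the hardware sequentially, which is why throughput does not improve. The model, optimizer, data, and step counts below are placeholders for illustration, not anything prescribed by the playbook.

```python
import torch

# Hypothetical stand-ins: a tiny model, SGD, and fake micro-batches.
accumulation_steps = 4  # effective batch size = micro-batch size * 4
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(16)]

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    loss = loss_fn(model(inputs), labels)
    # Scale the loss so the accumulated gradient averages over the
    # effective batch rather than summing micro-batch averages.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per effective batch
        optimizer.zero_grad()  # each micro-batch still costs a full forward/backward pass
```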
Does increasing the batch size guarantee more stable gradient descent?
In which scenarios should gradient accumulation be used?