tuning_playbook
Do all of the benefits of using a larger batch size assume that training throughput increases?
- All benefits of using a larger batch size assume the training throughput increases. If it doesn't, fix the bottleneck or use the smaller batch size.
- Gradient accumulation simulates a larger batch size than the hardware can support and therefore does not provide any throughput benefits. It should generally be avoided in applied work (see the sketch after this list).
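For reference, here is a minimal sketch of what gradient accumulation looks like in a training loop: gradients from several micro-batches are summed before a single optimizer step, so the update matches a larger effective batch size, but each micro-batch still runs through the hardware sequentially, which is why throughput does not improve. The model, optimizer, data, and step counts below are placeholders for illustration, not anything prescribed by the playbook.

```python
import torch

# Hypothetical stand-ins: a tiny model, SGD, and fake micro-batches.
accumulation_steps = 4  # effective batch size = micro-batch size * 4
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(16)]

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    loss = loss_fn(model(inputs), labels)
    # Scale the loss so the accumulated gradient averages over the
    # effective batch rather than summing micro-batch averages.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per effective batch
        optimizer.zero_grad()  # each micro-batch still costs a full forward/backward pass
```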
Does increasing the batch size guarantee more stable gradient descent?
In which scenarios should gradient accumulation be used?