tuning_playbook
Adding some caveats about hyperparameter tuning at large batch sizes
This is a great doc, thanks for putting together so much accumulated wisdom in one place!
In the section "Changing the batch size requires re-tuning most hyperparameters" I think it is worth highlighting that one of the things we found in Shallue et al. (2018) is that as the batch size grows, the range of hyperparameters that achieves good performance becomes narrower (see, e.g., Figs. 9 & 13). So while it is still possible to find a set of hyperparameters that performs well at nearly any batch size, in practice it may require much more tuning at very large batch sizes. Consequently, although a single training run may cost roughly the same amount of compute at a small batch size as at a large one, once that cost is multiplied over the entire set of experiments needed to find the optimal hyperparameters, larger batch sizes will often require more compute overall.
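To make that compute accounting concrete, here is a rough back-of-the-envelope sketch (the dataset size, batch sizes, and trial counts below are made-up placeholders, not measured values): under a fixed epoch budget, the examples processed per run are roughly independent of batch size, so the extra cost of large batches only shows up once you multiply by the number of tuning trials needed to find a good configuration.

```python
# Back-of-the-envelope compute accounting under a fixed epoch budget.
# All numbers below are illustrative placeholders, not measurements.

dataset_size = 1_000_000   # hypothetical number of training examples
epochs = 10                # fixed epoch budget

# (batch_size, hypothetical number of tuning trials needed to find good hparams)
for batch_size, trials in [(256, 20), (4096, 20), (65536, 100)]:
    steps_per_run = epochs * dataset_size // batch_size
    examples_per_run = steps_per_run * batch_size   # ~ epochs * dataset_size for every batch size
    total_examples = trials * examples_per_run      # cost of the whole hyperparameter search
    print(f"batch={batch_size:6d}  steps/run={steps_per_run:7d}  "
          f"examples/run={examples_per_run:.2e}  search total={total_examples:.2e}")
```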
This is an interesting observation. I believe from a theoretical perspective, this is expected as the batch size grows and the epoch budget is fixed. If instead we fix a step budget, I would expect the opposite trend holds---bigger batch sizes will have a larger volume of hparams that hit the target performance.
This is an important distinction: it isn't that we should avoid large batch sizes because they are harder to tune; quite the opposite (assuming a fixed step budget). I still view it as best practice to use large batch sizes, just ensure you train long enough. Would love to hear thoughts/comments from others on this issue though. Part of our goal here is to open up discussion/debate on these topics that are rarely discussed openly.
> I believe from a theoretical perspective, this is expected as the batch size grows and the epoch budget is fixed. If instead we fix a step budget, I would expect the opposite trend holds---bigger batch sizes will have a larger volume of hparams that hit the target performance.
Yes, that observation is supported by Fig. 9 in Shallue et al. (2018), which compares a fixed epoch budget vs. a fixed step budget. I think it's worth noting that the approach to choosing your batch size depends on a practitioner's constraints. If compute (or, equivalently, dollars) is one's primary constraint, then comparing batch sizes with the number of epochs held fixed is most relevant. In that situation, training on very large batches may not be efficient. If, on the other hand, one has effectively unlimited compute/dollars, one tends instead to be limited by wall-clock time, in which case comparing with a fixed step budget is most relevant. In that case it is generally better to train on very large batches.
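To make the two budgeting conventions concrete, here is a small sketch (the numbers are placeholders, not taken from the paper): with a fixed epoch budget the total examples processed, and hence compute/dollars, stays constant as the batch size grows, whereas with a fixed step budget the larger batch processes more examples but takes no more sequential steps, i.e., no more wall-clock time given enough accelerators.

```python
# Comparing a fixed epoch budget vs. a fixed step budget when scaling batch size.
# All numbers are illustrative placeholders.

dataset_size = 1_000_000
batch_sizes = (1024, 8192)

# Fixed epoch budget: total examples (~ compute/dollars) is constant,
# so the larger batch runs fewer sequential steps.
epochs = 10
for b in batch_sizes:
    steps = epochs * dataset_size // b
    print(f"fixed epochs: batch={b:5d}  steps={steps:6d}  examples={steps * b:.2e}")

# Fixed step budget: sequential steps (~ wall-clock time with enough accelerators)
# is constant, so the larger batch processes more examples and costs more compute.
steps = 10_000
for b in batch_sizes:
    print(f"fixed steps:  batch={b:5d}  steps={steps:6d}  examples={steps * b:.2e}")
```

Which column matters (total examples vs. sequential steps) is exactly the practitioner constraint described above: compute/dollars favors the fixed epoch comparison, wall-clock time favors the fixed step comparison.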
Closing this issue since it's been idle for a while. Please chime in if you have more questions or things to say!