pytorch-lightning
Enable batch size finder for distributed strategies
Description & Motivation
It's not clear why the batch size finder is currently disabled here.
Pitch
There should not be a big difference in how it works vs. the LR finder: all ranks try the same batch size under a try/except, then all-reduce a 1 or 0 based on whether the attempt succeeded, and the size is grown or shrunk with the given search strategy. This repeats until all ranks are successful (see the sketch below).
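For illustration, a minimal sketch of this idea using plain `torch.distributed` (the helper names `try_batch_size` and `find_batch_size` are hypothetical, the process group is assumed to be initialized already, and the workload is a placeholder allocation rather than real training steps):

```python
import torch
import torch.distributed as dist


def try_batch_size(batch_size: int) -> bool:
    """Stand-in for running a few training steps at `batch_size`.

    Returns True if the trial completed, False on a CUDA OOM.
    """
    try:
        # Placeholder workload; the real finder would run the training loop.
        x = torch.empty(batch_size, 4096, 4096, device="cuda")
        del x
        return True
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise  # only treat OOMs as a failed trial
        torch.cuda.empty_cache()
        return False


def find_batch_size(start: int = 2, max_trials: int = 25) -> int:
    """All ranks try the same size and agree on the outcome via an all-reduce."""
    size, largest_ok = start, 0
    for _ in range(max_trials):
        ok = try_batch_size(size)  # every rank attempts the same size
        # MIN-reduce: the result is 0 everywhere if any single rank failed.
        flag = torch.tensor(float(ok), device="cuda")
        dist.all_reduce(flag, op=dist.ReduceOp.MIN)
        if flag.item() == 0:
            break  # at least one rank ran out of memory; stop growing
        largest_ok = size
        size *= 2  # "power" strategy: double on success
    return largest_ok
```

A MIN-reduce is used so that a single failing rank forces the same "stop" decision on every rank.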
Alternatives
Manual HPO
Additional context
No response
cc @borda
Any context on this is greatly appreciated, @awaelchli @carmocca. Thanks!
The batch size finder needs to handle exceptions (CUDA OOMs) to determine whether a batch fits or not. This requires synchronization among the ranks so that the decision is consistent: if a rank that OOMed took a different branch than the others and skipped a collective call, the job would deadlock. That logic is not implemented, which is why it is not supported.
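To make that requirement concrete, here is a hedged sketch of the missing synchronization (hypothetical helper names, plain `torch.distributed`): each rank converts its local OOM into a flag, and the flags are reduced before any rank acts on the result, so every rank takes the same branch on the next iteration:

```python
import torch
import torch.distributed as dist


def is_cuda_oom(exc: RuntimeError) -> bool:
    """Heuristic OOM check; PyTorch raises CUDA OOMs as RuntimeError."""
    return "out of memory" in str(exc)


def trial_fits_everywhere(step_fn) -> bool:
    """Run one trial step and return a decision that is identical on all ranks.

    A rank that OOMs must not silently leave the search loop: if it skipped
    the next collective while the other ranks entered it, the job would hang.
    """
    try:
        step_fn()
        ok = True
    except RuntimeError as exc:
        if not is_cuda_oom(exc):
            raise  # surface anything that is not an OOM
        torch.cuda.empty_cache()
        ok = False
    flag = torch.tensor(float(ok), device="cuda")
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)  # 0 if any rank OOMed
    return bool(flag.item())
```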