
Enable batch size finder for distributed strategies

clumsy opened this issue 1 year ago • 2 comments

Description & Motivation

It's not clear why the batch size finder is currently disabled for distributed strategies here.

Pitch

There should not be a big difference in how it works vs. the LR finder. E.g., all ranks try the same batch size inside a try/except that catches OOM, then all-reduce a success flag (1 if the batch fit, 0 if it OOMed). The step is repeated with the given scaling strategy (e.g. power-of-two growth) until some rank fails, at which point the last size that succeeded on all ranks is used.
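The pitched loop could be sketched roughly as below. This is a hypothetical illustration, not Lightning's API: `try_batch_size` stands in for running a step under a try/except for OOM, and `all_reduce_min` stands in for a `torch.distributed.all_reduce` with `ReduceOp.MIN`, simulated here in pure Python so the idea is self-contained.

```python
# Hypothetical sketch of the pitched approach (illustrative names only).
# Each rank tries the same batch size; a min-reduction over per-rank
# success flags (1 = fit, 0 = OOM) keeps every rank's decision consistent.

def try_batch_size(rank, batch_size, memory_limits):
    # Stand-in for one training step wrapped in try/except (OOM -> 0).
    return 1 if batch_size <= memory_limits[rank] else 0

def all_reduce_min(flags):
    # Stand-in for torch.distributed.all_reduce(flag, op=ReduceOp.MIN):
    # after the reduction, every rank holds the same minimum value.
    return min(flags)

def find_batch_size(memory_limits, start=2, max_trials=10):
    world_size = len(memory_limits)
    size, last_good = start, None
    for _ in range(max_trials):
        flags = [try_batch_size(r, size, memory_limits)
                 for r in range(world_size)]
        if all_reduce_min(flags):
            # Every rank fit the batch: remember it and double (power scaling).
            last_good, size = size, size * 2
        else:
            # At least one rank OOMed; all ranks see 0 and stop together.
            break
    return last_good

# Rank 1 has the least memory, so it bounds the global batch size.
print(find_batch_size(memory_limits=[64, 24, 48]))  # -> 16
```

Because every rank reduces the same flag before deciding, no rank can race ahead to a different batch size, which is exactly the synchronization the real implementation would need.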

Alternatives

Manual HPO

Additional context

No response

cc @borda

clumsy avatar Apr 19 '24 22:04 clumsy

Any context on this is greatly appreciated, @awaelchli @carmocca. Thanks!

clumsy avatar Apr 19 '24 22:04 clumsy

The batch size finder needs to handle exceptions (OOM) to determine whether a batch size fits. In a distributed setting this requires synchronization among the ranks so that the decision is consistent everywhere. That logic is not implemented, which is why it is not supported.

awaelchli avatar Jun 22 '24 22:06 awaelchli