returnn
Torch gradient_checkpoint_scope potential memory leak
Training was running fine for 29 subepochs but then crashed with a CPU OOM.
While I sometimes see CPU OOMs in my setup, that usually only happens after much longer trainings, so getting a CPU OOM this early was unexpected. I'm not sure whether this is caused by gradient_checkpoint_scope or by something else, but the use of gradient_checkpoint_scope is the only real difference from my earlier setups. Still, it could just be a random hiccup, so let's see whether this happens more frequently now.