returnn
Torch gradient_checkpoint_scope potential memory leak
Training was running fine for 29 subepochs but then crashed with a CPU OOM.
While I sometimes see CPU OOMs in my setup, that usually only happens after much longer trainings, so getting a CPU OOM this early was unexpected. I'm not sure whether this is caused by gradient_checkpoint_scope or by something else, but the use of gradient_checkpoint_scope is the only real difference from my earlier setups. Still, it could just be a random hiccup, so let's see whether this happens more frequently now.