Albert Zeyer
Actually, let's keep this open until we get some response, and then wait until we can update `_can_exit_saved_tensors_hooks_inside_hooks`.
Also note, the current solution may not be optimal. The current potential ways that we would exit the `torch.autograd.graph.saved_tensors_hooks`:
- `gradient_checkpoint_scope.__exit__`. But likely not, as there are likely refs...
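To make the discussion concrete, here is a minimal sketch of how `torch.autograd.graph.saved_tensors_hooks` behaves as a context manager: `pack_hook` runs whenever autograd saves a tensor for backward, `unpack_hook` when it is retrieved, and exiting the `with` block pops the hooks. The hooks below are illustrative only, not what `gradient_checkpoint_scope` actually installs.

```python
import torch

packed = []  # record which tensors autograd saved (illustration only)

def pack_hook(t: torch.Tensor) -> torch.Tensor:
    packed.append(t.shape)
    return t  # could instead return a cheap handle, e.g. for offloading

def unpack_hook(obj: torch.Tensor) -> torch.Tensor:
    return obj  # inverse of pack_hook

x = torch.randn(3, requires_grad=True)
# Exiting this `with` block uninstalls the hooks; the question above is
# *when* it is safe to trigger that exit from within the hooks themselves.
with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    y = (x * x).sum()
y.backward()
```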
Ah, it's a [heisenbug](https://en.wikipedia.org/wiki/Heisenbug). With `CUDA_LAUNCH_BLOCKING=1`, the bug does not appear anymore. (Or maybe different hardware? Now running on `cn-238`, but it's also 4x1080, just as before.)
Do you know about `use_train_proc_manager`? Do you have that enabled? It currently only works for single-GPU training, but it has been extremely helpful. It should catch just about any case. Whenever...
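For reference, enabling this is a one-line RETURNN config setting; a hedged sketch (check the option name against your RETURNN version):

```python
# RETURNN config fragment (assumption: option name as in current RETURNN).
# The train proc manager watches the training subprocess and restarts it
# when it crashes; currently only for single-GPU training.
use_train_proc_manager = True
```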
> Can we retry the train step instead

I don't think there is any safe way to recover from this within the running proc. You also cannot easily know what...
Btw, I'm pretty sure this is a common problem for all large model training. I have often read about it. And I also read that they all developed some...
Can you show the list of contents of the file? I specifically wonder what `__returnn_config__._weight_decay_blacklist` is exactly. What is `__returnn_config__`? Why is it in there? What type is it? What does...
Why does the returnn config become part of the optimizer state?
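One way to answer both questions is to just load the optimizer state file and list its top-level entries and their types; a minimal sketch (the helper name and path are hypothetical):

```python
import torch

def list_checkpoint_keys(path: str):
    """Print and return the top-level keys of a saved (optimizer) state file,
    e.g. to check whether something like `__returnn_config__` ended up in it."""
    state = torch.load(path, map_location="cpu")
    for key, value in state.items():
        print(key, type(value).__name__)
    return sorted(state.keys())
```

Usage would be e.g. `list_checkpoint_keys("net-model/epoch.042.opt.pt")` (filename hypothetical); for a plain PyTorch optimizer state dict you would expect only `state` and `param_groups` at the top level.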