Albert Zeyer
Actually, let's keep this open until we get some response, and then wait until we can update `_can_exit_saved_tensors_hooks_inside_hooks`.
Also note, the current solution may not be optimal. The current potential ways that we would exit the `torch.autograd.graph.saved_tensors_hooks`:
- `gradient_checkpoint_scope.__exit__`. But likely not, as there are likely refs...
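To make the discussion concrete, here is a minimal sketch of how `torch.autograd.graph.saved_tensors_hooks` behaves as a context manager: `pack_hook` runs whenever autograd saves a tensor for backward, `unpack_hook` when it is retrieved, and exiting the `with` block pops the hooks. The hooks below are illustrative only, not what `gradient_checkpoint_scope` actually installs.

```python
import torch

packed = []  # record which tensors autograd saved (illustration only)

def pack_hook(t: torch.Tensor) -> torch.Tensor:
    packed.append(t.shape)
    return t  # could instead return a cheap handle, e.g. for offloading

def unpack_hook(obj: torch.Tensor) -> torch.Tensor:
    return obj  # inverse of pack_hook

x = torch.randn(3, requires_grad=True)
# Exiting this `with` block uninstalls the hooks; the question above is
# *when* it is safe to trigger that exit from within the hooks themselves.
with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    y = (x * x).sum()
y.backward()
```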
Ah, it's a [heisenbug](https://en.wikipedia.org/wiki/Heisenbug). With `CUDA_LAUNCH_BLOCKING=1`, the bug does not appear anymore. (Or maybe different hardware? Now running on `cn-238`, but it's also 4x1080, just as before.)
Do you know about `use_train_proc_manager`? Do you have that enabled? It currently only works for single-GPU training, but it has been extremely helpful. It should catch just about any case. Whenever...
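For reference, enabling this is a one-line RETURNN config setting; a hedged sketch (check the option name against your RETURNN version):

```python
# RETURNN config fragment (assumption: option name as in current RETURNN).
# The train proc manager watches the training subprocess and restarts it
# when it crashes; currently only for single-GPU training.
use_train_proc_manager = True
```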
> Can we retry the train step instead

I don't think there is any safe way to recover from this within the running proc. You also cannot easily know what...
Btw, I'm pretty sure this is a common problem for all large model training. I have often read about it. And I also read that they all developed some...
Can you show the list of contents of the file? I specifically wonder what `__returnn_config__._weight_decay_blacklist` is exactly. What is `__returnn_config__`? Why is it in there? What type is it? What does...
Why does the returnn config become part of the optimizer state?
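One way to answer both questions is to just load the optimizer state file and list its top-level entries and their types; a minimal sketch (the helper name and path are hypothetical):

```python
import torch

def list_checkpoint_keys(path: str):
    """Print and return the top-level keys of a saved (optimizer) state file,
    e.g. to check whether something like `__returnn_config__` ended up in it."""
    state = torch.load(path, map_location="cpu")
    for key, value in state.items():
        print(key, type(value).__name__)
    return sorted(state.keys())
```

Usage would be e.g. `list_checkpoint_keys("net-model/epoch.042.opt.pt")` (filename hypothetical); for a plain PyTorch optimizer state dict you would expect only `state` and `param_groups` at the top level.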