RuntimeError: CUDA error: an illegal memory access was encountered

Open albertz opened this issue 1 year ago • 6 comments

...
ep 1 train, step 294, ctc_4 4.553, ctc_8 4.531, ctc 4.510, num_seqs 11, max_size:time 201384, max_size:out-spatial 149, mem_usage:cuda:0 5.9GB, 0.411 sec/step
ep 1 train, step 294, ctc_4 4.516, ctc_8 4.528, ctc 4.528, num_seqs 10, max_size:time 239009, max_size:out-spatial 133, mem_usage:cuda:2 5.9GB, 0.455 sec/step
ep 1 train, step 295, ctc_4 4.569, ctc_8 4.623, ctc 4.650, num_seqs 9, max_size:time 245433, max_size:out-spatial 136, mem_usage:cuda:1 5.9GB, 0.404 sec/step
ep 1 train, step 295, ctc_4 4.467, ctc_8 4.479, ctc 4.519, num_seqs 9, max_size:time 247369, max_size:out-spatial 135, mem_usage:cuda:3 5.9GB, 0.428 sec/step
ep 1 train, step 295, ctc_4 4.500, ctc_8 4.590, ctc 4.528, num_seqs 9, max_size:time 245081, max_size:out-spatial 131, mem_usage:cuda:0 5.9GB, 0.405 sec/step
ep 1 train, step 295, ctc_4 4.620, ctc_8 4.670, ctc 4.536, num_seqs 10, max_size:time 236369, max_size:out-spatial 135, mem_usage:cuda:2 5.9GB, 0.476 sec/step
ep 1 train, step 296, ctc_4 4.598, ctc_8 4.540, ctc 4.563, num_seqs 9, max_size:time 248953, max_size:out-spatial 156, mem_usage:cuda:3 5.9GB, 0.400 sec/step
ep 1 train, step 296, ctc_4 4.707, ctc_8 4.549, ctc 4.544, num_seqs 12, max_size:time 199296, max_size:out-spatial 131, mem_usage:cuda:0 5.9GB, 0.408 sec/step
ep 1 train, step 296, ctc_4 4.515, ctc_8 4.595, ctc 4.611, num_seqs 10, max_size:time 223920, max_size:out-spatial 121, mem_usage:cuda:1 5.9GB, 0.484 sec/step
ep 1 train, step 296, ctc_4 4.560, ctc_8 4.889, ctc 4.619, num_seqs 10, max_size:time 236457, max_size:out-spatial 144, mem_usage:cuda:2 5.9GB, 0.405 sec/step
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

...
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/data/extern_data.py", line 55, in raw_dict_to_extern_data
    line: data.raw_tensor = raw_tensor.to(device)
    locals:
      data = <local> Tensor{'data', [B?,T|'time'[B?],F|F'audio'(1)]}
      data.raw_tensor = <local> None
      raw_tensor = <local> tensor[9, 242353, 1] n=2181177 (8.3Mb) x∈[-1.033, 1.001] μ=0.000 σ=0.087
      raw_tensor.to = <local> <built-in method to of Tensor object at 0x7ca9d419d6d0>
      device = <local> 'cuda:0', len = 6
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

...
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7cab17f92617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7cab17f4d98d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7cab182cd9f8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x169b6 (0x7cab182969b6 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1947d (0x7cab1829947d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1989d (0x7cab1829989d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x513c46 (0x7caad8d30c46 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x55ca7 (0x7cab17f77ca7 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7cab17f6fcb3 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7cab17f6fe49 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x4bd16c7 (0x7caac64da6c7 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::deleteNode(torch::autograd::Node*) + 0xa9 (0x7caac64d2b59 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: std::_Sp_counted_deleter<torch::autograd::generated::SumBackward0*, void (*)(torch::autograd::Node*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xe (0x7caac5baf1ee in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x4ba8990 (0x7caac64b1990 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: c10::TensorImpl::~TensorImpl() + 0x1da (0x7cab17f6fcaa in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #15: c10::TensorImpl::~TensorImpl() + 0x9 (0x7cab17f6fe49 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #16: <unknown function> + 0x7c84d8 (0x7caad8fe54d8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #17: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7caad8fe5865 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #33: <unknown function> + 0x291b7 (0x7cab445ab1b7 in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #34: __libc_start_main + 0x7c (0x7cab445ab26c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #35: _start + 0x21 (0x401071 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11)

Fatal Python error: Aborted

Current thread 0x00007cab44581000 (most recent call first):
  Garbage-collecting
  <no Python frame>
Signal handler: signal 6:
/var/tmp/zeyer/returnn_native/native_signal_handler/c14b833885/native_signal_handler.so(signal_handler+0x4b)[0x7cab18e3b20b]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7cab445bef40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7cab44608e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7cab445beea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7cab445bef40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7cab44608e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7cab445beea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(abort+0xc2)[0x7cab445aa45c]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(+0xa58d9)[0x7cab1992b8d9]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(+0xb0f0a)[0x7cab19936f0a]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(+0xaff79)[0x7cab19935f79]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(__gxx_personality_v0+0x86)[0x7cab19936696]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libgcc_s.so.1(+0x17934)[0x7cab43ce2934]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7cab43ce338d]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x117f7)[0x7cab182917f7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x1989d)[0x7cab1829989d]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x513c46)[0x7caad8d30c46]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(+0x55ca7)[0x7cab17f77ca7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD1Ev+0x1e3)[0x7cab17f6fcb3]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x7cab17f6fe49]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x4bd16c7)[0x7caac64da6c7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZN5torch8autograd10deleteNodeEPNS0_4NodeE+0xa9)[0x7caac64d2b59]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZNSt19_Sp_counted_deleterIPN5torch8autograd9generated12SumBackward0EPFvPNS1_4NodeEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0xe)[0x7caac5baf1ee]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x4ba8990)[0x7caac64b1990]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD1Ev+0x1da)[0x7cab17f6fcaa]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x7cab17f6fe49]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x7c84d8)[0x7caad8fe54d8]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(_Z28THPVariable_subclass_deallocP7_object+0x305)[0x7caad8fe5865]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1edb1d)[0x7cab44a63b1d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1ec6e3)[0x7cab44a626e3]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1e1a5d)[0x7cab44a57a5d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1ec564)[0x7cab44a62564]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1e003d)[0x7cab44a5603d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1dfe7d)[0x7cab44a55e7d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1e1a56)[0x7cab44a57a56]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1ec564)[0x7cab44a62564]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1e003d)[0x7cab44a5603d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x258a56)[0x7cab44acea56]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x29f85b)[0x7cab44b1585b]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x29ff60)[0x7cab44b15f60]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(Py_FinalizeEx+0x7b)[0x7cab44b0a92b]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(Py_RunMain+0x180)[0x7cab44b14d40]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(Py_BytesMain+0x29)[0x7cab44b14ab9]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x291b7)[0x7cab445ab1b7]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(__libc_start_main+0x7c)[0x7cab445ab26c]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11(_start+0x21)[0x401071]
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.6.9.55]:26753
...

I have seen this error a couple of times before, but it was rare, and a restart of the job usually "fixed" it, so I attributed it to some hardware hiccup (we have many similar issues with our 1080s, e.g. #1520, #1558, #1496, ...).

However, here I have a case which does not go away after restarts and always occurs at exactly the same step.

It also happens for many other similar setups where the vocab dimension is low, so that is probably the key factor: other setups with a higher vocab dim work just fine, but all of the setups with SPM 1k, 512 and 128 vocabs have now crashed with this error. The step where they crash differs slightly depending on the vocab. Maybe some long sequence triggers this in the CTC calculation.
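(To probe that hypothesis in isolation, something like the following could be tried. All shapes are made up here, only roughly following the log above, and this is not a confirmed trigger of the bug.)

```python
# Hypothetical standalone probe for the "long sequence + small vocab in CTC" idea.
# Shapes are made up, loosely following the log above (raw audio ~250k samples,
# assuming ~6x downsampling in the encoder, SPM 128 vocab + blank); this is not
# a confirmed trigger, just a way to exercise the CTC loss in isolation.
import torch
import torch.nn.functional as F


def try_ctc(time_len=250_000 // 6, target_len=150, vocab_size=129, batch=9, device="cuda"):
    log_probs = torch.randn(time_len, batch, vocab_size, device=device).log_softmax(-1)
    log_probs.requires_grad_(True)  # (T, N, C), as expected by F.ctc_loss
    targets = torch.randint(1, vocab_size, (batch, target_len), device=device)
    input_lengths = torch.full((batch,), time_len, dtype=torch.int64, device=device)
    target_lengths = torch.full((batch,), target_len, dtype=torch.int64, device=device)
    loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
    loss.backward()
    torch.cuda.synchronize()  # force any asynchronous kernel error to surface here
    print("ok, loss =", loss.item())


if __name__ == "__main__":
    try_ctc()
```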

This is also multi-GPU training, but I'm not sure whether that is relevant.

Some more log (stripped down):

RETURNN starting up, version 1.20240708.175624+git.853bb23d, date/time 2024-07-09-07-42-52 (UTC+0000), pid 660812, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN starting up, version 1.20240708.175624+git.853bb23d, date/time 2024-07-09-07-42-52 (UTC+0000), pid 660809, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN starting up, version 1.20240708.175624+git.853bb23d, date/time 2024-07-09-07-42-52 (UTC+0000), pid 660811, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/output/returnn.config']
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/output/returnn.config']
Hostname: cn-255
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/output/returnn.config']
Hostname: cn-255
Hostname: cn-255 
RETURNN starting up, version 1.20240708.175624+git.853bb23d, date/time 2024-07-09-07-42-52 (UTC+0000), pid 660810, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.5AWTwj5VHV2P/output/returnn.config']
Hostname: cn-255
Installed native_signal_handler.so.
Installed native_signal_handler.so.
Installed native_signal_handler.so.
Installed native_signal_handler.so.
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
Torch: Hostname cn-255, pid 660811, using GPU 2.
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
Torch: Hostname cn-255, pid 660812, using GPU 3.
Torch: Hostname cn-255, pid 660810, using GPU 1.
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
Torch: Hostname cn-255, pid 660809, using GPU 0.
CUDA_VISIBLE_DEVICES is set to '0,1,2,3'.
Available CUDA devices:
CUDA_VISIBLE_DEVICES is set to '0,1,2,3'.
CUDA_VISIBLE_DEVICES is set to '0,1,2,3'.
Available CUDA devices:
Available CUDA devices:
CUDA_VISIBLE_DEVICES is set to '0,1,2,3'.
Available CUDA devices:
  1/4: cuda:0
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 0
  2/4: cuda:1
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 1
...
ep 1 train, step 97, ctc_4 4.669, ctc_8 4.739, ctc 4.654, num_seqs 11, max_size:time 209881, max_size:out-spatial 129, mem_usage:cuda:2 5.6GB, 0.441 sec/step
ep 1 train, step 97, ctc_4 4.700, ctc_8 4.769, ctc 5.140, num_seqs 17, max_size:time 137448, max_size:out-spatial 104, mem_usage:cuda:3 5.6GB, 0.424 sec/step
ep 1 train, step 98, ctc_4 4.736, ctc_8 4.656, ctc 4.864, num_seqs 13, max_size:time 184185, max_size:out-spatial 105, mem_usage:cuda:1 5.6GB, 0.418 sec/step
ep 1 train, step 98, ctc_4 4.655, ctc_8 4.644, ctc 4.731, num_seqs 15, max_size:time 157360, max_size:out-spatial 99, mem_usage:cuda:3 5.6GB, 0.396 sec/step
ep 1 train, step 98, ctc_4 4.710, ctc_8 4.653, ctc 4.759, num_seqs 13, max_size:time 172080, max_size:out-spatial 109, mem_usage:cuda:0 5.6GB, 0.448 sec/step
ep 1 train, step 98, ctc_4 4.644, ctc_8 5.009, ctc 4.551, num_seqs 11, max_size:time 212609, max_size:out-spatial 115, mem_usage:cuda:2 5.6GB, 0.458 sec/step
cn-255:660809:660809 [0] NCCL INFO Bootstrap : Using enp5s0:10.6.9.55<0>
cn-255:660809:660809 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
cn-255:660809:660809 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
cn-255:660809:660809 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.1+cuda12.1
MEMORY: main proc python3.11(660810) increased RSS: rss=1.6GB pss=1.2GB uss=1.0GB shared=549.1MB
MEMORY: main proc python3.11(660811) increased RSS: rss=1.6GB pss=1.2GB uss=1.0GB shared=551.5MB
MEMORY: total (main 660810, 2024-07-09, 07:44:51, 21 procs): pss=6.2GB uss=5.9GB
MEMORY: main proc python3.11(660812) increased RSS: rss=1.5GB pss=1.2GB uss=1.0GB shared=549.4MB
cn-255:660809:663643 [0] NCCL INFO NET/IB : No device found.
cn-255:660809:663643 [0] NCCL INFO NET/Socket : Using [0]enp5s0:10.6.9.55<0>
cn-255:660809:663643 [0] NCCL INFO Using network Socket
cn-255:660809:663643 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
cn-255:660809:663643 [0] NCCL INFO NVLS multicast support is not available on dev 0
cn-255:660809:663643 [0] NCCL INFO Channel 00/02 :    0   1   2   3
cn-255:660809:663643 [0] NCCL INFO Channel 01/02 :    0   1   2   3
cn-255:660809:663643 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
cn-255:660809:663643 [0] NCCL INFO P2P Chunksize set to 131072
cn-255:660809:663643 [0] NCCL INFO Channel 00 : 0[2000] -> 1[3000] via SHM/direct/direct
cn-255:660809:663643 [0] NCCL INFO Channel 01 : 0[2000] -> 1[3000] via SHM/direct/direct
cn-255:660809:663643 [0] NCCL INFO Connected all rings
cn-255:660809:663643 [0] NCCL INFO Connected all trees
cn-255:660809:663643 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cn-255:660809:663643 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MEMORY: total (main 660811, 2024-07-09, 07:44:51, 21 procs): pss=6.3GB uss=6.0GB
cn-255:660809:663643 [0] NCCL INFO comm 0x1214fd90 rank 0 nranks 4 cudaDev 0 busId 2000 commId 0x86dd52ba5e99125f - Init COMPLETE
ep 1 train, step 99, ctc_4 4.801, ctc_8 4.946, ctc 4.701, num_seqs 11, max_size:time 215425, max_size:out-spatial 116, mem_usage:cuda:2 5.6GB, 2.135 sec/step
ep 1 train, step 99, ctc_4 4.787, ctc_8 4.728, ctc 4.716, num_seqs 13, max_size:time 180401, max_size:out-spatial 106, mem_usage:cuda:3 5.6GB, 2.225 sec/step
ep 1 train, step 99, ctc_4 4.687, ctc_8 4.699, ctc 4.861, num_seqs 12, max_size:time 187969, max_size:out-spatial 110, mem_usage:cuda:1 5.6GB, 2.252 sec/step
ep 1 train, step 99, ctc_4 4.756, ctc_8 4.632, ctc 4.686, num_seqs 12, max_size:time 193520, max_size:out-spatial 114, mem_usage:cuda:0 5.6GB, 2.207 sec/step
MEMORY: total (main 660812, 2024-07-09, 07:44:52, 21 procs): pss=6.2GB uss=5.9GB
ep 1 train, step 100, ctc_4 4.880, ctc_8 4.915, ctc 4.879, num_seqs 15, max_size:time 154224, max_size:out-spatial 109, mem_usage:cuda:3 5.6GB, 0.468 sec/step
...

Log-file at i6: /u/zeyer/setups/combined/2021-05-31/alias/ctc/v6-relPosAttDef-bhv20-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-maxSeqLenAudio19_5-wd1e_2-lrlin1e_5_295k-featBN-speedpertV2-spm128/train/engine/i6_core.returnn.training.ReturnnTrainingJob.5AWTwj5VHV2P.run.7998996.1

albertz commented on Jul 09 '24, 09:07

Ah, it's a heisenbug. With CUDA_LAUNCH_BLOCKING=1, the bug does not appear anymore. (Or maybe it is different hardware? It is now running on cn-238, but that is also a 4x1080 node, just as before.)
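(For reference, a sketch of how one can enable that from Python; exporting the variable in the shell or job script that launches RETURNN works just as well.)

```python
# Sketch: enable synchronous CUDA kernel launches for debugging, as suggested in
# the error message above. The env var is only picked up if it is set before the
# CUDA context is created, so set it before importing/using torch (or export it
# in the shell / job script). Note that TORCH_USE_CUDA_DSA is a build-time option
# of PyTorch, so it cannot simply be set here at runtime.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # stack traces then point at the actual failing op

import torch  # noqa: E402  (import only after setting the env var)
```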

albertz commented on Jul 09 '24, 11:07

I am encountering these issues a lot, and currently deal with them by automatically restarting the training job (via sisyphus, by deleting the error condition). This is suboptimal because you lose the job allocation (and your queue slot) every time this error occurs. Can we retry the train step instead, or catch this from the train proc manager and restart the trainer that way?

NeoLegends commented on Apr 24 '25, 09:04

Do you know about use_train_proc_manager? Do you have that enabled? It currently only works for single-GPU training, but it has been extremely helpful. It should catch just about any case: whenever it sees that there was some progress in training (at least one epoch trained) and the proc crashed, it restarts it.
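(For reference, enabling it is just a flag in the RETURNN config; minimal sketch, only the option name is taken from this discussion.)

```python
# Minimal sketch of enabling the train proc manager in a RETURNN config
# (a Python config file). Only `use_train_proc_manager` is taken from this
# discussion; everything else in the config stays as it is.
use_train_proc_manager = True  # a parent proc watches training and restarts it after a crash
```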

On what hardware do you observe this? On the H100s, I don't think I ever saw this (or maybe I just never noticed because of use_train_proc_manager, I'm not sure).

albertz commented on Apr 24 '25, 10:04

I do use use_train_proc_manager, but if it only works for single-GPU training, that's probably why it doesn't have an effect for me.

I am observing this mainly on A5000 and A6000 series GPUs.

NeoLegends commented on Apr 25 '25, 08:04

Can we retry the train step instead

I don't think there is any safe way to recover from this within the running proc. You also cannot easily know which parts of the current train step were already executed: maybe it has already updated some of the params but not all of them? Maybe some of the memory is corrupted now?

I think the only safe/sane way is something like what we do with use_train_proc_manager.
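(To make that concrete, the scheme is essentially an outer supervisor loop around the training subprocess, roughly like the following sketch. This is simplified and not the actual RETURNN implementation; the checkpoint file naming is made up.)

```python
# Simplified sketch of the proc-manager scheme (not the actual RETURNN code):
# run training as a subprocess and restart it after a crash, but only as long
# as it keeps making progress (i.e. new checkpoints appear between attempts).
# The checkpoint file naming ("epoch.<N>.pt") is made up for this sketch.
import glob
import os
import re
import subprocess
import sys


def newest_checkpoint_epoch(model_dir):
    epochs = [
        int(m.group(1))
        for f in glob.glob(os.path.join(model_dir, "epoch.*.pt"))
        if (m := re.search(r"epoch\.(\d+)\.pt$", f))
    ]
    return max(epochs, default=0)


def run_with_restarts(cmd, model_dir, max_restarts_without_progress=1):
    last_epoch = newest_checkpoint_epoch(model_dir)
    restarts_without_progress = 0
    while True:
        ret = subprocess.call(cmd)
        if ret == 0:
            return 0  # training finished normally
        epoch = newest_checkpoint_epoch(model_dir)
        if epoch > last_epoch:
            last_epoch = epoch
            restarts_without_progress = 0
        else:
            restarts_without_progress += 1
            if restarts_without_progress > max_restarts_without_progress:
                return ret  # it keeps crashing without making progress: give up
        print(f"training crashed (exit {ret}), restarting from epoch {last_epoch}", file=sys.stderr)


# e.g.: run_with_restarts([sys.executable, "rnn.py", "returnn.config"], "output/models")
```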

So maybe we should think about how we can make that work for distributed training. The question is at what level we want to do the restart. Do we restart the whole torchrun/mpirun? Does that also work well with multi-node training? Or do we only restart the crashing proc, but then we somehow need to tell the other procs to reset to the previously stored epoch. Would restarting just a single proc even work?

I think that's not so simple. We should probably also write some test case where we can exercise this easily, e.g. one that runs a dummy torchrun or similar to emulate distributed training; otherwise I don't see how we can even develop and test this.
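(As a starting point, such a test could look roughly like this: spawn a few dummy worker procs instead of a real training, kill one of them, and check that a full restart of the group still completes. Everything in the sketch is made up for the test; nothing is wired to RETURNN or torchrun yet.)

```python
# Rough sketch of a GPU-free test that emulates "one proc of a distributed
# training crashes": spawn N dummy workers, kill one, then "restart" the whole
# group and check it completes. Everything here is made up for the test; it is
# not wired to the actual RETURNN proc manager or torchrun yet.
import os
import subprocess
import sys
import tempfile

WORKER_CODE = r"""
import os, sys, time
progress_file = sys.argv[1]
time.sleep(1.0)  # pretend to train for a while
with open(progress_file, "a") as f:
    f.write("done rank %s\n" % os.environ.get("RANK", "?"))
"""


def run_group(num_workers, progress_file, kill_rank=None):
    procs = []
    for rank in range(num_workers):
        env = dict(os.environ, RANK=str(rank))
        procs.append(subprocess.Popen([sys.executable, "-c", WORKER_CODE, progress_file], env=env))
    if kill_rank is not None:
        procs[kill_rank].kill()  # emulate one proc dying (e.g. from a CUDA error)
    return [p.wait() for p in procs]


def test_restart_whole_group_after_crash():
    with tempfile.TemporaryDirectory() as tmp:
        progress_file = os.path.join(tmp, "progress.txt")
        rets = run_group(4, progress_file, kill_rank=2)
        assert any(r != 0 for r in rets)  # first attempt: one worker "crashed"
        rets = run_group(4, progress_file)  # supervisor reaction: restart the whole group
        assert all(r == 0 for r in rets)
        with open(progress_file) as f:
            assert len(f.read().splitlines()) >= 4  # all ranks of the second attempt finished


if __name__ == "__main__":
    test_restart_whole_group_after_crash()
```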

albertz commented on Apr 25 '25, 08:04

Btw, I'm pretty sure this is a common problem for all large-model training; I have often read about it. I have also read that people have developed solutions which handle this automatically, i.e. when some nodes in a distributed training fail from time to time, the whole training does not crash but recovers automatically, maybe even using fallback nodes, etc. So it might make sense to take a look at how other people have solved this.

albertz commented on Apr 25 '25, 08:04