Hang in training (often with multi-GPU training)
Distributed training, single node, 4 GPUs.
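For reference, that is 4 ranks on a single node, one trainer proc per GPU, communicating via torch.distributed. The actual job goes through Slurm and the i6 setup; purely as an illustration of the topology, the launch corresponds to something like:

% torchrun --standalone --nproc_per_node=4 rnn.py returnn.config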
...
ep 33 train, step 10, ctc_4 2.259, ctc_8 1.880, ctc 1.840, num_seqs 10, max_size:time 237360, max_size:out-spatial 52, mem_usage:cuda:1 6.6GB, 0.600 sec/step
ep 33 train, step 11, ctc_4 2.163, ctc_8 1.883, ctc 1.840, num_seqs 9, max_size:time 263648, max_size:out-spatial 57, mem_usage:cuda:3 6.6GB, 0.604 sec/step
ep 33 train, step 11, ctc_4 2.344, ctc_8 1.988, ctc 1.912, num_seqs 9, max_size:time 261801, max_size:out-spatial 55, mem_usage:cuda:2 6.5GB, 0.633 sec/step
ep 33 train, step 11, ctc_4 2.879, ctc_8 2.534, ctc 2.524, num_seqs 9, max_size:time 263648, max_size:out-spatial 54, mem_usage:cuda:0 6.7GB, 0.690 sec/step
ep 33 train, step 11, ctc_4 2.072, ctc_8 1.608, ctc 1.537, num_seqs 9, max_size:time 262065, max_size:out-spatial 55, mem_usage:cuda:1 6.6GB, 0.667 sec/step
ep 33 train, step 12, ctc_4 2.517, ctc_8 2.150, ctc 2.065, num_seqs 10, max_size:time 238480, max_size:out-spatial 55, mem_usage:cuda:1 6.6GB, 0.601 sec/step
ep 33 train, step 12, ctc_4 2.652, ctc_8 2.407, ctc 2.361, num_seqs 10, max_size:time 240000, max_size:out-spatial 59, mem_usage:cuda:0 6.7GB, 0.679 sec/step
ep 33 train, step 12, ctc_4 2.592, ctc_8 2.339, ctc 2.295, num_seqs 11, max_size:time 214632, max_size:out-spatial 59, mem_usage:cuda:2 6.5GB, 0.711 sec/step
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 132166299779072)>, proc 11842.
Thread <Thread(QueueFeederThread, started daemon 132146790385216)>:
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 128575112818688)>, proc 11843.
Thread <Thread(QueueFeederThread, started daemon 128555719587392)>:
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 134816991117312)>, proc 11844.
...
Send signal SIGINT to pid 13385/'devtrain worker proc 2/4'
Send signal SIGINT to pid 15767/'NonDaemonicSpawnProcess-17'
Send signal SIGINT to pid 12108/'train worker proc 2/4'
Send signal SIGINT to pid 12105/'train worker proc 3/4'
Send signal SIGINT to pid 13012/'devtrain worker proc 2/4'
Send signal SIGINT to pid 13013/'devtrain worker proc 3/4'
Send signal SIGINT to pid 12899/'dev worker proc 2/4'
Send signal SIGINT to pid 12900/'dev worker proc 3/4'
Send signal SIGINT to pid 13014/'devtrain worker proc 4/4'
Send signal SIGINT to pid 12100/'train worker proc 1/4'
Send signal SIGINT to pid 12101/'train worker proc 2/4'
Send signal SIGINT to pid 12109/'train worker proc 4/4'
Send signal SIGINT to pid 13011/'devtrain worker proc 1/4'
Send signal SIGINT to pid 12898/'dev worker proc 1/4'
Send signal SIGINT to pid 14968/'NonDaemonicSpawnProcess-16'
Send signal SIGINT to pid 12901/'dev worker proc 4/4'
Send signal SIGINT to pid 15768/'NonDaemonicSpawnProcess-17'
[2024-06-28 00:32:35,464] INFO: Run time: 7:10:22 CPU: 0.60% RSS: 13.41GB VMS: 343.25GB
...
And then it hangs.
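Side note: the 1800000 ms in the Gloo recv timeout above is simply torch.distributed's default timeout of 30 minutes. A shorter value would at least make a broken rank fail faster instead of blocking the collective for half an hour; minimal sketch (illustrative only, not the actual RETURNN setup):

import datetime
import torch.distributed as dist

# Illustrative sketch, not the actual RETURNN code. Assumes the usual env:// rendezvous
# (MASTER_ADDR/RANK/WORLD_SIZE set by the launcher). The default timeout is
# datetime.timedelta(minutes=30) == 1800000 ms, which matches the Gloo error above.
dist.init_process_group(backend="gloo", timeout=datetime.timedelta(minutes=5))

Process tree of the hung job: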
% ps a --forest -u $(whoami) -o pid,comm
PID COMMAND
35776 sshd
35777 \_ zsh
35819 \_ ps
11806 slurm_script
11811 \_ python3.11
11824 \_ python3.11
11842 \_ python3.11 <defunct>
11843 \_ python3.11 <defunct>
11986 | \_ python3.11 <defunct>
11844 \_ python3.11
11983 | \_ python3.11 <defunct>
11845 \_ python3.11
11982 \_ python3.11
11984 \_ watch memory
12104 \_ MPD worker 0
12107 \_ MPD worker 1
12111 \_ MPD worker 2
12114 \_ MPD worker 3
13134 \_ python3.11
13146 \_ MPD worker 0
13147 \_ MPD worker 1
13148 \_ MPD worker 2
13149 \_ MPD worker 3
13388 \_ MPD worker 0
13389 \_ MPD worker 1
13390 \_ MPD worker 2
13391 \_ MPD worker 3
13679 \_ TDL worker 0
13920 | \_ MPD worker 0
14022 | \_ MPD worker 1
14131 | \_ MPD worker 2
14241 | \_ MPD worker 3
14967 \_ TDL worker 0
15213 | \_ MPD worker 0
15317 | \_ MPD worker 1
15462 | \_ MPD worker 2
15555 | \_ MPD worker 3
15766 \_ TDL worker 0
16005 \_ MPD worker 0
16110 \_ MPD worker 1
16210 \_ MPD worker 2
16309 \_ MPD worker 3
And a stack trace of one of the remaining (non-defunct) procs:
% py-spy dump -p 11845
Process 11845: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -u /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.lzGIgU9Oy8je/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/[email protected]/3.11.2_1/bin/python3.11)
Thread 11845 (idle): "MainThread"
backward (torch/autograd/__init__.py:251)
backward (torch/_tensor.py:492)
train_epoch (returnn/returnn/torch/engine.py:421)
train (returnn/returnn/torch/engine.py:236)
execute_main_task (returnn/returnn/__main__.py:542)
main (returnn/returnn/__main__.py:736)
<module> (returnn/rnn.py:11)
Thread 13782 (idle): "QueueFeederThread"
wait (threading.py:320)
_feed (multiprocessing/queues.py:231)
run (threading.py:975)
_bootstrap_inner (threading.py:1038)
_bootstrap (threading.py:995)
Thread 14735 (idle)
Thread 14734 (idle)
Thread 14737 (active)
Thread 14736 (idle)
Thread 15070 (idle): "QueueFeederThread"
wait (threading.py:320)
_feed (multiprocessing/queues.py:231)
run (threading.py:975)
_bootstrap_inner (threading.py:1038)
_bootstrap (threading.py:995)
Thread 15868 (idle): "QueueFeederThread"
wait (threading.py:320)
_feed (multiprocessing/queues.py:231)
run (threading.py:975)
_bootstrap_inner (threading.py:1038)
_bootstrap (threading.py:995)
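So the main thread is blocked somewhere inside the backward call, but the pure-Python stack does not show where in libtorch/Gloo it is actually stuck. To also get the native frames, something like this could be tried (assuming py-spy was built with native unwinding support; otherwise gdb):

% py-spy dump --native -p 11845
% gdb -p 11845 -batch -ex 'thread apply all bt'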
dmesg:
...
[ +0.000127] nvidia 0000:82:00.0: AER: can't recover (no error_detected callback)
[ +0.000002] snd_hda_intel 0000:82:00.1: AER: can't recover (no error_detected callback)
[ +0.000016] pcieport 0000:80:03.0: AER: device recovery failed
[ +0.000002] pcieport 0000:80:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:80:03.0
[ +0.000020] pcieport 0000:80:03.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[ +0.000184] pcieport 0000:80:03.0: device [8086:6f08] error status/mask=00004020/00000000
[ +0.000156] pcieport 0000:80:03.0: [ 5] SDES
[ +0.000148] pcieport 0000:80:03.0: [14] CmpltTO (First)
[ +0.000153] nvidia 0000:82:00.0: AER: can't recover (no error_detected callback)
[ +0.000001] snd_hda_intel 0000:82:00.1: AER: can't recover (no error_detected callback)
[ +0.130524] NVRM: GPU at PCI:0000:82:00: GPU-0c6d20ff-f6a7-957f-af96-ffdf1b880057
[ +0.000012] NVRM: GPU Board Serial Number: 0321217096330
[ +0.000003] NVRM: Xid (PCI:0000:82:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ +0.000006] NVRM: GPU 0000:82:00.0: GPU has fallen off the bus.
[ +0.000003] NVRM: GPU 0000:82:00.0: GPU serial number is 0321217096330.
[ +0.000013] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ +0.031970] pcieport 0000:80:03.0: AER: Root Port link has been reset (0)
[ +0.000032] pcieport 0000:80:03.0: AER: device recovery failed
[Jun28 00:03] pcieport 0000:80:03.0: AER: Uncorrected (Fatal) error received: 0000:80:03.0
[ +0.000013] pcieport 0000:80:03.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[ +0.000292] pcieport 0000:80:03.0: device [8086:6f08] error status/mask=00004020/00000000
[ +0.000237] pcieport 0000:80:03.0: [ 5] SDES
[ +0.000240] pcieport 0000:80:03.0: [14] CmpltTO (First)
[ +0.000245] nvidia 0000:82:00.0: AER: can't recover (no error_detected callback)
[ +0.000002] snd_hda_intel 0000:82:00.1: AER: can't recover (no error_detected callback)
[ +0.149679] pcieport 0000:80:03.0: AER: Root Port link has been reset (0)
[ +0.000031] pcieport 0000:80:03.0: AER: device recovery failed
So this instance might just be a GPU hardware issue (Xid 79, GPU has fallen off the bus).
Maybe related:
- #314, although that was Horovod.
- #1520, some error (not a timeout as here), but then also a hang.
- #1496, some error, but then also a hang, although that hang was later attributed to another problem (https://github.com/rwth-i6/returnn/issues/1497).
- https://github.com/rwth-i6/sisyphus/issues/176 and https://github.com/rwth-i6/sisyphus/pull/177
Sometimes it also looks like this:
...
ep 28 train, step 56, ctc_4 2.616, ctc_8 2.268, ctc 2.221, num_seqs 8, max_size:time 278344, max_size:out-spatial 67, mem_usage:cuda:0 6.3GB, 0.658 sec/step
ep 28 train, step 56, ctc_4 2.049, ctc_8 1.746, ctc 1.688, num_seqs 8, max_size:time 276496, max_size:out-spatial 62, mem_usage:cuda:2 6.3GB, 0.678 sec/step
ep 28 train, step 57, ctc_4 2.239, ctc_8 1.990, ctc 1.961, num_seqs 8, max_size:time 278959, max_size:out-spatial 61, mem_usage:cuda:0 6.3GB, 0.653 sec/step
ep 28 train, step 57, ctc_4 2.137, ctc_8 1.780, ctc 1.708, num_seqs 8, max_size:time 280104, max_size:out-spatial 60, mem_usage:cuda:3 6.3GB, 0.674 sec/step
ep 28 train, step 57, ctc_4 2.338, ctc_8 1.937, ctc 1.926, num_seqs 9, max_size:time 252480, max_size:out-spatial 55, mem_usage:cuda:1 6.3GB, 0.693 sec/step
ep 28 train, step 57, ctc_4 3.121, ctc_8 2.822, ctc 2.807, num_seqs 8, max_size:time 276760, max_size:out-spatial 64, mem_usage:cuda:2 6.3GB, 0.675 sec/step
ep 28 train, step 58, ctc_4 2.397, ctc_8 2.037, ctc 1.967, num_seqs 9, max_size:time 255120, max_size:out-spatial 65, mem_usage:cuda:3 6.3GB, 0.631 sec/step
ep 28 train, step 58, ctc_4 2.598, ctc_8 2.242, ctc 2.165, num_seqs 8, max_size:time 279224, max_size:out-spatial 56, mem_usage:cuda:0 6.3GB, 0.657 sec/step
ep 28 train, step 58, ctc_4 2.433, ctc_8 2.155, ctc 2.129, num_seqs 10, max_size:time 228024, max_size:out-spatial 63, mem_usage:cuda:1 6.3GB, 0.628 sec/step
MEMORY: sub proc TDL worker 0(5599) increased RSS: rss=524.3MB pss=372.6MB uss=356.5MB shared=167.8MB
MEMORY: sub proc TDL worker 0(5603) increased RSS: rss=454.3MB pss=302.6MB uss=286.5MB shared=167.7MB
MEMORY: sub proc TDL worker 0(5600) increased RSS: rss=523.1MB pss=371.6MB uss=355.5MB shared=167.6MB
MEMORY: total (main 3853, 2024-06-28, 17:46:24, 21 procs): pss=6.3GB uss=6.0GB
MEMORY: total (main 3850, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.4GB
MEMORY: total (main 3851, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.3GB
MEMORY: sub proc TDL worker 0(5602) increased RSS: rss=542.4MB pss=390.7MB uss=374.6MB shared=167.7MB
MEMORY: total (main 3852, 2024-06-28, 17:46:24, 21 procs): pss=6.4GB uss=6.1GB
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 130292506959872)>, proc 3852.
...
Send signal SIGINT to pid 4123/'train worker proc 4/4'
Send signal SIGINT to pid 4119/'train worker proc 3/4'
Send signal SIGINT to pid 5063/'devtrain worker proc 1/4'
Send signal SIGINT to pid 5064/'devtrain worker proc 2/4'
Send signal SIGINT to pid 5065/'devtrain worker proc 3/4'
Send signal SIGINT to pid 5066/'devtrain worker proc 4/4'
Send signal SIGINT to pid 5602/'NonDaemonicSpawnProcess-15'
Send signal SIGINT to pid 4604/'dev worker proc 2/4'
Send signal SIGINT to pid 4611/'dev worker proc 4/4'
Send signal SIGINT to pid 4607/'dev worker proc 3/4'
Send signal SIGINT to pid 4601/'dev worker proc 1/4'
Send signal SIGINT to pid 4114/'train worker proc 1/4'
[2024-06-28 17:46:56,408] INFO: Run time: 0:03:16 CPU: 1.00% RSS: 21.22GB VMS: 733.07GB
And then it hangs again.
Procs:
zeyer@cn-252 ~ % ps a --forest -u $(whoami) -o pid,comm
PID COMMAND
6791 sshd
6792 \_ zsh
6810 \_ ps
3790 slurm_script
3804 \_ python3.11
3832 \_ python3.11
3850 \_ python3.11
3989 | \_ python3.11
3995 | \_ watch memory
4110 | \_ MPD worker 0
4111 | \_ MPD worker 1
4115 | \_ MPD worker 2
4121 | \_ MPD worker 3
4589 | \_ python3.11
4600 | \_ MPD worker 0
4603 | \_ MPD worker 1
4608 | \_ MPD worker 2
4612 | \_ MPD worker 3
5057 | \_ MPD worker 0
5059 | \_ MPD worker 1
5061 | \_ MPD worker 2
5062 | \_ MPD worker 3
5603 | \_ TDL worker 0
5841 | \_ MPD worker 0
5944 | \_ MPD worker 1
6053 | \_ MPD worker 2
6159 | \_ MPD worker 3
3851 \_ python3.11
3991 | \_ python3.11
3993 | \_ watch memory
4112 | \_ MPD worker 0
4116 | \_ MPD worker 1
4120 | \_ MPD worker 2
4124 | \_ MPD worker 3
4577 | \_ python3.11
4602 | \_ MPD worker 0
4606 | \_ MPD worker 1
4609 | \_ MPD worker 2
4614 | \_ MPD worker 3
5051 | \_ MPD worker 0
5053 | \_ MPD worker 1
5055 | \_ MPD worker 2
5056 | \_ MPD worker 3
5600 | \_ TDL worker 0
5842 | \_ MPD worker 0
5947 | \_ MPD worker 1
6055 | \_ MPD worker 2
6163 | \_ MPD worker 3
3852 \_ python3.11 <defunct>
3853 \_ python3.11
3988 \_ python3.11
3992 \_ watch memory
4113 \_ MPD worker 0
4118 \_ MPD worker 1
4122 \_ MPD worker 2
4125 \_ MPD worker 3
4583 \_ python3.11
4599 \_ MPD worker 0
4605 \_ MPD worker 1
4610 \_ MPD worker 2
4613 \_ MPD worker 3
5052 \_ MPD worker 0
5054 \_ MPD worker 1
5058 \_ MPD worker 2
5060 \_ MPD worker 3
5599 \_ TDL worker 0
5840 \_ MPD worker 0
5945 \_ MPD worker 1
6049 \_ MPD worker 2
6157 \_ MPD worker 3
Those procs just hang. E.g. py-spy:
% py-spy dump -p 3850
Process 3850: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -u /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.ns6wGzNHZ8zI/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/[email protected]/3.11.2_1/bin/python3.11)
^C
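py-spy itself gets stuck here (hence the ^C), presumably because the target proc is blocked in an uninterruptible call into the CUDA driver after the error. Some checks that do not require stopping the proc (untested here, just ideas):

% ps -o pid,stat,wchan:32 -p 3850    # "D" state + wait channel would point at a kernel/driver wait
% py-spy dump --nonblocking -p 3850  # sample without pausing the target (if this py-spy version supports it)
% sudo cat /proc/3850/stack          # kernel-side stack of the stuck proc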