Hang in training (often with multi-GPU training)
Distributed training, single node, 4 GPUs.
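For reference, that is 4 ranks on a single node, one trainer proc per GPU, communicating via torch.distributed. The actual job goes through Slurm and the i6 setup; purely as an illustration of the topology, the launch corresponds to something like:

% torchrun --standalone --nproc_per_node=4 rnn.py returnn.config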
...
ep 33 train, step 10, ctc_4 2.259, ctc_8 1.880, ctc 1.840, num_seqs 10, max_size:time 237360, max_size:out-spatial 52, mem_usage:cuda:1 6.6GB, 0.600 sec/step
ep 33 train, step 11, ctc_4 2.163, ctc_8 1.883, ctc 1.840, num_seqs 9, max_size:time 263648, max_size:out-spatial 57, mem_usage:cuda:3 6.6GB, 0.604 sec/step
ep 33 train, step 11, ctc_4 2.344, ctc_8 1.988, ctc 1.912, num_seqs 9, max_size:time 261801, max_size:out-spatial 55, mem_usage:cuda:2 6.5GB, 0.633 sec/step
ep 33 train, step 11, ctc_4 2.879, ctc_8 2.534, ctc 2.524, num_seqs 9, max_size:time 263648, max_size:out-spatial 54, mem_usage:cuda:0 6.7GB, 0.690 sec/step
ep 33 train, step 11, ctc_4 2.072, ctc_8 1.608, ctc 1.537, num_seqs 9, max_size:time 262065, max_size:out-spatial 55, mem_usage:cuda:1 6.6GB, 0.667 sec/step
ep 33 train, step 12, ctc_4 2.517, ctc_8 2.150, ctc 2.065, num_seqs 10, max_size:time 238480, max_size:out-spatial 55, mem_usage:cuda:1 6.6GB, 0.601 sec/step
ep 33 train, step 12, ctc_4 2.652, ctc_8 2.407, ctc 2.361, num_seqs 10, max_size:time 240000, max_size:out-spatial 59, mem_usage:cuda:0 6.7GB, 0.679 sec/step
ep 33 train, step 12, ctc_4 2.592, ctc_8 2.339, ctc 2.295, num_seqs 11, max_size:time 214632, max_size:out-spatial 59, mem_usage:cuda:2 6.5GB, 0.711 sec/step
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 132166299779072)>, proc 11842.
Thread <Thread(QueueFeederThread, started daemon 132146790385216)>:
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 128575112818688)>, proc 11843.
Thread <Thread(QueueFeederThread, started daemon 128555719587392)>:
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 134816991117312)>, proc 11844.
...
Send signal SIGINT to pid 13385/'devtrain worker proc 2/4'
Send signal SIGINT to pid 15767/'NonDaemonicSpawnProcess-17'
Send signal SIGINT to pid 12108/'train worker proc 2/4'
Send signal SIGINT to pid 12105/'train worker proc 3/4'
Send signal SIGINT to pid 13012/'devtrain worker proc 2/4'
Send signal SIGINT to pid 13013/'devtrain worker proc 3/4'
Send signal SIGINT to pid 12899/'dev worker proc 2/4'
Send signal SIGINT to pid 12900/'dev worker proc 3/4'
Send signal SIGINT to pid 13014/'devtrain worker proc 4/4'
Send signal SIGINT to pid 12100/'train worker proc 1/4'
Send signal SIGINT to pid 12101/'train worker proc 2/4'
Send signal SIGINT to pid 12109/'train worker proc 4/4'
Send signal SIGINT to pid 13011/'devtrain worker proc 1/4'
Send signal SIGINT to pid 12898/'dev worker proc 1/4'
Send signal SIGINT to pid 14968/'NonDaemonicSpawnProcess-16'
Send signal SIGINT to pid 12901/'dev worker proc 4/4'
Send signal SIGINT to pid 15768/'NonDaemonicSpawnProcess-17'
[2024-06-28 00:32:35,464] INFO: Run time: 7:10:22 CPU: 0.60% RSS: 13.41GB VMS: 343.25GB
...
And then it hangs.
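Side note: the 1800000 ms in the Gloo recv timeout above is simply torch.distributed's default timeout of 30 minutes. A shorter value would at least make a broken rank fail faster instead of blocking the collective for half an hour; minimal sketch (illustrative only, not the actual RETURNN setup):

import datetime
import torch.distributed as dist

# Illustrative sketch, not the actual RETURNN code. Assumes the usual env:// rendezvous
# (MASTER_ADDR/RANK/WORLD_SIZE set by the launcher). The default timeout is
# datetime.timedelta(minutes=30) == 1800000 ms, which matches the Gloo error above.
dist.init_process_group(backend="gloo", timeout=datetime.timedelta(minutes=5))

Process tree of the hung job: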
% ps a --forest -u $(whoami) -o pid,comm
PID COMMAND
35776 sshd
35777 \_ zsh
35819 \_ ps
11806 slurm_script
11811 \_ python3.11
11824 \_ python3.11
11842 \_ python3.11 <defunct>
11843 \_ python3.11 <defunct>
11986 | \_ python3.11 <defunct>
11844 \_ python3.11
11983 | \_ python3.11 <defunct>
11845 \_ python3.11
11982 \_ python3.11
11984 \_ watch memory
12104 \_ MPD worker 0
12107 \_ MPD worker 1
12111 \_ MPD worker 2
12114 \_ MPD worker 3
13134 \_ python3.11
13146 \_ MPD worker 0
13147 \_ MPD worker 1
13148 \_ MPD worker 2
13149 \_ MPD worker 3
13388 \_ MPD worker 0
13389 \_ MPD worker 1
13390 \_ MPD worker 2
13391 \_ MPD worker 3
13679 \_ TDL worker 0
13920 | \_ MPD worker 0
14022 | \_ MPD worker 1
14131 | \_ MPD worker 2
14241 | \_ MPD worker 3
14967 \_ TDL worker 0
15213 | \_ MPD worker 0
15317 | \_ MPD worker 1
15462 | \_ MPD worker 2
15555 | \_ MPD worker 3
15766 \_ TDL worker 0
16005 \_ MPD worker 0
16110 \_ MPD worker 1
16210 \_ MPD worker 2
16309 \_ MPD worker 3
And a stack trace of one of the remaining (non-defunct) procs:
% py-spy dump -p 11845
Process 11845: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -u /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.lzGIgU9Oy8je/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/[email protected]/3.11.2_1/bin/python3.11)
Thread 11845 (idle): "MainThread"
backward (torch/autograd/__init__.py:251)
backward (torch/_tensor.py:492)
train_epoch (returnn/returnn/torch/engine.py:421)
train (returnn/returnn/torch/engine.py:236)
execute_main_task (returnn/returnn/__main__.py:542)
main (returnn/returnn/__main__.py:736)
<module> (returnn/rnn.py:11)
Thread 13782 (idle): "QueueFeederThread"
wait (threading.py:320)
_feed (multiprocessing/queues.py:231)
run (threading.py:975)
_bootstrap_inner (threading.py:1038)
_bootstrap (threading.py:995)
Thread 14735 (idle)
Thread 14734 (idle)
Thread 14737 (active)
Thread 14736 (idle)
Thread 15070 (idle): "QueueFeederThread"
wait (threading.py:320)
_feed (multiprocessing/queues.py:231)
run (threading.py:975)
_bootstrap_inner (threading.py:1038)
_bootstrap (threading.py:995)
Thread 15868 (idle): "QueueFeederThread"
wait (threading.py:320)
_feed (multiprocessing/queues.py:231)
run (threading.py:975)
_bootstrap_inner (threading.py:1038)
_bootstrap (threading.py:995)
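So the main thread is blocked somewhere inside the backward call, but the pure-Python stack does not show where in libtorch/Gloo it is actually stuck. To also get the native frames, something like this could be tried (assuming py-spy was built with native unwinding support; otherwise gdb):

% py-spy dump --native -p 11845
% gdb -p 11845 -batch -ex 'thread apply all bt'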
dmesg:
...
[ +0.000127] nvidia 0000:82:00.0: AER: can't recover (no error_detected callback)
[ +0.000002] snd_hda_intel 0000:82:00.1: AER: can't recover (no error_detected callback)
[ +0.000016] pcieport 0000:80:03.0: AER: device recovery failed
[ +0.000002] pcieport 0000:80:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:80:03.0
[ +0.000020] pcieport 0000:80:03.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[ +0.000184] pcieport 0000:80:03.0: device [8086:6f08] error status/mask=00004020/00000000
[ +0.000156] pcieport 0000:80:03.0: [ 5] SDES
[ +0.000148] pcieport 0000:80:03.0: [14] CmpltTO (First)
[ +0.000153] nvidia 0000:82:00.0: AER: can't recover (no error_detected callback)
[ +0.000001] snd_hda_intel 0000:82:00.1: AER: can't recover (no error_detected callback)
[ +0.130524] NVRM: GPU at PCI:0000:82:00: GPU-0c6d20ff-f6a7-957f-af96-ffdf1b880057
[ +0.000012] NVRM: GPU Board Serial Number: 0321217096330
[ +0.000003] NVRM: Xid (PCI:0000:82:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ +0.000006] NVRM: GPU 0000:82:00.0: GPU has fallen off the bus.
[ +0.000003] NVRM: GPU 0000:82:00.0: GPU serial number is 0321217096330.
[ +0.000013] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ +0.031970] pcieport 0000:80:03.0: AER: Root Port link has been reset (0)
[ +0.000032] pcieport 0000:80:03.0: AER: device recovery failed
[Jun28 00:03] pcieport 0000:80:03.0: AER: Uncorrected (Fatal) error received: 0000:80:03.0
[ +0.000013] pcieport 0000:80:03.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[ +0.000292] pcieport 0000:80:03.0: device [8086:6f08] error status/mask=00004020/00000000
[ +0.000237] pcieport 0000:80:03.0: [ 5] SDES
[ +0.000240] pcieport 0000:80:03.0: [14] CmpltTO (First)
[ +0.000245] nvidia 0000:82:00.0: AER: can't recover (no error_detected callback)
[ +0.000002] snd_hda_intel 0000:82:00.1: AER: can't recover (no error_detected callback)
[ +0.149679] pcieport 0000:80:03.0: AER: Root Port link has been reset (0)
[ +0.000031] pcieport 0000:80:03.0: AER: device recovery failed
So this instance might just be a GPU hardware issue (Xid 79, GPU has fallen off the bus).
Maybe related:
- #314, although that was Horovod.
- #1520, some error (not a timeout as here), but then also a hang.
- #1496, some error, but then also a hang, although that hang was later attributed to another problem (https://github.com/rwth-i6/returnn/issues/1497).
- https://github.com/rwth-i6/sisyphus/issues/176 and https://github.com/rwth-i6/sisyphus/pull/177
Sometimes it also looks like this:
...
ep 28 train, step 56, ctc_4 2.616, ctc_8 2.268, ctc 2.221, num_seqs 8, max_size:time 278344, max_size:out-spatial 67, mem_usage:cuda:0 6.3GB, 0.658 sec/step
ep 28 train, step 56, ctc_4 2.049, ctc_8 1.746, ctc 1.688, num_seqs 8, max_size:time 276496, max_size:out-spatial 62, mem_usage:cuda:2 6.3GB, 0.678 sec/step
ep 28 train, step 57, ctc_4 2.239, ctc_8 1.990, ctc 1.961, num_seqs 8, max_size:time 278959, max_size:out-spatial 61, mem_usage:cuda:0 6.3GB, 0.653 sec/step
ep 28 train, step 57, ctc_4 2.137, ctc_8 1.780, ctc 1.708, num_seqs 8, max_size:time 280104, max_size:out-spatial 60, mem_usage:cuda:3 6.3GB, 0.674 sec/step
ep 28 train, step 57, ctc_4 2.338, ctc_8 1.937, ctc 1.926, num_seqs 9, max_size:time 252480, max_size:out-spatial 55, mem_usage:cuda:1 6.3GB, 0.693 sec/step
ep 28 train, step 57, ctc_4 3.121, ctc_8 2.822, ctc 2.807, num_seqs 8, max_size:time 276760, max_size:out-spatial 64, mem_usage:cuda:2 6.3GB, 0.675 sec/step
ep 28 train, step 58, ctc_4 2.397, ctc_8 2.037, ctc 1.967, num_seqs 9, max_size:time 255120, max_size:out-spatial 65, mem_usage:cuda:3 6.3GB, 0.631 sec/step
ep 28 train, step 58, ctc_4 2.598, ctc_8 2.242, ctc 2.165, num_seqs 8, max_size:time 279224, max_size:out-spatial 56, mem_usage:cuda:0 6.3GB, 0.657 sec/step
ep 28 train, step 58, ctc_4 2.433, ctc_8 2.155, ctc 2.129, num_seqs 10, max_size:time 228024, max_size:out-spatial 63, mem_usage:cuda:1 6.3GB, 0.628 sec/step
MEMORY: sub proc TDL worker 0(5599) increased RSS: rss=524.3MB pss=372.6MB uss=356.5MB shared=167.8MB
MEMORY: sub proc TDL worker 0(5603) increased RSS: rss=454.3MB pss=302.6MB uss=286.5MB shared=167.7MB
MEMORY: sub proc TDL worker 0(5600) increased RSS: rss=523.1MB pss=371.6MB uss=355.5MB shared=167.6MB
MEMORY: total (main 3853, 2024-06-28, 17:46:24, 21 procs): pss=6.3GB uss=6.0GB
MEMORY: total (main 3850, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.4GB
MEMORY: total (main 3851, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.3GB
MEMORY: sub proc TDL worker 0(5602) increased RSS: rss=542.4MB pss=390.7MB uss=374.6MB shared=167.7MB
MEMORY: total (main 3852, 2024-06-28, 17:46:24, 21 procs): pss=6.4GB uss=6.1GB
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 130292506959872)>, proc 3852.
...
Send signal SIGINT to pid 4123/'train worker proc 4/4'
Send signal SIGINT to pid 4119/'train worker proc 3/4'
Send signal SIGINT to pid 5063/'devtrain worker proc 1/4'
Send signal SIGINT to pid 5064/'devtrain worker proc 2/4'
Send signal SIGINT to pid 5065/'devtrain worker proc 3/4'
Send signal SIGINT to pid 5066/'devtrain worker proc 4/4'
Send signal SIGINT to pid 5602/'NonDaemonicSpawnProcess-15'
Send signal SIGINT to pid 4604/'dev worker proc 2/4'
Send signal SIGINT to pid 4611/'dev worker proc 4/4'
Send signal SIGINT to pid 4607/'dev worker proc 3/4'
Send signal SIGINT to pid 4601/'dev worker proc 1/4'
Send signal SIGINT to pid 4114/'train worker proc 1/4'
[2024-06-28 17:46:56,408] INFO: Run time: 0:03:16 CPU: 1.00% RSS: 21.22GB VMS: 733.07GB
And then it hangs again.
Procs:
zeyer@cn-252 ~ % ps a --forest -u $(whoami) -o pid,comm
PID COMMAND
6791 sshd
6792 \_ zsh
6810 \_ ps
3790 slurm_script
3804 \_ python3.11
3832 \_ python3.11
3850 \_ python3.11
3989 | \_ python3.11
3995 | \_ watch memory
4110 | \_ MPD worker 0
4111 | \_ MPD worker 1
4115 | \_ MPD worker 2
4121 | \_ MPD worker 3
4589 | \_ python3.11
4600 | \_ MPD worker 0
4603 | \_ MPD worker 1
4608 | \_ MPD worker 2
4612 | \_ MPD worker 3
5057 | \_ MPD worker 0
5059 | \_ MPD worker 1
5061 | \_ MPD worker 2
5062 | \_ MPD worker 3
5603 | \_ TDL worker 0
5841 | \_ MPD worker 0
5944 | \_ MPD worker 1
6053 | \_ MPD worker 2
6159 | \_ MPD worker 3
3851 \_ python3.11
3991 | \_ python3.11
3993 | \_ watch memory
4112 | \_ MPD worker 0
4116 | \_ MPD worker 1
4120 | \_ MPD worker 2
4124 | \_ MPD worker 3
4577 | \_ python3.11
4602 | \_ MPD worker 0
4606 | \_ MPD worker 1
4609 | \_ MPD worker 2
4614 | \_ MPD worker 3
5051 | \_ MPD worker 0
5053 | \_ MPD worker 1
5055 | \_ MPD worker 2
5056 | \_ MPD worker 3
5600 | \_ TDL worker 0
5842 | \_ MPD worker 0
5947 | \_ MPD worker 1
6055 | \_ MPD worker 2
6163 | \_ MPD worker 3
3852 \_ python3.11 <defunct>
3853 \_ python3.11
3988 \_ python3.11
3992 \_ watch memory
4113 \_ MPD worker 0
4118 \_ MPD worker 1
4122 \_ MPD worker 2
4125 \_ MPD worker 3
4583 \_ python3.11
4599 \_ MPD worker 0
4605 \_ MPD worker 1
4610 \_ MPD worker 2
4613 \_ MPD worker 3
5052 \_ MPD worker 0
5054 \_ MPD worker 1
5058 \_ MPD worker 2
5060 \_ MPD worker 3
5599 \_ TDL worker 0
5840 \_ MPD worker 0
5945 \_ MPD worker 1
6049 \_ MPD worker 2
6157 \_ MPD worker 3
Those procs just hang. E.g. py-spy:
% py-spy dump -p 3850
Process 3850: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -u /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.ns6wGzNHZ8zI/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/[email protected]/3.11.2_1/bin/python3.11)
^C
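py-spy itself gets stuck here (hence the ^C), presumably because the target proc is blocked in an uninterruptible call into the CUDA driver after the error. Some checks that do not require stopping the proc (untested here, just ideas):

% ps -o pid,stat,wchan:32 -p 3850    # "D" state + wait channel would point at a kernel/driver wait
% py-spy dump --nonblocking -p 3850  # sample without pausing the target (if this py-spy version supports it)
% sudo cat /proc/3850/stack          # kernel-side stack of the stuck proc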