Horovod single-node multi-GPU training hangs on crash
When the training crashes (e.g. GPU out-of-memory, inf/nan in the loss, or whatever), it often happens that the process (SGE job, Slurm job) just hangs and does not exit.
Commit a31f683b006fac8328b1eccfa80d035930254b46 might have improved things, but I am not sure. With that commit, all the procs seem to reach the exit code; I see `Trainer not finalized, quitting. (pid ...)` four times in the log (for 4 GPUs). However, it still hangs. The last message in the log:
```
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
```
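Since the procs seem to reach the exit code but then never terminate, one generic workaround (just a sketch, not RETURNN's actual code; `run_training` here is a hypothetical entry point) would be to hard-exit via `os._exit()` on an unhandled exception, so that any cleanup which might block (e.g. Horovod/MPI finalization waiting on an already-dead peer) is skipped:

```python
import os
import sys
import traceback


def run_training_guarded(run_training):
    """Call the (hypothetical) training entry point `run_training`.
    On an unhandled exception, print the traceback and hard-exit.
    os._exit() skips atexit handlers and other interpreter cleanup,
    which is exactly the code that can block once a peer proc died."""
    try:
        run_training()
    except BaseException:
        traceback.print_exc()
        sys.stdout.flush()
        sys.stderr.flush()
        os._exit(1)  # bypasses atexit handlers and Python cleanup
```

The top-level script would then call `run_training_guarded(main)` instead of `main()`. The downside is that normal cleanup (log flushing, profiling output, etc.) is skipped as well, so this should only trigger on the crash path.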
When I log in to the node, I also see that the procs are still running (via `pstree -p`):
```
├─sge_execd(2030)─┬─load_sensor.sh(2268)
│                 ├─sge_shepherd(15966)───python3(16362)───bash(16529)─┬─mpirun(16554)───{mpirun}(16738)
│                 │                                                    ├─python3(16530)
│                 │                                                    ├─python3(16531)
│                 │                                                    └─python3(16555)─┬─{python3}(16708)
│                 │                                                                     └─{python3}(16710)
│                 ├─{sge_execd}(2031)
│                 ├─{sge_execd}(2032)
│                 ├─{sge_execd}(2033)
│                 └─{sge_execd}(2034)
```
I assume they hang at exit, maybe in the atexit handler of Horovod or so. Once I send SIGUSR1 to them, they quit immediately.
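To see where exactly they hang, one option (a sketch, assuming nothing else in the process already uses SIGUSR1) is to install a SIGUSR1 handler early in the main script which dumps the Python stack trace of every thread and then force-exits:

```python
import os
import signal
import sys
import traceback


def _dump_threads_and_exit(signum, frame):
    """On SIGUSR1: print the stack of every Python thread, then hard-exit.
    os._exit() bypasses atexit handlers, which is presumably what
    unblocks the hanging procs here."""
    for thread_id, stack in sys._current_frames().items():
        print("--- thread 0x%x ---" % thread_id, file=sys.stderr)
        traceback.print_stack(stack, file=sys.stderr)
    sys.stderr.flush()
    os._exit(1)


# Install as early as possible, e.g. at the top of the main script.
signal.signal(signal.SIGUSR1, _dump_threads_and_exit)
```

The standard library's `faulthandler.register(signal.SIGUSR1)` is a simpler alternative that dumps all thread tracebacks without exiting, if one only wants to inspect the hang rather than break out of it.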
Maybe OpenMPI #3380 is related to this?