Horovod single-node multi-GPU training hangs on crash

Open · albertz opened this issue · 2 comments

When the training crashes (e.g. GPU out-of-memory, an inf/nan loss, or whatever), it often happens that the process (SGE job, Slurm job) just hangs and does not exit.
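
For context, one common way such a hang can arise (a minimal sketch, not RETURNN code; assumes horovod.tensorflow in TF2 eager mode, run under mpirun with at least 2 ranks): when one rank dies with an exception, the surviving ranks block forever in their next collective op, so mpirun never sees all processes exit.

    # Minimal sketch of the failure mode (not RETURNN code; assumes
    # horovod.tensorflow, TF2 eager mode, mpirun with >= 2 ranks).
    import horovod.tensorflow as hvd
    import tensorflow as tf

    hvd.init()

    for step in range(100):
        if step == 10 and hvd.rank() == 0:
            # Simulated crash on one rank (e.g. GPU OOM, inf/nan loss).
            raise RuntimeError("simulated crash on rank 0")
        # The surviving ranks block here forever waiting for rank 0,
        # so their processes hang instead of exiting.
        _ = hvd.allreduce(tf.constant(float(step)))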

albertz · Jun 24 '20 09:06

Commit a31f683b006fac8328b1eccfa80d035930254b46 might have improved things, but I am not sure. With that commit, all the procs seem to reach the exit code: I see "Trainer not finalized, quitting. (pid ...)" 4 times in the log (once per GPU, for 4 GPUs). However, it still hangs. The last message in the log:

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------

When I log in on the node, I also see that the procs are still running (via pstree -p):

           ├─sge_execd(2030)─┬─load_sensor.sh(2268)
           │                 ├─sge_shepherd(15966)───python3(16362)───bash(16529)─┬─mpirun(16554)───{mpirun}(16738)
           │                 │                                                    ├─python3(16530)
           │                 │                                                    ├─python3(16531)
           │                 │                                                    └─python3(16555)─┬─{python3}(16708)
           │                 │                                                                     └─{python3}(16710)
           │                 ├─{sge_execd}(2031)
           │                 ├─{sge_execd}(2032)
           │                 ├─{sge_execd}(2033)
           │                 └─{sge_execd}(2034)
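
To see where such hanging procs are stuck, one option is Python's faulthandler module (a sketch; it must be installed in the entry point beforehand; using SIGUSR2 here since SIGUSR1 apparently already has an effect in this setup):

    import faulthandler
    import signal

    # After this, `kill -USR2 <pid>` makes the (hanging) process dump the
    # Python tracebacks of all its threads to stderr, without killing it.
    faulthandler.register(signal.SIGUSR2, all_threads=True)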

I assume they hang at exit. Maybe in the atexit handler of Horovod or so? Once I sent SIGUSR1 to them, they quit immediately.
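
If it really is a hanging atexit handler, one generic workaround (a minimal sketch, not the actual RETURNN fix) is to skip atexit on fatal errors: print the traceback, then leave via os._exit(), which bypasses all atexit handlers, including any Horovod/MPI finalization:

    import os
    import sys

    def crash_hook(exc_type, exc, tb):
        # Print the normal traceback first.
        sys.__excepthook__(exc_type, exc, tb)
        sys.stdout.flush()
        sys.stderr.flush()
        # Hard exit: skips atexit handlers (and thus e.g. MPI_Finalize),
        # so the process cannot hang there; mpirun sees a non-zero exit code.
        os._exit(1)

    sys.excepthook = crash_hook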

Maybe OpenMPI #3380 is related here?

albertz · Jun 29 '20 08:06