When using check_cmd_status, nhc leaks watchdog processes
A simple test case:
nhc.conf:
* || TIMEOUT=120
* || check_cmd_status -t 100 -r 0 /bin/true
Now if nhc is run on a node, the main process returns quickly, but it leaks a background nhc watchdog process which still has a TTY and is running sleep. After the sleep expires, it will trigger the watchdog and it will try to kill a process. In theory, if the machine is sufficiently busy or the timeout is sufficiently long, this can end up killing a process using a reused PID number.
from ps axf right after running nhc:
240166 pts/0 S 0:00 /bin/bash /usr/sbin/nhc
240167 pts/0 S 0:00 \_ sleep 100
This is a bit of an annoyance because it also causes this (because of it holding on to a TTY):
$ time ssh <node> "time nhc"
real 0m0.037s
user 0m0.023s
sys 0m0.005s
real 1m40.472s
user 0m0.026s
sys 0m0.005s
$
The watchdog process is spawned at https://github.com/mej/nhc/blob/master/scripts/lbnl_cmd.nhc#L169, and as far as I can see, there is nothing that tries to kill it after the command runs, and there probably should be.
In case bash version is relevant, this happens on version 4.2.46(2)-release (x86_64-redhat-linux-gnu).
Is there a solution? We have the same problem I did not notice this till I installed slurm 22.05 and now the srun hangs till this process are finished. In older slurm versions it was no problem
We have not seen this on 22.05, to the best of my knowledge, but I will look again. But for what it's worth, going over the code again, I believe I can see exactly what the problem is. In fact, I'm not sure if I overlooked something or just wasn't paying close enough attention, but I don't really know how I expected it to work correctly at all, what with everything being in local variables...
Anyhoo...I will prioritize this fix. The gist of it is that nhcmain_watchdog_timer() needs to use global arrays, not local variables, for keeping track of the NHC/subcommand PIDs and the sleep PIDs, and the trap calls need to be set for all the relevant PIDs.
I would love to get your feedback, either on the b08769bb workaround/fix on the 1.4.4-dev branch or the new 1.5 code on the PR #132 branch! I believe either will fix this problem.
In the absence of further information to the contrary, I believe this to be fixed, so I am closing this Issue. Please feel free to reopen if you're still seeing this behavior!