deepops
Fetching PIDs for timed-out jobs during cleanup sometimes fails to kill processes
Under some circumstances the Slurm epilog fails to clean up processes because of the way it parses the output of nvidia-smi pmon.
From /var/log/slurm/prolog-epilog:
- for i in $(nvidia-smi pmon -c 1 | tail -n+3 | awk '{print $2}' | grep -v -)
- logger -s -t slurm-epilog 'Killing residual GPU process Idx ...'
  <13>Sep 10 15:12:33 slurm-epilog: Killing residual GPU process Idx ...
- kill -9 Idx   <---- this is not a valid PID
  /etc/slurm/epilog.d/50-exclusive-gpu: line 12: kill: Idx: arguments must be process or job IDs
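The regular output is handled correctly, but tail -n+3 only skips the first two lines. If for some reason the output contains one more comment line before the process list, the second header line survives and the pipeline picks up a non-PID value (Idx) from it.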
root@hpc-hostname:~# nvidia-smi pmon -c 1
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -
    1          -     -     -     -     -     -   -
    2          -     -     -     -     -     -   -
    3          -     -     -     -     -     -   -
    4          -     -     -     -     -     -   -
    5          -     -     -     -     -     -   -
    6          -     -     -     -     -     -   -
    7          -     -     -     -     -     -   -
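One possible way to make the parsing independent of the number of header lines is to filter on content instead of line position. The sketch below is only an illustration, not the deepops fix: gpu_pids is a hypothetical helper name, and the sample input (including the extra comment line, PID 12345 and the python command) is made up for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch only: gpu_pids is a hypothetical helper, not part of deepops.
# It reads `nvidia-smi pmon -c 1` output on stdin and prints numeric PIDs only.
gpu_pids() {
    # Skip comment/header lines (first field starts with "#") and keep only
    # rows whose second field is all digits, so "Idx" and "-" are dropped.
    awk '$1 !~ /^#/ && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Made-up sample input with one extra leading comment line.
sample='# extra comment line
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -
    1      12345     C     0     0     -     -   python'

# Prints the only numeric PID in the sample.
printf '%s\n' "$sample" | gpu_pids
```

In the epilog, the loop header could then read for i in $(nvidia-smi pmon -c 1 | gpu_pids), assuming the helper is defined in or sourced by the script. Because the filter checks each row instead of counting header lines, an unexpected extra comment line no longer shifts a header token into the PID list.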