Fetching PIDs of timed-out jobs for cleanup sometimes fails to kill processes

Open · ilya-da opened this issue 5 months ago · 1 comment

Under some circumstances the Slurm epilog fails to clean up processes because of how it parses the output of nvidia-smi pmon.

From /var/log/slurm/prolog-epilog:

  • for i in $(nvidia-smi pmon -c 1 | tail -n+3 | awk '{print $2}' | grep -v -)
  • logger -s -t slurm-epilog 'Killing residual GPU process Idx ...'
    <13>Sep 10 15:12:33 slurm-epilog: Killing residual GPU process Idx ...
  • kill -9 Idx                    <---- this is not a valid PID.
    /etc/slurm/epilog.d/50-exclusive-gpu: line 12: kill: Idx: arguments must be process or job IDs

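With the regular output this works, but if for some reason the output contains one more comment line before the process list, tail -n+3 no longer skips the whole header and the loop picks up a non-PID value (here, Idx from the second header line).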

    root@hpc-hostname:~# nvidia-smi pmon -c 1
    # gpu        pid  type    sm   mem   enc   dec   command
    # Idx          #   C/G     %     %     %     %   name
        0          -     -     -     -     -     -   -
        1          -     -     -     -     -     -   -
        2          -     -     -     -     -     -   -
        3          -     -     -     -     -     -   -
        4          -     -     -     -     -     -   -
        5          -     -     -     -     -     -   -
        6          -     -     -     -     -     -   -
        7          -     -     -     -     -     -   -
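
One possible way to make the loop tolerant of extra header lines is to stop counting lines and accept only purely numeric values from the pid column. The sketch below is an assumption on my side, not the current contents of /etc/slurm/epilog.d/50-exclusive-gpu; extract_gpu_pids is a hypothetical helper name, and it relies only on the column layout shown in the output above.

```shell
#!/usr/bin/env bash
# Sketch only: extract_gpu_pids is a hypothetical helper, not part of DeepOps.
# It reads `nvidia-smi pmon -c 1` output on stdin and prints numeric PIDs only,
# so header lines, "-" placeholders, and stray extra lines are all ignored.
extract_gpu_pids() {
    # Drop lines whose first field starts with "#", keep rows whose second
    # field consists of digits only, then de-duplicate PIDs that appear on
    # more than one GPU.
    awk '$1 !~ /^#/ && $2 ~ /^[0-9]+$/ { print $2 }' | sort -un
}

# Intended use inside the epilog (assumes nvidia-smi is on PATH):
#   for pid in $(nvidia-smi pmon -c 1 | extract_gpu_pids); do
#       logger -s -t slurm-epilog "Killing residual GPU process $pid ..."
#       kill -9 "$pid"
#   done
```

Because the filter checks the value itself rather than its position in the output, it should behave the same whether nvidia-smi prints two header lines or more.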

ilya-da · Sep 14 '24 14:09