
Flux stuck during shutdown, `flux queue status -v` shows many jobs running

Open grondo opened this issue 1 year ago • 3 comments

On tuolumne, flux shutdown was stuck in `flux queue idle`. There are no running jobs known to `job-list`, but `flux queue status -v` shows 80 running jobs:

# flux queue status -v | grep running
80 running jobs
# flux jobs --stats-only
0 running, 
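For anyone who wants to watch for this mismatch, here is a minimal sketch that compares the two counts. It assumes the text formats seen in the output above (`N running jobs` from `flux queue status -v`, and `N running` from `flux jobs --stats-only`); the function names are made up for illustration and are not part of flux-core.

```python
import re


def running_from_queue_status(text: str) -> int:
    """Parse 'N running jobs' from `flux queue status -v` output (format assumed)."""
    m = re.search(r"^\s*(\d+) running jobs\b", text, re.MULTILINE)
    if m is None:
        raise ValueError("no 'N running jobs' line found")
    return int(m.group(1))


def running_from_jobs_stats(text: str) -> int:
    """Parse 'N running' from `flux jobs --stats-only` output (format assumed)."""
    m = re.search(r"(\d+) running\b", text)
    if m is None:
        raise ValueError("no 'N running' field found")
    return int(m.group(1))


def running_counts_agree(queue_status_text: str, jobs_stats_text: str) -> bool:
    """Return True when the job-manager and job-list running counts match."""
    return (running_from_queue_status(queue_status_text)
            == running_from_jobs_stats(jobs_stats_text))
```

The text for each argument could be captured with `subprocess.run([...], capture_output=True, text=True).stdout` on the two commands shown above.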

One thing that I notice in the logs is that nodes were shut down while housekeeping was running.

I have no idea if that is related.

I can't seem to get any more information out of the system, so I'm just going to kill off the `flux queue idle` process and let the system shut down.

grondo, Oct 03 '24 17:10

Example of housekeeping errors:

job-manager.err[0]: housekeeping: tuolumneXXX (rank XXX) fALfUWKpRdZ: No route to host
job-manager.err[0]: housekeeping: tuolumneYYY (rank YYY) fALfUW3WZXm: No route to host
job-manager.err[0]: housekeeping: tuolumneZZZ (rank ZZZ) fALfQAq4Fvo: No route to host

grondo, Oct 03 '24 18:10

Wow, that is really strange. It's almost as if the job manager's running job count is just wrong. Looking through that code, it's hard to see how it could be, at least not without the job-list count also being wrong, since both counts are driven by job events.

The housekeeping errors are probably to be expected and shouldn't be related to the running job count since housekeeping starts when the job transitions to INACTIVE.
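To make that reasoning concrete, here is a toy model, not flux-core's actual code. It assumes a simplified event-to-state mapping (`alloc` to RUN, `finish` to CLEANUP, `clean` to INACTIVE) and assumes RUN and CLEANUP both count as "running"; the class and function names are hypothetical. Two consumers replaying the same event stream agree on the running count, and housekeeping only begins once a job has already left that count.

```python
from dataclasses import dataclass, field

# Simplified, assumed mapping from job event to resulting state (illustration only).
TRANSITIONS = {"alloc": "RUN", "finish": "CLEANUP", "clean": "INACTIVE"}


@dataclass
class EventConsumer:
    states: dict = field(default_factory=dict)        # jobid -> current state
    housekeeping: list = field(default_factory=list)  # jobs handed to housekeeping

    def apply(self, jobid: str, event: str) -> None:
        new_state = TRANSITIONS.get(event)
        if new_state is None:
            return  # event does not change state in this toy model
        self.states[jobid] = new_state
        if new_state == "INACTIVE":
            # Housekeeping starts only after the job is no longer running.
            self.housekeeping.append(jobid)

    @property
    def running(self) -> int:
        return sum(1 for s in self.states.values() if s in ("RUN", "CLEANUP"))


def replay(events, drop=frozenset()):
    """Feed events to a fresh consumer, skipping any (jobid, event) in `drop`."""
    consumer = EventConsumer()
    for jobid, event in events:
        if (jobid, event) not in drop:
            consumer.apply(jobid, event)
    return consumer


events = [("A", "alloc"), ("B", "alloc"), ("A", "finish"),
          ("A", "clean"), ("B", "finish"), ("B", "clean")]

manager_view = replay(events)                      # stands in for job-manager
list_view = replay(events)                         # stands in for job-list
lossy_view = replay(events, drop={("B", "clean")})  # one consumer misses an event
```

In this model the counts can only diverge if one consumer misses or mishandles an event, as `lossy_view` does. A housekeeping failure comes too late to change either count.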

garlick, Oct 03 '24 19:10

> The housekeeping errors are probably to be expected and shouldn't be related to the running job count since housekeeping starts when the job transitions to INACTIVE.

Ok, I only mentioned that since it was the only error I saw in the logs.

grondo, Oct 03 '24 19:10