scheduler has inconsistent allocated resources after Flux crash/restart with housekeeping enabled

Open grondo opened this issue 1 year ago • 0 comments

It is unclear whether this is a bug in flux-core or flux-sched, but I'm opening the issue here because we should probably check whether it is reproducible.

After a broker crash (oom-kill) and restart, Fluxion had many resources allocated for which there were no active associated jobs. The sequence of events appeared to be:

1. A broker crash with running jobs
2. Housekeeping was enabled
3. On restart, running jobs had an exec exception raised ("Failed to create guest ns")
4. When resources were released, housekeeping failed to start on ranks because they hadn't yet come online in the new instance

We may be able to manually reproduce and test this scenario on a test cluster by killing a broker with running jobs, stopping all the compute node brokers, and restarting.
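
To confirm the inconsistency after such a test, we could compare the ranks the scheduler reports as allocated against the ranks actually held by active jobs. The following is only a minimal sketch of that comparison, not anything in flux-core or flux-sched: it assumes the rank sets have already been collected by hand (for example from the output of `flux resource list` and `flux jobs`), and the function name, the job labels, and the assumption that ranks still in housekeeping legitimately count as allocated are all hypothetical.

```python
from typing import Dict, Iterable, Set


def find_orphaned_ranks(
    allocated: Iterable[int],
    active_jobs: Dict[str, Iterable[int]],
    housekeeping: Iterable[int] = (),
) -> Set[int]:
    """Return ranks marked allocated that nothing appears to hold.

    allocated    -- ranks the scheduler reports as allocated (gathered by hand)
    active_jobs  -- hypothetical mapping of job label -> ranks assigned to it
    housekeeping -- ranks assumed to be still running housekeeping; this
                    sketch treats them as legitimately allocated
    """
    # Start with ranks that housekeeping is assumed to be holding.
    held = set(housekeeping)
    # Add every rank that belongs to an active job.
    for ranks in active_jobs.values():
        held.update(ranks)
    # Whatever is allocated but not held is a candidate for this bug.
    return set(allocated) - held


if __name__ == "__main__":
    # Made-up example data, not taken from the affected cluster.
    allocated = {0, 1, 2, 3, 4, 5}
    active_jobs = {"job-a": [0, 1]}
    housekeeping = [2]
    print(sorted(find_orphaned_ranks(allocated, active_jobs, housekeeping)))
```

A non-empty result after the restart would match what was observed here: resources allocated in Fluxion with no active associated jobs.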

Jul 19 '24 13:07 grondo