flux-core
scheduler has inconsistent allocated resources after Flux crash/restart with housekeeping enabled
It's unclear whether this was a bug in flux-core or flux-sched, but I'm opening an issue here because we should probably check whether it is reproducible.
After a broker crash (oom-kill) and restart, Fluxion showed many resources as allocated that had no associated active jobs. The sequence of events appeared to be:
- A broker crash with running jobs
- Housekeeping was enabled
- On restart, running jobs had an exec exception raised ("Failed to create guest ns")
- When resources were released, housekeeping failed to start on ranks because they hadn't yet come online in the new instance
We may be able to manually reproduce and test this scenario on a test cluster by killing a broker with running jobs, stopping all the compute node brokers, and restarting.
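To confirm the inconsistency after a reproduction attempt, one option is to compare the ranks the scheduler reports as allocated against the ranks held by active jobs or by housekeeping. The sketch below shows only that comparison. The function name `orphaned_ranks` and the sample data are hypothetical, and it assumes the rank sets have already been collected by hand or by a script (for example from `flux resource list` and `flux jobs` output); it does not call any Flux API.

```python
def orphaned_ranks(allocated, job_ranks, housekeeping_ranks=frozenset()):
    """Return ranks the scheduler holds that nothing accounts for.

    allocated          -- ranks the scheduler reports as allocated
    job_ranks          -- mapping of active job id -> ranks assigned to it
    housekeeping_ranks -- ranks currently running housekeeping
    """
    accounted = set(housekeeping_ranks)
    for ranks in job_ranks.values():
        accounted.update(ranks)
    # Anything allocated but not accounted for is a leaked allocation.
    return sorted(set(allocated) - accounted)


# Hypothetical snapshot taken after the restart (illustrative values only).
allocated = {0, 1, 2, 3, 4}
job_ranks = {"job-a": {0, 1}}
housekeeping = {2}

print(orphaned_ranks(allocated, job_ranks, housekeeping))
```

An empty result would mean every allocated rank is accounted for; a non-empty result would match the behavior reported in this issue.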