flux-sched

fatal job exception raised on pending jobs when reloading Fluxion modules

Open • grondo opened this issue 1 year ago • 1 comment

While reloading Fluxion on elcap, several pending jobs were canceled with a fatal job exception such as:

[Jun04 14:42] exception type="alloc" severity=0 note="alloc denied due to type=\"match error\"" userid=765
[  +0.000608] clean

For reference, here are the logs at the time of the module reload:

[Jun04 14:42] broker[0]: rmmod sched-fluxion-resource
[ +14.008927] sched-fluxion-resource[0]: responding to post-shutdown sched-fluxion-resource.cancel
[ +14.009019] broker[0]: module sched-fluxion-resource exited
[ +14.012128] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.014486] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.015532] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.045507] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.087013] broker[0]: rmmod resource
[ +14.087290] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.103970] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.104489] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.104968] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.105501] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.105973] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.106463] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.122417] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122435] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122442] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122447] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122451] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122456] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122461] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122465] sched-fluxion-qmanager[0]: responding to post-shutdown sched.disconnect
[ +14.122469] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122474] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122479] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122483] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122488] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122492] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.disconnect
[ +14.122496] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122500] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122505] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122510] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122514] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122518] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122529] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122534] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122538] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122543] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122546] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122550] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122554] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122558] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122563] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122580] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122585] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122590] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122594] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122599] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122603] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122608] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122612] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122635] sched-fluxion-qmanager[0]: responding to post-shutdown sched.cancel
[ +14.122639] sched-fluxion-qmanager[0]: responding to post-shutdown sched.cancel
[ +14.122642] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122648] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122652] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.139690] broker[0]: module sched-fluxion-qmanager exited
[ +14.139745] job-manager[0]: alloc: stop due to disconnect: Success
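
As a reading aid for the reload log above, here is a minimal sketch that tallies the "responding to post-shutdown" messages per module and topic, which makes it easier to see how many alloc-related requests (sched.free, sched.cancel) were still queued when the modules were removed. It assumes only the line format shown in the log above; the function name tally_post_shutdown and the regular expression are illustrative and not part of flux-core or flux-sched.

```python
import re
import sys
from collections import Counter

# Assumed line format, taken from the log excerpt in this issue:
#   [ +14.122417] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
POST_SHUTDOWN = re.compile(
    r"^\[[^\]]*\]\s+(?P<module>[\w-]+)\[\d+\]:\s+"
    r"responding to post-shutdown (?P<topic>\S+)\s*$"
)


def tally_post_shutdown(lines):
    """Count post-shutdown responses per (module, topic) pair.

    Hypothetical helper for reading the log; lines that do not match the
    assumed format are ignored.
    """
    counts = Counter()
    for line in lines:
        match = POST_SHUTDOWN.match(line.strip())
        if match:
            counts[(match.group("module"), match.group("topic"))] += 1
    return counts


if __name__ == "__main__":
    # Read a saved log excerpt from standard input and print one row per pair.
    for (module, topic), n in sorted(tally_post_shutdown(sys.stdin).items()):
        print(f"{n:4d}  {module}  {topic}")
```

If the excerpt is saved to a file (the name reload.log is only an example), the script can be run as python3 tally_post_shutdown.py < reload.log.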

grondo • Jun 04 '24 22:06

Note that in this particular case, we had to kill the flux module remove sched-fluxion-qmanager command, which was hanging because of the leaked alloc requests issue (I can't find that issue right now; feel free to link it here if you find it).

grondo • Jun 04 '24 22:06