flux-sched
Partial cancel not releasing rabbit resources (?)
Snipped results of flux dmesg on hetchy:
2024-08-27T01:46:53.149076Z sched-fluxion-resource.err[0]: run_remove: dfu_traverser_t::remove (id=152883667495027712): mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149167Z sched-fluxion-resource.err[0]: ssd0.
2024-08-27T01:46:53.149175Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149181Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149186Z sched-fluxion-resource.err[0]: ssd1.
2024-08-27T01:46:53.149190Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149194Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149199Z sched-fluxion-resource.err[0]: ssd2.
2024-08-27T01:46:53.149204Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149208Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149218Z sched-fluxion-resource.err[0]: ssd3.
2024-08-27T01:46:53.149225Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149234Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149244Z sched-fluxion-resource.err[0]: ssd4.
2024-08-27T01:46:53.149251Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149257Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149267Z sched-fluxion-resource.err[0]: ssd5.
2024-08-27T01:46:53.149279Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149287Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149295Z sched-fluxion-resource.err[0]: ssd7.
2024-08-27T01:46:53.149303Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149309Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149316Z sched-fluxion-resource.err[0]: ssd6.
2024-08-27T01:46:53.149324Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149333Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149342Z sched-fluxion-resource.err[0]: ssd8.
2024-08-27T01:46:53.149349Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149355Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149363Z sched-fluxion-resource.err[0]: ssd9.
2024-08-27T01:46:53.149369Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149377Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149385Z sched-fluxion-resource.err[0]: ssd10.
2024-08-27T01:46:53.149391Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149397Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149405Z sched-fluxion-resource.err[0]: ssd11.
2024-08-27T01:46:53.149415Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149421Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149429Z sched-fluxion-resource.err[0]: ssd12.
2024-08-27T01:46:53.149436Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149443Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149451Z sched-fluxion-resource.err[0]: ssd13.
2024-08-27T01:46:53.149457Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149464Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149472Z sched-fluxion-resource.err[0]: ssd14.
2024-08-27T01:46:53.149480Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149487Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149495Z sched-fluxion-resource.err[0]: ssd15.
2024-08-27T01:46:53.149502Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149508Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149516Z sched-fluxion-resource.err[0]: ssd16.
2024-08-27T01:46:53.149522Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149528Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span aft
2024-08-27T01:46:53.149544Z sched-fluxion-resource.err[0]: partial_cancel_request_cb: remove fails due to match error (id=152883667495027712): Success
2024-08-27T01:46:53.150544Z sched-fluxion-qmanager.err[0]: remove: .free RPC partial cancel failed for jobid 152883667495027712: Invalid argument
2024-08-27T01:46:53.150564Z sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=parrypeak id=152883667495027712): Invalid argument
2024-08-27T01:50:11.281162Z sched-fluxion-qmanager.debug[0]: feasibility_request_cb: feasibility succeeded
2024-08-27T01:50:52.232045Z sched-fluxion-qmanager.debug[0]: feasibility_request_cb: feasibility succeeded
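Because the same `mod_plan` message repeats once per vertex, a short script can condense the log into the list of affected vertices and job ids. The following is only a sketch: it assumes every line has the `<timestamp> <module>.<level>[<rank>]: <message>` shape seen in the snippet above, and the function name `summarize` and the regular expressions are illustrative, not part of Flux.

```python
import re

# Assumed line shape, inferred only from the log snippet above:
#   <timestamp> <module>.<level>[<rank>]: <message>
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\.(\w+)\[(\d+)\]:\s*(.*)$")
# A message consisting solely of a vertex name such as "ssd0."
VERTEX_RE = re.compile(r"^(ssd\d+)\.$")
# Job ids appear as "id=<digits>" inside some messages
JOBID_RE = re.compile(r"\bid=(\d+)")


def summarize(lines):
    """Return (vertices, jobids) named in the err-level log lines.

    vertices keeps the order in which vertex names were logged;
    jobids is a set of job id strings.
    """
    vertices = []
    jobids = set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a log line in the assumed shape
        _timestamp, _module, level, _rank, message = match.groups()
        if level != "err":
            continue  # skip debug and other levels
        vertex = VERTEX_RE.match(message)
        if vertex:
            vertices.append(vertex.group(1))
        jobids.update(JOBID_RE.findall(message))
    return vertices, jobids
```

Passing the lines of the saved `flux dmesg` output to `summarize` yields the vertex names in the order they were logged, which makes it easier to see which SSD vertices were involved in the failed partial cancel.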
I also think I observed that the scheduler does not release rabbit resources when jobs complete. For instance, I ran a one-node rabbit job and then submitted another one, only for it to become stuck in SCHED.
Any thoughts on what might be going on, @milroy?