flux-sched
Resource: support partial cancel of resources external to broker ranks
Issue #1284 identified a problem where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the `mod_plan` function when an allocation is found and `mod_data.mod_type == job_modify_t::PARTIAL_CANCEL`. ~~This PR adds a `goto` statement to return `0` under this circumstance.~~ This PR has since been significantly expanded in scope to reflect a more comprehensive understanding of the problem.
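The originally proposed one-line fix can be sketched as follows. The types and function here are simplified, hypothetical stand-ins for the real ones in `resource/traversers/dfu_impl_update.cpp`, not the actual flux-sched code:

```cpp
#include <cassert>

// Simplified stand-ins for flux-sched types (names assumed for
// illustration; the real enum and struct carry more members).
enum class job_modify_t { CANCEL, PARTIAL_CANCEL, VTX_CANCEL };

struct mod_data_t {
    job_modify_t mod_type;
};

// Sketch: once an existing allocation span is found during a
// PARTIAL_CANCEL traversal, the remainder of mod_plan (span removal
// and error checks meant for a full cancel) should be skipped and 0
// returned. The original fix expressed this as a `goto` past the
// rest of the function body.
int mod_plan_sketch (bool allocation_found, const mod_data_t &mod_data)
{
    if (allocation_found
        && mod_data.mod_type == job_modify_t::PARTIAL_CANCEL) {
        return 0; // skip the rest of mod_plan for a partial cancel
    }
    // ... remaining mod_plan work would run here for a full cancel ...
    return -1; // placeholder fall-through result in this sketch
}
```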
This line is causing some of the errors reported in the related issues: https://github.com/flux-framework/flux-sched/blob/996f999c9bc398845f91ea58851e517de63ad677/resource/traversers/dfu_impl_update.cpp#L436

That error condition (`rc != 0`) occurs because a partial cancel successfully removes the allocations of the other resource vertices (especially `core`, which is installed in all pruning filters by default) because they have broker ranks. However, when the final `.free` RPC fails to remove an `ssd` vertex allocation, the full cleanup cancel exits with an error when it hits the vertices it has already cancelled.
This PR adds support for a cleanup cancel after a partial cancel by skipping the inapplicable error check for non-existent `planner` spans in the error criterion for removal of `planner_multi` spans.
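The relaxed error criterion can be sketched as below. This is a hypothetical simplification: the stub models the span-removal call's failure modes through a return code and errno (EINVAL standing in for "span does not exist"), rather than using the real `planner_multi` API:

```cpp
#include <cerrno>

// Sketch of the relaxed error criterion during a cleanup (full)
// cancel that follows a failed partial cancel. Spans on rank-bearing
// vertices (e.g. core) were already removed by the partial cancel, so
// "span does not exist" must be tolerated on those vertices instead
// of being treated as a hard error. The errno value and function
// name are assumptions for illustration only.
int rem_span_tolerant (int rem_rc, int rem_errno, bool cleanup_cancel)
{
    if (rem_rc == 0)
        return 0;   // span removed normally
    if (cleanup_cancel && rem_errno == EINVAL)
        return 0;   // span already removed by the partial cancel: not an error
    return -1;      // genuine removal failure
}
```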
Two related problems needed to be solved: handling partial cancel for brokerless resources when the default pruning filter is set (`ALL:core`), and when pruning filters are set for the resources excluded from broker ranks (e.g., `ALL:ssd`). In preliminary testing, supporting both was challenging because with `ALL:core` configured, the final `.free` RPC frees all `planner_multi`-tracked resources, which prevents a cleanup, full cancel. However, tracking additional resources (e.g., `ALL:ssd`) successfully removes resource allocations on those vertices only with a cleanup cancel.
This PR adds support for rank-based partial cancel with resources that don't have a rank under the `rv1_nosched` match format.
Update: after further investigation, issue #1305 is related as well. This PR also aims to address issue #1309.