flux-sched

Resource: support partial cancel of resources external to broker ranks

Open milroy opened this issue 5 months ago • 16 comments

Issue #1284 identified a problem where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the mod_plan function when an allocation is found and mod_data.mod_type == job_modify_t::PARTIAL_CANCEL. ~This PR adds a goto statement to return 0 under this circumstance.~ The scope of this PR has been significantly expanded to reflect a more comprehensive understanding of the problem.

The line at https://github.com/flux-framework/flux-sched/blob/996f999c9bc398845f91ea58851e517de63ad677/resource/traversers/dfu_impl_update.cpp#L436 is causing some of the errors reported in the related issues. That error condition (rc != 0) occurs because a partial cancel successfully removes the allocations of the other resource vertices (notably core, which is installed in all pruning filters by default), since those vertices have broker ranks. However, when the final .free RPC fails to remove an ssd vertex allocation, the full cleanup cancel exits with an error as soon as it reaches the vertices that have already been cancelled.
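The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_vertex`, `toy_graph`, `toy_partial_cancel`, `toy_full_cancel_strict`, and `toy_make_job` are hypothetical names invented for illustration and are not part of flux-sched, and the real traverser and planner logic is considerably more involved.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <set>
#include <string>

// Hypothetical stand-in for a resource vertex; not a flux-sched type.
struct toy_vertex {
    std::optional<int> rank;  // empty for resources without a broker rank (e.g. ssd)
    bool allocated = false;
};

// Vertices keyed by name; std::map iterates in lexicographic key order.
using toy_graph = std::map<std::string, toy_vertex>;

// Rank-based partial cancel: frees only vertices whose rank is in `ranks`.
// Vertices without a rank are never matched, so they stay allocated.
inline void toy_partial_cancel (toy_graph &g, const std::set<int> &ranks)
{
    for (auto &kv : g) {
        toy_vertex &v = kv.second;
        if (v.rank && ranks.count (*v.rank))
            v.allocated = false;
    }
}

// Strict full cancel: exits with an error at the first vertex that has
// already been freed, leaving any later vertices untouched.
inline int toy_full_cancel_strict (toy_graph &g)
{
    for (auto &kv : g) {
        toy_vertex &v = kv.second;
        if (!v.allocated)
            return -1;
        v.allocated = false;
    }
    return 0;
}

// A job holding two cores (ranks 0 and 1) and one rankless ssd.
inline toy_graph toy_make_job ()
{
    return {{"core0", {0, true}},
            {"core1", {1, true}},
            {"ssd0", {std::nullopt, true}}};
}
```

In this model, calling `toy_partial_cancel (g, {0, 1})` on the graph from `toy_make_job ()` frees both cores but leaves `ssd0` allocated. A subsequent `toy_full_cancel_strict (g)` then returns -1 on the first already-freed core and never reaches `ssd0`, which mirrors the unreleased rabbit resources.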

This PR adds support for a cleanup cancel that runs after a partial cancel. When removing planner_multi spans, the cleanup cancel skips the error check for non-existent planner spans, which is inapplicable in this case.
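The idea of tolerating missing spans during a cleanup cancel can be sketched as follows. This is a minimal illustration, not the PR's implementation: `toy_planner`, `toy_rem_span`, and the `cleanup_after_partial` flag are hypothetical names, and the actual planner_multi API and span bookkeeping differ.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical planner stand-in: maps a span id to an allocated amount.
struct toy_planner {
    std::map<int64_t, int64_t> spans;
};

// Remove a span from the planner.
// In the strict case (cleanup_after_partial == false) a missing span is an
// error. In a cleanup cancel that follows a partial cancel, the span may
// already have been removed, so a missing span is treated as success.
inline int toy_rem_span (toy_planner &p, int64_t span_id, bool cleanup_after_partial)
{
    auto it = p.spans.find (span_id);
    if (it == p.spans.end ())
        return cleanup_after_partial ? 0 : -1;
    p.spans.erase (it);
    return 0;
}
```

With this shape, removing an existing span succeeds either way, while removing it a second time fails only in the strict case. Passing the context explicitly, rather than ignoring missing spans everywhere, keeps the error check in place for ordinary full cancels.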

Two related problems needed to be solved: handling partial cancel for brokerless resources (1) when default pruning filters are set (ALL:core) and (2) when pruning filters are set for the resources excluded from broker ranks (e.g., ALL:ssd). In preliminary testing, supporting both was challenging because with ALL:core configured, the final .free RPC frees all planner_multi-tracked resources, which prevents a subsequent full cleanup cancel. However, when additional resources are tracked (e.g., ALL:ssd), the allocations on those vertices are removed successfully only by a cleanup cancel.

This PR adds support for rank-based partial cancel of resources that don't have a rank when the rv1_nosched match format is used.

Update: after further investigation, issue #1305 is related as well. This PR also aims to address issue #1309.

milroy, Sep 05 '24 00:09