
Resource: support partial cancel of resources external to broker ranks

Open · milroy opened this pull request 1 year ago · 16 comments

Issue #1284 identified a problem where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the mod_plan function when an allocation is found and mod_data.mod_type == job_modify_t::PARTIAL_CANCEL. ~~This PR adds a goto statement to return 0 under this circumstance.~~ The scope of this PR has since been significantly expanded to reflect a more comprehensive understanding of the problem.
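Roughly, the control flow described above looks like the following standalone sketch. The names `has_allocation` and `update_planner_spans` are illustrative stand-ins for the real traverser/planner calls, and the actual mod_plan in dfu_impl_update.cpp is considerably more involved:

```cpp
#include <cstdint>

// Stand-in for the traverser's job_modify_t; only the members used here are shown.
enum class job_modify_t { CANCEL, PARTIAL_CANCEL };

struct modify_data_t {
    job_modify_t mod_type = job_modify_t::CANCEL;
};

// Hypothetical helpers standing in for the real traverser/planner calls.
bool has_allocation (int64_t jobid) { return jobid > 0; }
int update_planner_spans (int64_t /*jobid*/) { return 0; }

int mod_plan (int64_t jobid, const modify_data_t &mod_data)
{
    // If this vertex already holds an allocation for the job and the request is a
    // partial cancel, skip the rest of the plan-modification logic.
    if (has_allocation (jobid) && mod_data.mod_type == job_modify_t::PARTIAL_CANCEL)
        return 0;
    return update_planner_spans (jobid);
}
```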

This line is causing some of the errors reported in the related issues: https://github.com/flux-framework/flux-sched/blob/996f999c9bc398845f91ea58851e517de63ad677/resource/traversers/dfu_impl_update.cpp#L436. That error condition (rc != 0) occurs because a partial cancel successfully removes the allocations on the other resource vertices (especially core, which is installed in all pruning filters by default), since those vertices have broker ranks. However, when the final .free RPC fails to remove an ssd vertex allocation, the full cleanup cancel exits with an error when it hits the vertices it has already cancelled.

This PR adds support for a cleanup cancel after a partial cancel by skipping the (now inapplicable) check for non-existent planner spans in the error criterion for removing planner_multi spans.
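A minimal sketch of that skip, with illustrative names (`lookup_span` and `remove_span` are hypothetical stand-ins for the planner_multi span lookup/removal calls in the traverser):

```cpp
#include <cstdint>

// Hypothetical stand-ins for the planner_multi span lookup/removal calls.
int64_t lookup_span (int64_t /*jobid*/) { return -1; }  // -1: span no longer exists
int remove_span (int64_t /*span*/) { return 0; }

// During a cleanup cancel that follows a partial cancel, a missing span is expected
// (the partial cancel already removed it) and is not treated as an error.
int cleanup_remove (int64_t jobid, bool after_partial_cancel)
{
    int64_t span = lookup_span (jobid);
    if (span == -1)
        return after_partial_cancel ? 0 : -1;
    return remove_span (span);
}
```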

Two related problems needed to be solved: handling partial cancel for brokerless resources when only the default pruning filter is set (ALL:core), and handling it when pruning filters are also set for the resources excluded from broker ranks (e.g., ALL:ssd). In preliminary testing, supporting both was challenging: with ALL:core configured, the final .free RPC frees all planner_multi-tracked resources, which prevents a cleanup, full cancel. However, tracking additional resources (e.g., ALL:ssd) removes the resource allocations on those vertices only with a cleanup cancel.

This PR adds support for rank-based partial cancel of resources that don't have a broker rank, under the rv1_nosched match format.

Updates: after further investigation, issue #1305 is related as well. This PR also aims to address issue #1309.

milroy avatar Sep 05 '24 00:09 milroy

@jameshcorbett can I set MWP or do we need a test case to check for this in Fluxion?

It would be good to know if the partial cancel behaves as expected when encountering this issue.

milroy avatar Sep 05 '24 01:09 milroy

Yeah, it would be good to have a test case somehow. I put this PR in flux-sched v0.38.0 via a patch, so I don't think there's as much hurry to merge it. I'm also working on a test case in the flux-coral2 environment, but it's not working quite right yet, and the tests are slow, so it's taking a while.

I can provide a graph and jobspec that will hit the issue, but I don't know about the pattern of partial-cancel RPCs.

jameshcorbett avatar Sep 05 '24 04:09 jameshcorbett

After more thought, I think we need to add a testsuite check for this issue.

milroy avatar Sep 05 '24 04:09 milroy

Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away. I submitted a bunch of identical rabbit jobs back-to-back and the first couple go through and then one gets stuck in SCHED, which is what I expect to happen when fluxion thinks all of the ssd resources are allocated.

jameshcorbett avatar Sep 05 '24 15:09 jameshcorbett

> Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away.

Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh

What are the scheduler and queue policies set to in the coral2 tests? Edit: I just noticed this PR: https://github.com/flux-framework/flux-coral2/pull/208, but it would be good to have a test in sched, too.

milroy avatar Sep 05 '24 17:09 milroy

> What are the scheduler and queue policies set to in the coral2 tests?

Whatever the defaults are I think, there's no configuration done.

> Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh

Thanks for the pointer, I should be able to look into this tomorrow!

jameshcorbett avatar Sep 06 '24 06:09 jameshcorbett

@jameshcorbett this PR is almost ready for review. First, I'd like to integrate the tests you wrote that reproduced the behavior and helped me fix it. Can you make a PR to my fork on the issue-1284 branch (https://github.com/milroy/flux-sched/tree/issue-1284)?

Alternatively I can cherry pick your commits and push them to my fork.

milroy avatar Oct 17 '24 07:10 milroy

> @jameshcorbett this PR is almost ready for review. First, I'd like to integrate the tests you wrote that reproduced the behavior and helped me fix it. Can you make a PR to my fork on the issue-1284 branch (https://github.com/milroy/flux-sched/tree/issue-1284)?
>
> Alternatively I can cherry pick your commits and push them to my fork.

Working on that now.

jameshcorbett avatar Oct 17 '24 19:10 jameshcorbett

Could you refactor the tests and put them under t/issues/ so the issue test driver script executes them?

milroy avatar Oct 17 '24 19:10 milroy

I realized the brokerless resource cancellation currently implemented in this PR will lead to highly undesirable behavior. All the brokerless vertices will get cancelled in the first partial cancel traversal. That means they will be allocatable before the final .free, which can lead to state inconsistency between sched and core.

The best approach may be to propagate the final free to the traverser, which can then execute a cleanup, full cancel upon detecting brokerless resources. That has the added bonus of not triggering the if (full_removal || final) condition in qmanager, which will allow that condition to remain an error state.

A simpler alternative would be to always execute a full cancel when the final .free RPC is received. The disadvantage of that approach is additional overhead if there are no brokerless resources.
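To make the simpler alternative concrete, it would look roughly like this on the qmanager side (a sketch with illustrative names, not the actual queue_policy_base.hpp code; `partial_cancel` and `full_cancel` stand in for the traverser entry points):

```cpp
#include <cstdint>

struct free_request_t {
    int64_t jobid = 0;
    bool final = false;  // set on the last .free RPC for the job
};

// Hypothetical stand-ins for the traverser entry points.
int partial_cancel (int64_t /*jobid*/) { return 0; }
int full_cancel (int64_t /*jobid*/) { return 0; }

int handle_free (const free_request_t &req)
{
    int rc = partial_cancel (req.jobid);
    if (req.final) {
        // Always run a full cleanup cancel on the final free: simpler than
        // propagating the final-free state into the traverser, at the cost of
        // extra work when there are no brokerless resources.
        rc = full_cancel (req.jobid);
    }
    return rc;
}
```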

Thoughts @trws and @jameshcorbett?

milroy avatar Oct 21 '24 02:10 milroy

If propagating the final free state to the traverser is feasible, I think that's a great choice; if not, doing the full cancel unconditionally would be a good stopgap while we figure things out.

trws avatar Oct 21 '24 17:10 trws

I was able to implement propagating the free state to the traverser. I ran performance comparisons between the two approaches (propagation vs unconditional full cleanup cancel) and observed that the unconditional full cleanup cancel is almost always faster than propagating the free state. The exception appears to be when processing more than 1 free RPC for a single jobid.

However, I also observed errors with the free state propagation implementation under certain pruning filter configurations. For example, with the ALL:core,ALL:ssd pruning filter configuration the partial removal exits with an error on the rack1 vertex in the test JGF (issue1284.json). The error occurs when attempting to remove core types from the pruning filter span because the broker rank allocation being released includes N cores. However, rack1 only contains ssd vertices, so the attempt to remove N cores from the pruning filter core span fails because N > 0 (i.e., it attempts to remove more resources than exist).
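A toy illustration of that failure mode (not the planner_multi API; the counts are hypothetical): rack1's pruning-filter span tracks no cores, so subtracting the released broker rank's core count underflows.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Subtract per-type counts from a span; fail if any type would go negative.
int rem_from_span (std::map<std::string, int64_t> &planned,
                   const std::map<std::string, int64_t> &to_remove)
{
    for (const auto &[type, count] : to_remove) {
        if (planned[type] < count)
            return -1;  // removing more of `type` than the span contains
        planned[type] -= count;
    }
    return 0;
}

int main ()
{
    std::map<std::string, int64_t> rack1_span = { { "ssd", 4 } };            // no core entry
    std::map<std::string, int64_t> released = { { "core", 8 }, { "ssd", 0 } };  // N = 8 cores (hypothetical)
    return rem_from_span (rack1_span, released) == -1 ? 0 : 1;               // fails on "core"
}
```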

I can't think of a way to fix that error without adding substantial complexity to the traverser and planner logic. Given these findings I think the best course is to do a full cancel unconditionally upon receipt of the final .free. @trws what do you think?

milroy avatar Oct 24 '24 02:10 milroy

> The exception appears to be when processing more than 1 free RPC for a single jobid.

I expect this will be the most common case in production.

In terms of the performance difference, I timed the cancellation in qmanager for testsuite tests 1026 and 1027. For cancellations with a single .free RPC (test 1027), the full cleanup cancel takes 102.3 microseconds on average (20 runs), versus 121.8 microseconds for the free-state-propagation implementation (a 19% difference).

For test 1026, with a series of four .free RPCs, the average sum of the four cancels (across 20 runs) is 495.2 microseconds for the full cleanup cancel, versus 424.3 microseconds for the free-state-propagation implementation (a 14% difference).

milroy avatar Oct 24 '24 19:10 milroy

It sounds like the complexity favors doing the unconditional full cancel, and the performance doesn't really push away from that much either. Would you agree @milroy? If so, then let's do that. I have a feeling we should keep this in mind as we think through what we want to do when evolving the planner setup.

trws avatar Oct 24 '24 19:10 trws

> Would you agree @milroy?

Yes, that's my conclusion, too.

> I have a feeling we should keep this in mind as we think through what we want to do when evolving the planner setup.

Agreed. There's extra complexity and brittleness related to dealing with the planners in the context of allocation and resource dynamism and it manifests in unexpected ways.

milroy avatar Oct 24 '24 22:10 milroy

Looks like there's a failure on el8, specifically failing to find an error output that's there?

  [Oct24 23:36] sched-fluxion-qmanager[0]: remove: Final .free RPC failed to remove all resources for jobid 88030052352: Success
  test_must_fail: command succeeded: grep free RPC failed to remove all resources log.out
  not ok 8 - no fluxion errors logged

Maybe it's going to stdout instead of stderr and getting missed? Not sure, seems odd that it's only there and not on the other ones though.

Is this otherwise ready to go? Trying to plan my reviews etc. for the day.

trws avatar Oct 28 '24 16:10 trws

> Maybe it's going to stdout instead of stderr and getting missed? Not sure, seems odd that it's only there and not on the other ones though.

Unfortunately this looks like an error in the job removal. I don't yet know why it's only happening on el8.

> Is this otherwise ready to go? Trying to plan my reviews etc. for the day.

Beyond the CI error you identified, I also need to figure out how to test an error condition related to running the full cleanup cancel after a partial cancel. So no, unfortunately this isn't ready to go yet.

milroy avatar Oct 30 '24 01:10 milroy

> Unfortunately this looks like an error in the job removal. I don't yet know why it's only happening on el8.

The final flag on the .free RPC isn't getting set for t1027 on el8 for some reason.

milroy avatar Oct 30 '24 02:10 milroy

@trws I thought of a much simpler and cleaner way to check whether a planner_multi sub span was removed by a prior partial cancel: https://github.com/flux-framework/flux-sched/pull/1292/commits/a42de67ef8dcfe0e8bbdc9f49c37ed2d28f66f6d

The PR is ready for a final review.

milroy avatar Oct 31 '24 00:10 milroy

I should add that I don't think the PR needs additional testsuite tests for whether a planner_multi sub span was removed by a prior partial cancel. The change is very small and should have been included in the first partial cancel PR.

milroy avatar Oct 31 '24 07:10 milroy

Looks good to me @milroy, thanks for all the work on this!

trws avatar Nov 01 '24 20:11 trws

Thanks for the review! Setting MWP.

milroy avatar Nov 02 '24 00:11 milroy

Codecov Report

Attention: Patch coverage is 53.33333% with 7 lines in your changes missing coverage. Please review.

Project coverage is 75.2%. Comparing base (93589f5) to head (655dcc3). Report is 42 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| qmanager/policies/base/queue_policy_base.hpp | 45.4% | 6 Missing :warning: |
| resource/planner/c/planner_multi_c_interface.cpp | 66.6% | 1 Missing :warning: |
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1292     +/-   ##
========================================
- Coverage    75.3%   75.2%   -0.2%     
========================================
  Files         111     111             
  Lines       15979   15983      +4     
========================================
- Hits        12048   12032     -16     
- Misses       3931    3951     +20     
| Files with missing lines | Coverage Δ |
| --- | --- |
| resource/traversers/dfu_impl_update.cpp | 75.2% <100.0%> (-1.2%) :arrow_down: |
| resource/planner/c/planner_multi_c_interface.cpp | 62.6% <66.6%> (+0.1%) :arrow_up: |
| qmanager/policies/base/queue_policy_base.hpp | 78.2% <45.4%> (-0.9%) :arrow_down: |

... and 4 files with indirect coverage changes

codecov[bot] avatar Jan 31 '25 01:01 codecov[bot]