
Add support for broker rank-based partial release

Open milroy opened this issue 1 year ago • 20 comments

Fluxion issue #1151 and flux-core issue 4312 describe the need for partially releasing broker resources. The first implementation of the functionality is scoped to releasing all resources managed by a single broker rank per RPC. Elasticity considerations will require arbitrary release of resources, but this capability may only be needed in the scheduler.

In its current WIP state, this PR adds the capability to identify all boost graph vertices managed by a broker rank in the RV1 reader. The reader builds and returns a map keyed by resource type, with values corresponding to the number of resources of that type to be removed. The cancellation of the vertices occurs in the RV1 reader, and the partial cancellation and pruning filter updates occur in the traverser. Note that the map is under-specified relative to the number of resources in the resource graph. The traverser uses vectors of types and counts that are ordered by their visit order. Since the reader may unpack and find the boost vertices in any order (and lookups need to be fast), a map is the better choice of container. The complexity of translating between map and vector containers was added to the planners in PR #1061.
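To illustrate the container trade-off described above (the names and layout here are assumptions for illustration, not Fluxion's actual data structures), the reader-side map and the traverser-side vectors could relate roughly like this:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// What the RV1 reader might assemble while unpacking R for one broker rank;
// fast keyed lookup matters because vertices are found in arbitrary order.
using reduction_map_t = std::map<std::string, int64_t>;  // type -> count to remove

// What the traverser side works with: parallel vectors of types and counts
// ordered by the traverser's visit order.
struct reduction_vectors_t {
    std::vector<std::string> types;
    std::vector<int64_t> counts;
};

// Translate the map into visit-ordered vectors, filling 0 for resource types
// the reader never saw (the map is under-specified relative to the graph).
static reduction_vectors_t to_visit_order (const reduction_map_t &m,
                                           const std::vector<std::string> &visit_order)
{
    reduction_vectors_t out;
    for (const auto &type : visit_order) {
        auto it = m.find (type);
        out.types.push_back (type);
        out.counts.push_back (it == m.end () ? 0 : it->second);
    }
    return out;
}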

The planner and planner_multi need modifications to handle span reduction based on the assembled reduction counts and types. They must handle zero-count reduction entries and correctly treat a sequence of span reductions as a single span removal. A single partial cancellation that contains all planner or planner_multi resources must behave like a full span removal. The planners must also report to the client whether the span was completely removed.
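The intended semantics can be sketched with a small hypothetical example (span_t and reduce_span are illustrative names, not the real planner interfaces):

#include <cstdint>

struct span_t {
    int64_t planned = 0;  // resources currently held by the span
};

// Reduce a span by `count` resources; a count of 0 is a valid no-op.
// Sets full_removal once cumulative reductions have removed everything the
// span held, so a sequence of partial cancels ends up equivalent to a single
// full span removal and the caller can decide whether to untag the job.
static int reduce_span (span_t &span, int64_t count, bool &full_removal)
{
    if (count < 0 || count > span.planned)
        return -1;  // invalid reduction request
    span.planned -= count;
    full_removal = (span.planned == 0);
    return 0;
}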

The traverser begins a depth-first visit that stops as soon as it encounters a vertex that was untagged and cancelled by the RV1 reader. The traverser needs a new enum class to designate job modification traversal types. If the type is PARTIAL_CANCEL, the traverser must untag and erase the jobid only if the planner indicates that the removal constitutes a full removal.
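A minimal sketch of the shape this could take (the enum and helper names are assumptions, not the identifiers actually used in the traverser):

#include <cstdint>

enum class job_modify_t {
    CANCEL,          // full cancellation: always untag and erase the jobid
    PARTIAL_CANCEL,  // rank-based release: untag only once removal is complete
};

static void finish_modify (job_modify_t type, bool full_removal, int64_t jobid)
{
    // Under PARTIAL_CANCEL the jobid may still own resources elsewhere in the
    // graph, so only erase it when the planner reports a full removal.
    if (type == job_modify_t::PARTIAL_CANCEL && !full_removal)
        return;
    (void)jobid;  // untag and erase jobid from the vertex and job maps here
}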

The PR adds support in qmanager for unpacking R and calling the REAPI module's cancel function in the derived policy classes. It avoids locking up the queues upon partial or full cancellation.

Items remaining to be completed to remove the WIP tag:

  • [x] Add RPC handling
  • [x] Add REAPI support
  • [x] Add resource module implementation
  • [x] Add support in qmanager
  • [x] Add tests in testsuite

milroy avatar Apr 08 '24 08:04 milroy

@zekemorton, @trws, @jameshcorbett: please take a look at the current state when you get a chance. Getting your feedback soon will help me make sure this PR is on the right track.

milroy avatar Jun 27 '24 06:06 milroy

@milroy - Is this branch ready for any testing? I've got @garlick's housekeeping PR (flux-framework/flux-core#5818) installed on a test cluster, but with sched-simple for now since it supports partial release. If it would be helpful I could start testing with this branch instead. Our hope is to include housekeeping as an experimental feature in the next release, currently scheduled for July 2.

grondo avatar Jun 28 '24 19:06 grondo

Thanks for the initial feedback @jameshcorbett !

milroy avatar Jul 02 '24 05:07 milroy

The remaining problem with this PR is related to the new requirement for queue_policy_base_t objects to call REAPI. See here: https://github.com/flux-framework/flux-sched/blob/37457071cd3dbcb7116661b2837b56027d99abaa/qmanager/policies/queue_policy_bf_base_impl.hpp#L170

This causes tests that set FLUX_SCHED_MODULE=none to segfault.

So far, sched-resource API calls can only be made via the templated, derived classes like bf_base and fcfs. I suppose I could rework queue_policy_base_t to be templated on reapi_type, but that would probably be a big change. Another option might be to detect whether Fluxion is loaded and make the appropriate remove call in qmanager_callbacks.cpp. @trws or @jameshcorbett, do you have suggestions here?
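For context, the constraint looks roughly like this (simplified, with void * standing in for the flux handle; not the exact qmanager code):

#include <cstdint>

struct queue_policy_base_t {
    // Not templated on reapi_type, so it cannot issue the cancel RPC itself.
    virtual int remove (void *h, int64_t id) = 0;
};

template<class reapi_type>
struct queue_policy_bf_base_t : public queue_policy_base_t {
    int remove (void *h, int64_t id) override
    {
        // Only here is reapi_type visible, so only here can the
        // sched-resource cancel call be made.
        return 0;
    }
};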

milroy avatar Jul 02 '24 05:07 milroy

The test failure in t1009-recovery-multiqueue.t occurs here: https://github.com/flux-framework/flux-sched/blob/1161e61a1ca56d6a4cbad6e137e41bbe968e5444/t/t1009-recovery-multiqueue.t#L81

It hangs almost all the time, but I've seen it succeed once. I did check to make sure the cancellation on the previous line is working as expected.

milroy avatar Jul 04 '24 02:07 milroy

The other test that fails is t1013-qmanager-priority.t, and the failing subtest is here: https://github.com/flux-framework/flux-sched/blob/1161e61a1ca56d6a4cbad6e137e41bbe968e5444/t/t1013-qmanager-priority.t#L135

Edit: only t1009 fails in GitHub CI testing.

milroy avatar Jul 04 '24 02:07 milroy

These two tests are the ones I've seen break most often lately. Almost anything that changes the qmanager sched loop ordering or pending handling can break those. I'll take a look and see if I see any of the "usual suspects" that kept biting me in the other PRs.

trws avatar Jul 04 '24 03:07 trws

Nothing is jumping out at me as likely in the new remove methods, but I think you can simplify those pretty significantly. Specifically, I'd say consider doing the "cancel" in the new method, then invoking the old remove method if it's a full cancel. It may need some conditions there to make sure it only does the cancel in the right case and so-forth, but that should cut down the duplication and keep you from having to re-implement the cancellation logic in the new ones.

trws avatar Jul 04 '24 03:07 trws

Sorry, I meant the "in-progress sched loop cancellation logic" rather than the partial cancellation logic. Words hard 😐

trws avatar Jul 04 '24 03:07 trws

Specifically, I'd say consider doing the "cancel" in the new method, then invoking the old remove method if it's a full cancel.

The way partial cancel is currently implemented, it is the same as a full cancel if it cancels all the job's resources (or all the job's remaining resources if it's the final RPC in a sequence of partial cancels). So it's redundant to invoke a full cancel afterward.

milroy avatar Jul 06 '24 01:07 milroy

I found a way to get the tests to pass but I'm not too happy with it. If the sched loop is invoked after partial cancel in the derived-class remove everything works as expected...

milroy avatar Jul 06 '24 01:07 milroy

@grondo this PR is ready for the test cluster whenever you get a chance.

I'm going to update the commit messages soon and will drop the WIP label.

milroy avatar Jul 06 '24 01:07 milroy

Nice! I'll put it on mine tomorrow.

garlick avatar Jul 06 '24 02:07 garlick

I found a way to get the tests to pass but I'm not too happy with it. If the sched loop is invoked after partial cancel in the derived-class remove everything works as expected...

With the changes in this PR the 10,000 job submit and cancel test with blocked jobs now runs in 31-32 seconds (versus 47s with master) in the bookworm container on my laptop. @trws you may want to run a test as well.

Edit: apparently I wasn't testing against current master, because I'm seeing 31s test runtimes currently.

milroy avatar Jul 06 '24 03:07 milroy

This PR is ready for review, and I'll tag folks after the weekend.

milroy avatar Jul 06 '24 08:07 milroy

This is on my test cluster (v0.36.0-26-g34b5f692) right now and things are looking great so far for simple runs. I have to run an errand and then I'll try some more stuff. Any particular things I should try to probe?

Edit: what I've done so far is 3-4 iterations of

  1. configure housekeeping with a 30s timeout
  2. add sleep inf to one node's housekeeping tasks (to simulate a stuck node)
  3. submit a job that runs on N nodes including the stuck one
  4. after 30s, N-1 nodes are freed and I can submit another job that runs on N-1 nodes
  5. run flux housekeeping kill --all
  6. observe that the stuck node is now drained, then undrain it
  7. goto step 3

garlick avatar Jul 06 '24 15:07 garlick

Specifically, I'd say consider doing the "cancel" in the new method, then invoking the old remove method if it's a full cancel.

The way partial cancel is currently implemented, it is the same as a full cancel if it cancels all the job's resources (or all the job's remaining resources if it's the final RPC in a sequence of partial cancels). So it's redundant to invoke a full cancel afterward.

That makes sense, the idea though is to avoid the logic triplication for removals. As an alternative, could each of the derived classes provide a function just implementing the difference, which looks like this part:

if ( (rc = reapi_type::cancel (h, job_it->second->id, R, true,
                               full_removal) != 0))
    break;

That way we can have two entry points to remove implemented in the queue_policy_base class: the partial cancel one that takes a handle and R, and the id-only one that would delegate to the partial cancel one. The partial cancel remove could then call a virtual method for just the cancel call if it has been provided with R and a handle, and we only need one copy of the state, loop, and blocked-job handling logic.
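As a rough sketch of that shape (placeholder names, with void * standing in for the flux handle; not the final qmanager interfaces):

#include <cstdint>
#include <string>

struct queue_policy_base_t {
    // Entry point 1: id-only removal delegates to the partial-cancel form.
    int remove (int64_t id)
    {
        return remove (nullptr, id, "");
    }

    // Entry point 2: partial cancel taking a handle and R. The shared state
    // machine, sched-loop, and blocked-job handling live here, once.
    int remove (void *h, int64_t id, const std::string &R)
    {
        bool full_removal = true;
        if (h && !R.empty ()) {
            // Delegate only the REAPI cancel call to the derived class.
            if (cancel (h, id, R, full_removal) < 0)
                return -1;
        }
        if (full_removal) {
            // existing full-removal bookkeeping goes here
        }
        return 0;
    }

protected:
    // The one piece the derived, reapi_type-templated policies implement.
    virtual int cancel (void *h, int64_t id, const std::string &R,
                        bool &full_removal) = 0;
};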

Anyway, I'll do a deeper look here, but keeping the control flow for qmanager sane is really, really tricky because of the way it's laid out right now. If possible it would be good to minimize the number of alternate code paths we need to worry about that deal with the state machine.

trws avatar Jul 06 '24 19:07 trws

On my test cluster (v0.36.0-26-g34b5f692) right now and things are looking great so far for simple runs.

Good news so far; thanks for the testing! It's needed given the complexity of this change.

Any particular things I should try to probe?

Yes, please check that a sequence of partial releases is equivalent to a full cancel: submit a job on N nodes, issue N partial releases, then try to schedule the exact same nodes. Another good test is to release ranks with different resources like cores and GPUs (if you have or can simulate them). Finally, please test partial release of non-exclusive jobs, especially with cores and GPUs, if possible.

milroy avatar Jul 06 '24 19:07 milroy

That makes sense, the idea though is to avoid the logic triplication for removals. As an alternative, could each of the derived classes provide a function just implementing the difference

So that way we can have two entry points to remove implemented in the queue_policy_base class, the partial cancel one that takes a handle and R, and the id-only one that would delegate to the partial cancel one.

Ah, I see what you mean. I'll get to work on it!

If possible it would be good to minimize the number of alternate code paths we need to worry about that deal with the state machine.

Yeah, we don't want to make it any more complicated than absolutely necessary.

milroy avatar Jul 06 '24 19:07 milroy

The run_sched_loop in remove is concerning me though. After some testing locally, it looks like the only test that fails with the run_sched_loop removed is t1009-recovery-multiqueue

Yes, that's the reason I added run_sched_loop. It didn't seem like the right thing to do, but it was the only thing I tried that allowed t1009-recovery-multiqueue.t to pass without modifying the test.

Anyway, as I read it, take the run_sched_loop out of the remove method, and make t1009 accept either job3 or job4 starting within the timeout. Just one of them has to, don't care which.

It looks like job3 starts before job4 every time now. Is there a way to use a boolean operator with wait-event, or is it fine to just do flux job wait-event -t 10 ${jobid3} start || flux job wait-event -t 10 ${jobid4} start?

milroy avatar Jul 09 '24 00:07 milroy

Yeah, I'm seeing it start job3 every time as well, but knowing that code I'm not at all certain that's guaranteed. The first thought that comes to mind is to make them waitable and use flux job wait to wait for the first of the two (since it grabs whichever finishes first), but then they would have to be switched to short runtimes. I'm not sure if wait-event has a similar option; worth a look (I'm on a phone right now and couldn't find it quickly).

trws avatar Jul 09 '24 00:07 trws

No option in wait-event to wait on multiple jobs (a good idea though), but it would be trivial to write a python script to do it.

grondo avatar Jul 09 '24 01:07 grondo

No option in wait-event to wait on multiple jobs (a good idea though), but it would be trivial to write a python script to do it.

It's not important for this PR. I can just use || between the two commands. The worst-case scenario is that the test takes 20s to fail instead of 10s. I favor the two-command approach over shortening the test job runtimes because shorter runtimes might introduce races with the cancellations, given the previous wait-events on start and submit events.

milroy avatar Jul 09 '24 01:07 milroy

Anyway, as I read it, take the run_sched_loop out of the remove method, and make t1009 accept either job3 or job4 starting within the timeout. Just one of them has to, don't care which.

@trws: I've addressed your concern in qmanager, but let me know if you want me to treat t1009 differently.

milroy avatar Jul 09 '24 06:07 milroy

Works for me, if it ever comes up as a perf concern we can deal with it then. Great job @milroy, love the cleanup you did along the way.

trws avatar Jul 09 '24 13:07 trws

Feel free to land this one, I'll re-up the reformat and then fix the check requirements so we don't have to rerun all the checks here.

trws avatar Jul 09 '24 13:07 trws

Since this was also asked in slack, here's a small script to wait for the start event for any number of jobs and exit with success on the first:

import sys
import flux
from flux.job import JobID, event_watch_async

found = False
handle = flux.Flux()

def event_cb(future, jobid):
    global found
    event = future.get_event()
    print(f"{jobid.f58}: {event}")
    if event.name == "start":
        # Stop on the first job to reach its start event
        found = True
        handle.reactor_stop()

for jobid in sys.argv[1:]:
    jobid = JobID(jobid)
    event_watch_async(handle, jobid).then(event_cb, jobid)

handle.reactor_run()
if not found:
    print("start event for job never appeared", file=sys.stderr)
    sys.exit(1)

grondo avatar Jul 09 '24 14:07 grondo

Oh cool, thanks @grondo!

trws avatar Jul 09 '24 16:07 trws

@milroy, ready for MWP?

trws avatar Jul 09 '24 19:07 trws

@trws, @milroy, @jameshcorbett, what do you think about tagging another flux-sched release once this goes in? Then we can get a working housekeeping setup on fluke for testing asap.

grondo avatar Jul 09 '24 19:07 grondo