
need a way for job manager epilog to implement "partial release"

Open garlick opened this issue 3 years ago • 26 comments

Problem: the current job manager epilog posts epilog-start and epilog-finish events, and no job resources can be freed until epilog-finish. If the job manager epilog does something that could take a long time on a subset of nodes, then there is no opportunity to release a partial set of resources back to the scheduler.

One idea floated by @grondo was to include an idset in the context of the epilog-finish event, like the release event. Both epilog-finish and release would decrement a refcount on a set of execution targets, and the free to the scheduler for a given target would occur once its count reaches zero.
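To make the idea concrete, the eventlog entries for a job whose epilog finishes in two batches might look something like this (a hypothetical shape, shown only for discussion; the ranks idset mirrors the one already carried by the release event):

```json
{"timestamp":1651527600.1,"name":"epilog-finish","context":{"ranks":"0-3"}}
{"timestamp":1651527655.8,"name":"epilog-finish","context":{"ranks":"4-7"}}
```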

garlick avatar May 02 '22 21:05 garlick

Just a note that we should update RFC 21 as well.

grondo avatar May 02 '22 21:05 grondo

The rabbit setup which motivated this discussion goes something like this:

  1. User's executable finishes
  2. Flux tells DWS (via kubernetes API) to unmount rabbit FS's on compute nodes
  3. (a) If (2) succeeds, compute nodes can be returned to the scheduler
     (b) If (2) doesn't succeed within time t (for some t we choose), ping kubernetes for a list of nodes that have succeeded unmounting and release those nodes. Continue pinging every t seconds (maybe with some exponential backoff or similar) until complete success is reached.

The way I had planned to implement this was to add a job manager epilog which would send an RPC to a Python script, which would talk to kubernetes and then respond to the RPC either with a complete success message or some list of successful nodes, terminated (hopefully) by a final complete success message.
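A rough sketch of what that Python responder could look like, under some assumptions: the `rabbit.epilog` service name and the `dws_unmounted_nodes()` helper are made up for illustration, the RPC is assumed to be a streaming one so multiple responses are allowed, and the polling/backoff values are placeholders (a real version also wouldn't block inside a reactor callback):

```python
#!/usr/bin/env python3
"""Hypothetical sketch of an epilog responder that reports partial unmount success."""
import time

import flux
import flux.constants


def dws_unmounted_nodes(jobid):
    """Placeholder: query kubernetes/DWS for nodes that have finished unmounting."""
    raise NotImplementedError


def epilog_cb(handle, watcher, msg, arg):
    jobid = msg.payload["id"]
    wanted = set(msg.payload["nodes"])
    reported = set()
    delay = 10.0
    while reported != wanted:
        newly_done = (dws_unmounted_nodes(jobid) & wanted) - reported
        if newly_done:
            # Stream a partial result so these nodes can be released early.
            handle.respond(msg, {"id": jobid, "nodes": sorted(newly_done), "final": False})
            reported |= newly_done
        if reported != wanted:
            time.sleep(delay)
            delay = min(delay * 2, 300.0)  # back off, capped at 5 minutes
    # Terminate (hopefully) with a final complete-success message.
    handle.respond(msg, {"id": jobid, "nodes": [], "final": True})


def main():
    h = flux.Flux()
    h.service_register("rabbit").get()
    h.msg_watcher_create(
        epilog_cb, flux.constants.FLUX_MSGTYPE_REQUEST, "rabbit.epilog"
    ).start()
    h.reactor_run()


if __name__ == "__main__":
    main()
```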

jameshcorbett avatar May 03 '22 18:05 jameshcorbett

I am wondering if there might be some trickiness arising from the fact that jobs will have resources (rabbits) that aren't associated with nodes? For instance, there will be cases where all the compute nodes are ready to be freed but the rabbits aren't.

jameshcorbett avatar May 03 '22 19:05 jameshcorbett

Yeah, that's a bit tricky. We cut some corners in the current job manager / exec system / scheduler design, so we use "execution targets" (broker ranks) to refer to subsets of R. That is what the idset we discussed returning from the job manager epilog would represent. That does not quite work for resources that are not associated with an execution target, like (apparently) the rabbits.

Aside: just had a quick review of RFC 27/Resource Allocation Protocol and noted that we will need to change it to support partial release, since currently a free request just contains the job ID, which the scheduler can use to look up R. There's no way to refer to partial R with an idset.

I didn't have any great ideas offhand. This needs pondering.

garlick avatar May 03 '22 19:05 garlick

@grondo and I talked about it on the coffee call and he proposed putting the partial job R into the free request rather than an idset. He noted that this would work for sched-simple because it frees resources based on the R passed to free_cb in src/common/libschedutil/ops.c, so it could handle partial release that way, but fluxion might be more complicated. I will talk to @dongahn about it later today.
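For illustration, a partial free request along those lines might carry an R fragment in addition to the job ID (this is just a hypothetical payload shape; as noted above, RFC 27 would need to be updated to define the real one):

```json
{
  "id": 1234560000000,
  "R": {
    "version": 1,
    "execution": {
      "R_lite": [{"rank": "0-3", "children": {"core": "0-63"}}]
    }
  }
}
```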

But yeah the trigger to free the rabbits is independent of the compute nodes. For the compute nodes, we can free them once the job shells have stopped and rabbit software tells us that the file systems have been unmounted. For the rabbits, we can free them once the compute nodes can be freed and rabbit software tells us that user data has been safely moved off of the rabbits and the rabbit file systems have been cleaned up.

jameshcorbett avatar May 04 '22 21:05 jameshcorbett

There is an additional complication, which is that Flux can technically alert the user that their job has completed before the last condition has been reached (that the rabbit file systems have been cleaned up).

Since I'm guessing that would be very difficult to implement, I don't think it would be too big of a deal to ignore that part and only mark the job as completed once all the conditions are met and all the resources have been freed. If the FS clean-up outright fails, it's fine as long as we can still mark the job as succeeding. If the FS clean-up hangs, there wouldn't be any data loss, the user just wouldn't know that.

jameshcorbett avatar May 04 '22 22:05 jameshcorbett

If the FS clean-up hangs, there wouldn't be any data loss, the user just wouldn't know that.

Maybe a solution could be that once the rabbit software tells us that the user's data is secure, we post an eventlog entry saying so.

jameshcorbett avatar May 04 '22 22:05 jameshcorbett

There is an additional complication, which is that Flux can technically alert the user that their job has completed before the last condition has been reached (that the rabbit file systems have been cleaned up).

I might be misunderstanding, but the job manager should not issue the clean event, and the job would not go into the INACTIVE state, until all resources have been released, not just the compute nodes. Therefore, an entity that needs to wait until the user's data is secure could wait for the clean event for a job; entities that only need to wait until the job tasks or initial program are complete can just wait for the finish event. (Side note: flux job attach currently waits for the clean event, but should probably only wait for the finish event. I thought there was an open issue on this, but can't find it ATM.)

grondo avatar May 05 '22 15:05 grondo

I might be misunderstanding, but the job manager should not issue the clean event, and the job would not go into the INACTIVE state, until all resources have been released, not just the compute nodes. Therefore, an entity that needs to wait until the user's data is secure could wait for the clean event for a job; entities that only need to wait until the job tasks or initial program are complete can just wait for the finish event.

What I was trying to get at is that with the rabbits there might be three events the user cares about, rather than just clean and finish:

  1. job tasks or initial program finishes (finish)
  2. Rabbit data is secure (no name for this one, but maybe call it data_out)
  3. All resources have been released (clean)

and 2 would always happen before 3.

So yeah the user who cares about their data could wait for 3, since 3 implies 2 in all the cases I can think of right now, but I was wondering whether it would be good to have a separate event, particularly for cases where 3 might not come for a long time after 2 for whatever reason.

jameshcorbett avatar May 05 '22 16:05 jameshcorbett

Ok, understood, and that makes sense. A node could be hung in the epilog for some reason (a somewhat common occurrence), so the clean event could be delayed, but the job's rabbit data could still be secure so a separate event here makes sense.

Edit: I wonder if we should favor a specific event name in this case though, or if the case is general enough that we should add a new event to RFC 21. Something to consider.

grondo avatar May 05 '22 16:05 grondo

I don't think this issue (or https://github.com/flux-framework/flux-core/issues/2204) made it into my production RM features list, but I was thinking about it this week because a 16 node job on tioga wasn't being released due to one node stuck in the epilog. If it's something that would be reasonably easy to implement, it would be nice to have. If it's not easy, it should probably stay lower priority than other things. We need to fix the thing that's hanging in the epilog anyway.

ryanday36 avatar Feb 29 '24 00:02 ryanday36

Since more nodes potentially get idled when a large job is stuck in the epilog compared to a small one, and it currently only takes one straggler, it seems like this could get really annoying on el cap as we scale up.

@grondo and I were chatting today about things adjacent to this (concerning rabbits and when to have the job tools declare a job complete), and one idea that came up was that the system epilog script could be decoupled from the job and run after the job reaches the INACTIVE state. Then it might be quick and easy to implement partial release of resources to the scheduler as the epilog completes, since it wouldn't be dependent on the big exec system rewrite.

As an alternative to decoupling the system epilog script, we could add a new, decoupled system script. Maybe some things in the epilog really should run while the job is in the CLEANUP state and be "billed" to the user and logged in their eventlog, as opposed to being treated as system overhead or whatever. Other things, like running ansible, seem like clear candidates for decoupling.

Anyway, the original Flux design (not yet fully realized, but to an extent planned for in the code) was that a job would free R back to the scheduler in multiple fragments as batches of the epilog completed. Unfortunately, it looks like fluxion ignores the R fragment it receives in the free callback, and instead just uses the job ID (which is in the message for request/response matching purposes) to free all the resources associated with the job on the first free callback.

https://github.com/flux-framework/flux-sched/blob/master/qmanager/modules/qmanager_callbacks.cpp#L266

So if we do this, some work will be needed in fluxion.

garlick avatar Mar 07 '24 06:03 garlick

I quickly reviewed the fluxion code, and the change seems manageable. Setting aside the work on qmanager, the fluxion-resource service takes care of deallocating resources by traversing the resource tree with the jobid and removing allocations tagged with that jobid. A viable strategy could involve taking the R fragment during the removal process and deallocating only the resources that the R fragment covers. I can provide more guidance on this task if needed. The ability to return resources partially in large-scale systems is crucial. I can look some more over the weekend and add more suggestions.

https://github.com/flux-framework/flux-sched/blob/master/resource/modules/resource_match.cpp#L1871
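Not fluxion code, but here's a toy sketch (in Python, with made-up resource names) of the bookkeeping change partial cancel implies: instead of dropping every allocation tagged with the jobid, drop only the resources covered by the R fragment, and treat the job as fully cancelled once nothing tagged with it remains:

```python
# Toy model: allocations maps jobid -> set of resource ids currently held.
allocations = {1234: {"node0", "node1", "node2", "node3"}}


def full_cancel(jobid):
    """Current behavior: drop every allocation tagged with the jobid."""
    allocations.pop(jobid, None)


def partial_cancel(jobid, freed):
    """Proposed behavior: drop only the resources covered by the R fragment.

    Returns True when the last fragment has been freed.
    """
    held = allocations.get(jobid)
    if held is None:
        return False
    held -= set(freed)
    if not held:
        del allocations[jobid]
        return True
    return False


# Example: two partial releases that add up to the full allocation.
partial_cancel(1234, ["node0", "node1"])          # first batch finishes epilog
done = partial_cancel(1234, ["node2", "node3"])   # stragglers finish later
assert done and 1234 not in allocations
```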

dongahn avatar Mar 07 '24 06:03 dongahn

Hi @dongahn, thanks for chiming in!

I opened flux-framework/flux-sched#1151 for the fluxion specific discussion.

garlick avatar Mar 07 '24 15:03 garlick

We have actually had user requests for the ability to release parts of their allocations early as well. A bit of a refactor is needed, but we need this for elasticity and for production, so I'll try to move this up the priority list. Might take a crack at it myself.

trws avatar Mar 29 '24 01:03 trws

We should open a separate issue for voluntary early release. That's a pretty interesting idea, and resiliency work done recently for Flux instances and the job shell should make it possible to terminate non-critical shell and broker ranks of the job while the job keeps going.

For this specific issue, note that @garlick has a proof of concept proposed in #5818 which should address the major pain points. (I'm not sure if you meant you were going to take a crack at the job manager or Fluxion support needed for partial release, which is why I mention it)

grondo avatar Mar 29 '24 02:03 grondo

Thanks @grondo, agreed there would be more work to do for eager release. I meant to look at the fluxion side, though at first glance I need to work through where the free RPC actually gets handled right now. As it sits, I only see a handler for a cancel RPC, which does do this but has to come through a slightly different path, I guess?

trws avatar Mar 29 '24 17:03 trws

The discussion in flux-framework/flux-sched#1151 may be helpful. The protocol is described in RFC 27, which as @garlick pointed out needs an update since it only currently describes a single free response.

grondo avatar Mar 29 '24 18:03 grondo

The handlers are not message handlers b/c we abstracted the scheduler interface in "libschedutil" (for better or worse):

alloc: https://github.com/flux-framework/flux-sched/blob/master/qmanager/modules/qmanager_callbacks.cpp#L190

free: https://github.com/flux-framework/flux-sched/blob/master/qmanager/modules/qmanager_callbacks.cpp#L266

garlick avatar Mar 29 '24 18:03 garlick

Thanks both of you, that's very helpful. They're all registered in qmanager.cpp, it's a bit tangled but now that I know where the root is it's much easier to follow.

trws avatar Mar 29 '24 18:03 trws

I've got some time to work on this in Fluxion and will try to get a WIP PR out with support in the traverser, module, planners/pruning filter, and reader soon. @trws let me know if you're making progress so we can avoid effort duplication.

A thought exercise to check my understanding and determine if it's ever useful to cancel a job in Fluxion based only on the jobid (which will likely be faster than processing a sequence of partial releases):

Will there be a way to distinguish, on a per-RPC basis, between the case where the union of all R fragments in a sequence of sched.free RPCs for a single jobid is equal to the full R for the jobid, and the case where it is not? In other words, will there be a way to detect if the first sched.free RPC indicates an eventual full cancellation of the job?

I doubt there's a valid use case for distinguishing between the two. If distinguishing were possible and Fluxion waited for the last sched.free RPC in the sequence to run a full cancellation based on the jobid, the resources corresponding to the earlier R fragments would remain blocked from allocation in the resource graph in the meantime. The reverse (issuing a full cancellation upon receipt of the first sched.free RPC in the sequence) could result in multiple bookings, since resources still stuck in epilog could be allocated.

Since Fluxion is doing something similar to the latter already, has anyone observed multiple bookings when job resource subsets are stuck in epilog?

milroy avatar Apr 04 '24 02:04 milroy

I think your understanding is complete. We haven't observed double bookings with Fluxion (at least not since you fixed that other bug) because flux-core doesn't do partial release now and in fact it's not allowed by RFC 21. It was prototyped in flux-framework/flux-core#5818 but I only tested with sched-simple since I knew Fluxion would not handle it.

We could define a flag that is set on the last R fragment freed for a given job if that turned out to be useful, but it sounds like you are arguing that it would not be and it was just a thought experiment?

garlick avatar Apr 04 '24 02:04 garlick

It was prototyped in https://github.com/flux-framework/flux-core/pull/5818 but I only tested with sched-simple since I knew Fluxion would not handle it.

Ok, I wasn't sure if multiple free responses with R were supported or used in any Flux deployment yet due to PR #5783.

We could define a flag that is set on the last R fragment freed for a given job if that turned out to be useful, but it sounds like you are arguing that it would not be and it was just a thought experiment?

Yeah, I started by thinking a flag or similar might be a good idea, but I ended up not being able to justify it. It was mainly a thought experiment that I posted so others could check the reasoning and add valid use cases.

milroy avatar Apr 04 '24 06:04 milroy