feature tracking: Advanced Reservations (DATs)

This is a tracking issue for an implementation of DATs (Dedicated Application Times). The requirements as I understand them include:

  • an interface for specifying a set of resources (as a resource spec or perhaps specific resources) that are reserved for a specific user set in a specified time range. (This might be a Fluxion-only interface, since the simple scheduler has no actual schedule)

  • allow the prescribed user set to submit jobs before the reservation starts

  • [ ] flux-framework/flux-sched#963

  • [ ] flux-framework/flux-sched#1013

  • [ ] allow early interaction with instances that will be started as part of a future job (DAT) (optional)

  • [ ] #5531

  • [ ] restrict user set with access to DAT job

  • [ ] new job submission utility that allows submission of a DAT/reserved job with a deferred start time, restricted user list, etc. (strawman sketch below)

  • [ ] accounting support for DAT jobs
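
As a strawman for the submission utility bullet above: every command and option here is hypothetical (no flux reservation command exists today).

# hypothetical interface only; nothing below exists yet
flux reservation create --name=dat1 --users=alice,bob \
    --begin=2024-06-01T06:00 --duration=8h --nodes=16
flux submit --reservation=dat1 -N16 ./app    # held until dat1 starts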

grondo commented May 24 '23 15:05

Linked flux-framework/flux-sched#1013 above. According to @trws and @milroy, once that PR is merged, we will have much of the support needed in Fluxion to schedule a DAT.

grondo commented May 24 '23 22:05

Idea: add a new job state RESERVED between SCHED and RUN, such that a job request with a special attribute could get its alloc response R from the scheduler early, in advance of the starttime field in R. The job manager and the rest of flux could just treat that like any other allocation, except the job would remain in RESERVED state until starttime arrives. With R stored in the KVS, the sched.hello protocol could throw it back to the scheduler on a restart.

This would work for any job, including a sub-instance.

An advance static R would be more susceptible to having resources go bad before the job starts. With a flux instance, we could initially just set the quorum value to some fraction of the total and let the instance start with some non-critical nodes down.
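
To make the proposed flow concrete, here is a sketch using today's CLI. The --begin-time option, flux job wait-event, and flux job info all exist; the system.reserve attribute and the RESERVED state itself are hypothetical:

# request resources for a future start (hypothetical reserve attribute)
jobid=$(flux submit -N16 -t 8h --begin-time=2024-06-01T06:00 \
    --setattr=system.reserve=1 flux broker)
# under the proposal, the scheduler answers the alloc request early,
# so R would appear in the KVS well before starttime:
flux job wait-event $jobid alloc
flux job info $jobid R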

garlick commented Jan 30 '24 03:01

For clarity, is the benefit of having the RESERVED state and advance R available so that a subinstance could be configured with the eventual resources assigned to the job? Would this also have some benefit for normal jobs? Adding a new state just for that purpose feels like it could be short-sighted (though I'm probably missing the other benefits!), especially if we plan one day to support instances that can grow onto unknown resources instead of just known resources.

Another thing I'll just throw out there: there is already a way to hold a job between SCHED and RUN by issuing a prolog-start event. Perhaps requiring R to be known beforehand could be a special case of the reservation alloc request (e.g. a reservation can include a hostlist or a node count, as in Slurm) to satisfy the DAT-as-an-instance use case. A jobtap plugin could prevent the transition to the RUN state, and could perhaps even handle startup of a single-broker instance with no resources, configured to use FLUB.
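
For reference, jobtap plugins can be loaded at runtime with the existing flux jobtap command; the dat-hold plugin named in this sketch is hypothetical:

# load a hypothetical plugin that emits prolog-start after alloc,
# holding the job short of RUN until the reservation's start time:
flux jobtap load ./dat-hold.so
flux jobtap list    # confirm it is active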

grondo commented Jan 30 '24 15:01

I guess the main appeal to me is that we wouldn't need a separate set of tools and rules for reservations like Slurm's. A job request would be sufficient to request a reservation, the existing scheduler interfaces would be sufficient to communicate the results, and existing tools could be used to view/update reservations (since they are just jobs). In a way, regular jobs are then just a degenerate case of a reservation, where the time spent in RESERVED is very short, so the plan doesn't introduce niche features that would get less testing than mainstream ones.

But yeah it builds upon the existing resource model which is fundamentally static. However, as we add dynamic resource capability to flux, this could grow too. For example, maybe a job could request to start as soon as an initial resource request can be fulfilled, and also hold a reservation that would be added to the job later? Maybe we could also add a way for the scheduler to modify an already allocated R, such as replacing nodes that are no longer available, and we could make that work the same for running and reserved jobs.

Anyway I'm not hard over on this idea - just throwing it out there to see if it sticks. Sounds like it's sliding down the wall a bit :-)

garlick commented Jan 30 '24 16:01

No, this is sounding appealing to me, but I'm afraid I still don't follow some points:

I guess the main appeal to me is that we wouldn't need a separate set of tools and rules for reservations like Slurm's.

I like this idea, but unfortunately don't have the mental capacity today to follow the reasoning. How would a reservation be requested? Would we just add a field to jobspec with an enforced start and end time, and only satisfy these requests from the instance owner? If a reservation is just a job that hasn't yet started, how would multiple jobs be submitted to a job in RESERVED state? It seems like these actions would require separate tools that we don't already have anyway.

In a way, regular jobs are then just a degenerate case of a reservation, where the time spent in RESERVED is very short, so the plan doesn't introduce niche features that would get less testing than mainstream ones.

Ah, this is a good point. I had missed that all jobs would go through RESERVED (I had envisioned it as a one-off state). I do like this idea.

For example, maybe a job could request to start as soon as an initial resource request can be fulfilled, and also hold a reservation that would be added to the job later? Maybe we could also add a way for the scheduler to modify an already allocated R, such as replacing nodes that are no longer available, and we could make that work the same for running and reserved jobs.

I think this is the general case of grow/shrink we've discussed before, and it doesn't seem like a RESERVED state is necessary to make it happen (at least we've never discussed it in that way). It seems like we were headed towards using resource-update events to manage that (we can already update R using this approach).

grondo commented Jan 30 '24 17:01

I didn't really say this clearly, but yes, I was thinking some new jobspec attributes would be the way a job would request "reserved" resources. We already have a duration, so maybe attributes for a start time, plus flags indicating whether the start time is absolute or best effort, what to do if resources become unavailable before the start time, etc.
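
A sketch of what those attributes might look like at submission time; --setattr and the duration option are real, while the reservation keys and values are invented for illustration:

flux submit -N8 -t 4h \
    --setattr=system.reservation.begin-time=2024-06-01T06:00 \
    --setattr=system.reservation.start=absolute \
    --setattr=system.reservation.on-resource-loss=replace \
    ./app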

If a reservation is just a job that hasn't yet started, how would multiple jobs be submitted to a job in RESERVED state? It seems like these actions would require separate tools that we don't already have anyway.

I was thinking in that case the RESERVED job would be a subinstance, but for now it would only accept jobs after it starts. Hmm, maybe that's a stronger requirement than I thought.

I think this is the general case of grow/shrink we've discussed before, and it doesn't seem like a RESERVED state is necessary to make it happen (at least we've never discussed it in that way). It seems like we were headed towards using resource-update events to manage that (we can already update R using this approach).

I just meant that jobs with reserved resource allocations could benefit in a general way from grow, not necessarily help us get there.

garlick commented Jan 30 '24 18:01

I was thinking in that case the RESERVED job would be a subinstance, but for now it would only accept jobs after it starts. Hmm, maybe that's a stronger requirement than I thought.

Ah, I see. Forgive me, but do we need a separate state to handle this case then? For the purposes of all other tools, the job would effectively be pending. I guess Flux could start a single-rank instance (with the sole initial online rank excluded) to handle early job submission, but in principle that doesn't seem to require a new state. I worry that if the R for a reserved allocation is constantly evolving, it would create a lot of traffic in the eventlog, whereas if we just keep the job in SCHED state until the allocation is granted, we can just emit the actual R.

I really apologize because I feel like I'm missing the piece of the design that requires a new state. I am sure it is my fault and not yours.

grondo commented Jan 30 '24 19:01

If we had a "reserved" state, possibly with either soft or hard semantics, we might also be able to use that to show it has been given a prospective start time by the scheduler. This is a bit of an idle thought while I'm in an OpenMP meeting, so it might not match super well, but if we could get both a nicer interface for DATs and have a way to surface predicted starts for jobs other than the next that would make users happy.

trws commented Feb 01 '24 18:02

Is this necessary if flux-framework/flux-sched#1015 is fixed? We already have ephemeral "annotations" that can communicate this kind of data, which could change with each schedule update, without potentially filling the eventlog with events.

OTOH, with the estimated starttime and resources available for every job in the scheduler's plan, we could expose that plan via some kind of visualization (kind of like OAR's Gantt drawing tool). Does even this, though, require a new job state? Could the planned resources for jobs be exposed in some other manner that doesn't require writing data to the KVS and an eventlog each time they change? (Just throwing that question out there; I don't really know the answer.)
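
For reference, the annotation route can be seen today when the scheduler provides an estimate; assuming Fluxion has set its sched.t_estimate annotation, something like:

# ephemeral scheduler annotations, no eventlog writes involved:
flux jobs -a -o "{id.f58} {name} {annotations.sched.t_estimate}"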

grondo commented Feb 01 '24 20:02

Also, would a RESERVED state require a transition back to SCHED, e.g. if a new higher-priority job is submitted, changing the schedule such that a RESERVED job no longer has any reserved resources in the current plan?

grondo commented Feb 01 '24 20:02

A note from the meeting: being able to submit to a DAT/reservation before its starttime is an optional requirement for a minimum viable solution. I take that to mean we can fulfill the minimum viable requirements by being able to submit a job request that is guaranteed to be fulfilled at some point in the future, with a way to launch a multi-user instance on those resources once allocated, including a way to restrict the set of users allowed to submit to that instance.

Assuming this is correct, I'll update the bullet list above with some missing items. I don't think this solution requires a new job state and all the changes that would come with it?

grondo commented Feb 02 '24 16:02

I'd say let's hit the reset button on this discussion and start from the requirements. IOW, let's drop the idea of a RESERVED state, and also of "regular jobs" having reservations, and see what else we can come up with. If we need those ideas we can come back to them.

garlick commented Feb 03 '24 02:02

On user restrictions: only the system instance currently loads the mf_priority plugin from flux-accounting, so we should think about how we would restrict users in a multi-user subinstance.

A related question is whether we worry about proper accounting for users within that subinstance.

In RFC 33 we did define an access policy, so if we didn't want to load mf_priority in a subinstance, we could potentially generate a list of allowed users and pass it down in the subinstance policy config. (I think the access controls are not implemented yet but that would be trivial).
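
As a sketch of passing that down, assuming the RFC 33 [policy.access] table (and the not-yet-implemented enforcement); the user names, group name, and file name are just examples:

# generate an allow list and hand it to the subinstance via config:
cat >dat-policy.toml <<EOF
[policy.access]
allow-user = [ "alice", "bob" ]
allow-group = [ "dat-team" ]
EOF
flux batch --conf=dat-policy.toml -N16 -t 8h --wrap sleep inf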

garlick commented Feb 04 '24 16:02

In RFC 33 we did define an access policy, so if we didn't want to load mf_priority in a subinstance, we could potentially generate a list of allowed users and pass it down in the subinstance policy config. (I think the access controls are not implemented yet but that would be trivial).

Loading mf_priority in a subinstance seems like it would be a challenge. It currently assumes the flux-accounting service is loaded on rank 0 and that rank 0 is on the same node as the accounting database, and it would be trying to do system-wide fair share on a portion of granted resources instead of fair share within the DAT job itself (if that is even a thing). Also, we probably want DATs/reservations to work without requiring flux-accounting, so I like the idea of access controls implemented by config stashed in the job's jobspec.

I'm also not sure how accounting for a subinstance would work. The subinstance jobs would not be going to the job archive or accounting archive, so we'd need some way to attribute usage, perhaps in an epilog or rc3 script when the DAT job is exiting? @ryanday36 - I assume we do currently account for jobs in DATs and reservations since Slurm only has one level of scheduling?

grondo commented Feb 05 '24 17:02

That's correct. We do want to charge DAT usage to the users' bank(s).

ryanday36 commented Feb 05 '24 17:02

Is a DAT currently represented as a queue, such that normal user jobs in that queue are accounted individually, or as a single job that runs many job steps, where only that job is actually accounted?

trws commented Feb 05 '24 17:02

For reference, here is a snippet of how Slurm accounts for reservations:

Jobs executed within a reservation are accounted for using the appropriate user and bank account. If resources within a reservation are not used, those resources will be accounted for as being used by all users or bank accounts associated with the reservation on an equal basis (e.g. if two users are eligible to use a reservation and neither does, each user will be reported to have used half of the reserved resources).

https://slurm.schedmd.com/reservations.html#account

grondo commented Feb 05 '24 18:02

That raises a question for me. How often do we run into a DAT that is composed of multiple banks, rather than a single bank for the DAT? I admit I'd conceived of a DAT as being a charged entity in and of itself, which would be charged at that level rather than the usage cost falling directly on the users who submitted work to it.

trws commented Feb 05 '24 18:02

Good question @trws. And if we need to use a bank/account to control access to a DAT job, then we would need some way to create the access control list from the bank when the job is started, or to extend the mf_priority plugin to support running in a subinstance. (Note also that the mf_priority plugin would only restrict the users who can submit a job, not the users who can use other instance services.)

grondo commented Feb 05 '24 18:02

An alternate solution for DATs was proposed in today's meeting.

The proposal AIUI was to offer an initially empty queue for use by DATs, then have the reservation assign resources to the queue dynamically by moving resource properties (rough sketch after the open questions below).

Using a queue in the system instance instead of a subinstance removes several issues noted above:

  • accounting will just work
  • users can submit jobs to the queue at any time (an adjustment to the feasibility check may be required)
  • ability to dynamically reassign resource properties already exists in Fluxion and is feasible in core

Open questions:

  • How do we ensure queue configuration is propagated to other brokers in the system? E.g., the validator/frobnicator may use local broker configuration to update/validate a submitted job's queue.
  • I'm not sure I caught the mechanism by which a reservation in the scheduler would be tied to a queue config in core.
  • It sounded preferable to have tools to kick this process off, rather than requiring a config update, config reload, etc. IOW, DATs aren't configuration, so scheduling one shouldn't require modifying the broker config.
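
Rough shape of the queue side, using the existing [queues] config table. The tooling that would move the "dat" property onto reserved nodes (and inform Fluxion) doesn't exist yet, which is the point of the last bullet:

# a normally-empty queue keyed to a resource property (requires root
# on a system instance; config-file route shown only because it's what
# exists today):
cat >/etc/flux/system/conf.d/dat.toml <<EOF
[queues.dat]
requires = [ "dat" ]
EOF
flux config reload
# hypothetical future tooling, per the last bullet:
# flux reservation activate dat --nodes=node[0-15]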

grondo commented Dec 18 '24 23:12

For the record, @kkier voiced the opinion in that meeting that when flux restarts, any dynamically created queues should disappear, as anything else would seem surprising. Not sure how that would affect the design above.

garlick commented Jul 27 '25 19:07

That may need more thought. Should that also happen during an unintended restart, or a restart due to an upgrade?

grondo commented Jul 28 '25 15:07

Yes. Or flux config reload.

garlick commented Jul 29 '25 03:07

To be clear, I'm 100% open to being convinced otherwise. My thinking is that I want to be able to look at the config files for a service and predict the configuration that service will be in when it starts up. The mental model that came to mind was iptables: you can make whatever changes you want, but unless you e.g. iptables-save >> foo.txt and then load that file on boot, your changes will go poof when you restart the process.

We don't currently (AFAIK) have a way to dump configs from a running instance for redirection to a file, correct?

kkier commented Jul 29 '25 14:07

I was assuming a dynamically created queue would be "runtime state", not an actual change to the config.

If we persisted that particular runtime state in the KVS, then at startup we could first apply the configuration, then add the saved runtime state to get back where we were. I thought your point was that if you set queues in the config, then dammit, those are the queues you want to see at runtime after a restart, and no others :-)
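
A sketch of that split, using the KVS for the runtime half (the key layout here is invented):

# persist a dynamically added queue as runtime state:
flux kvs put queues.dynamic.dat='{"requires": ["dat"]}'
# after a restart: apply the static config first, then replay saved state:
flux kvs get queues.dynamic.dat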

garlick commented Jul 29 '25 16:07

I think that we would want dynamically created queues to persist through restarts/reconfigs if we're using them for DATs. In that context, they're more like jobs than configuration. I think that's the same as what Jim is saying about runtime state, but I wanted to weigh in too :)

ryanday36 commented Jul 29 '25 17:07

FWIW, Adam agrees with @ryanday36 and I'm convinced: if we want jobs to persist across restarts (we do), it makes sense for queues as well. It means we'd need some kind of reload-clean/restart-clean functionality to purposely reload from just the config files and ignore the dynamic changes.

kkier commented Jul 29 '25 21:07

Maybe it would be enough to just have

flux queue add
flux queue remove

where the remove command only works on queues that got there via flux queue add.

garlick commented Jul 29 '25 21:07

We might also need a flux queue edit or something along those lines, so that we can take nodes out of, for example, the batch queue when we create a new queue. Or maybe that could just be a flag on the flux queue add command (flux queue add --stealnodes ...).

ryanday36 commented Jul 29 '25 22:07