WIP: support `resource.reserve` configuration parameter
This WIP PR addresses #6315 and #5240 by adding a new config key `resource.reserve`, which allows specification of a set of cores to hold back from the scheduler, along with optional ranks on which to reserve them. Cores are specified in the form:
`cores[@ranks] [cores[@ranks]]...`
where `cores` and the optional `ranks` are idsets. If `ranks` is not specified, then the same cores are reserved on all ranks. To reserve different sets of cores on different ranks, multiple specifications may be used, separated by whitespace.
For example, `0` would withhold core 0 on all ranks, `0-1@0 0` would reserve cores 0-1 on rank 0 and core 0 on all other ranks, etc.
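For concreteness, here is a minimal sketch of what this might look like in a system instance config file. The `[resource]` table is existing flux-core configuration; the `reserve` key and its value syntax are only what this PR proposes:

```toml
# Hypothetical system config fragment using the proposed resource.reserve key.
# Reserve cores 0-1 on rank 0 and core 0 on all other ranks.
[resource]
reserve = "0-1@0 0"
```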
Marking this as a WIP because it is just one possible approach. We could also easily extend the existing `resource.exclude` config key to allow a similar syntax. There are some possible drawbacks to that approach:

- `exclude` is encoded as an idset in the `resource.status` response - we'd have to always condense the excluded resource set into an idset for backwards compatibility.
- `exclude` allows specification of a hostlist, which doesn't make as much sense for resource reservation (though it could be easily handled.)
- The common case for `resource.exclude` is perhaps different than `resource.reserve` -- e.g. reserving the same set of cores across all nodes of a job would be slightly more awkward if only `resource.exclude` syntax was supported, since specification of ranks would be obligatory (see the sketch below).
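To illustrate that last point, here is a hedged sketch contrasting how the two keys would be used. The `exclude` line reflects how `resource.exclude` works today (it names whole execution targets); the `reserve` line is only the syntax proposed in this PR:

```toml
[resource]
# Existing behavior: exclude withholds entire ranks (or hosts) from scheduling.
exclude = "0"

# Proposed behavior: reserve withholds cores; with no @ranks suffix the same
# cores are held back on every rank, so no rank list is required.
reserve = "0-1"
```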
However, it does seem awkward to have two config keys that do similar things.
The other question here is how to present the reserved resources in either flux resource status (which currently does not include sub-rank resources in default output anyway) or flux resource list.
This seems workable to me.
I can't really think of a case where a system instance would want to reserve different cores on different node types and run into awkwardness specifying those as ranks. If that ever does come up then hostlist support might be nice.
It would be good to get feedback from @ryanday36 on whether this is the correct approach for what's needed in the near term. I'm not sure if the requirement is to let users reserve cores for the system in their batch jobs, with the default being not to do so, or if the need is to reserve cores in the system instance and then give users an option to opt out.
I'd hate to add a new config key (and a whole new way to select resources) if it isn't going to end up being useful.
My understanding of what we want with the "OS" or "system" cores is that no user tasks should run on the system cores unless the user explicitly asks for them. So, this resource.reserve approach would work as long as there is also some sort of --all-resources flag to the flux run type commands that would allow users to specify that they want to run on reserved resources.
> So, this resource.reserve approach would work as long as there is also some sort of --all-resources flag to the flux run type commands that would allow users to specify that they want to run on reserved resources.
In this case the resources are excluded from the scheduler so there is no way to ask for them, except perhaps by launching a new instance (i.e. via either flux batch or flux alloc), setting the resource.rediscover option for that instance, then using flux run or flux submit in the new instance.
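A minimal sketch of that workaround, assuming `flux alloc`'s `--conf` option can set `resource.rediscover` inline for the new instance (the exact spelling of the option here is an assumption):

```sh
# Start a subinstance that rediscovers its resources via hwloc, so the
# cores withheld from the parent scheduler become schedulable inside it.
flux alloc -N1 --conf=resource.rediscover=true

# Inside the new instance, regular submission commands can then use
# the previously reserved cores:
flux run -n1 hostname
```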
@garlick's idea of a partition might be slightly more suitable, except that, AIUI, the proposed partitions are not allowed to overlap so you would not be able to run a job across all partitions. That may be easily resolved however.
In any event, I think this requirement implies that the scheduler needs to know about reserved resources specifically, so withholding them like exclude won't work.
> except perhaps by launching a new instance (i.e. via either flux batch or flux alloc), setting the resource.rediscover option for that instance, then using flux run or flux submit in the new instance.
Would that be all that bad? I think we decided the partition idea was going to take significantly more effort.
> In this case the resources are excluded from the scheduler so there is no way to ask for them, except perhaps by launching a new instance (i.e. via either flux batch or flux alloc), setting the resource.rediscover option for that instance, then using flux run or flux submit in the new instance.
Having users get a new instance with resource.rediscover would probably work. I can't actually think of a use case where a user would need some flux run invocations to see the cores and others not to, and if one did come up, it would probably be a pretty savvy user. Those users could get their instance with resource.rediscover and then manage things themselves.
Well, perhaps this is ready for a review then. Is the current interface ok? Happy to take suggestions on improvements.
Codecov Report
Attention: Patch coverage is 23.84106% with 115 lines in your changes missing coverage. Please review.
Project coverage is 79.37%. Comparing base (fb2f0ac) to head (6f88d30).
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master    #6547      +/-   ##
==========================================
- Coverage   79.47%   79.37%    -0.11%
==========================================
  Files         531      533        +2
  Lines       88433    88573      +140
==========================================
+ Hits        70282    70302       +20
- Misses      18151    18271      +120
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/cmd/flux-resource.py | 95.04% <ø> (ø) | |
| src/modules/resource/acquire.c | 60.24% <83.33%> (+0.24%) | :arrow_up: |
| src/modules/resource/resource.c | 86.36% <80.00%> (-0.17%) | :arrow_down: |
| src/common/librlist/rlist.c | 82.04% <81.81%> (+0.03%) | :arrow_up: |
| src/modules/resource/status.c | 74.09% <60.00%> (-0.33%) | :arrow_down: |
| src/modules/resource/reserve.c | 55.55% <55.55%> (ø) | |
| src/common/librlist/corespec.c | 0.00% <0.00%> (ø) | |