flux-sched
Deferred allocation
Problem: support a scheduling request for an allocation to occur at a specific time in the future.
Currently, a reservation of resources occurs as early as possible. However, to support workflows that benefit from running tasks across heterogeneous platforms, we want to synchronize multiple allocations across different child instances, such that tasks 1-10 run on corona while tasks 11-20 "simultaneously" run on another cluster managed by Flux. Supporting such use cases requires two things: the deferred allocation capability, and a means to query the allocation delay. A parent instance can query its remote child instances to find out when is the earliest by which all the children can allocate requested resources. Then, it should be possible to allocate synchronously across instances.
Pushing the reservation time back should also take backfilling into account. To be clear, this is not the same as trying to allocate at the earliest opportunity after a specific point in time. I am not entirely sure whether the existing issue #963 is the latter case or the same as this one.
A parent instance can query its remote child instances to find out when is the earliest by which all the children can allocate requested resources. Then, it should be possible to allocate synchronously across instances.
To make sure I understand the basics (without getting into too much complexity yet) of the deferred allocation capability, this is a three-part process:
- Submit the jobspecs to all child instances with a new match_reserve (jobspec) request that reserves the requested resources on each child instance at the earliest time possible and returns those times.
- Find the latest time returned (T), and for all the child instances that returned earlier times, issue a new match_reserve_at (jobspec, T) which moves the reservation back to time T.
- Handle the case where one or more children can't satisfy match_reserve_at (jobspec, T).
Is that basically correct?
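The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: `FakeChild` is a made-up stand-in for a child instance, and its `match_reserve` / `match_reserve_at` methods only mimic the proposed requests, which do not exist yet. Failure handling is reduced to returning `None`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakeChild:
    """Hypothetical stand-in for a child instance (not a Flux API)."""
    name: str
    busy_until: int                                   # earliest time it can start
    blocked: List[int] = field(default_factory=list)  # times it cannot honor
    reserved_at: Optional[int] = None

    def match_reserve(self, jobspec: str) -> int:
        # Proposed request: reserve at the earliest possible time.
        self.reserved_at = self.busy_until
        return self.reserved_at

    def match_reserve_at(self, jobspec: str, t: int) -> bool:
        # Proposed request: move the reservation back to exactly time t.
        if t < self.busy_until or t in self.blocked:
            return False
        self.reserved_at = t
        return True


def synchronize(children: List[FakeChild], jobspec: str) -> Optional[int]:
    """Return the common start time T, or None if a child cannot meet it."""
    # Step 1: reserve as early as possible everywhere and collect the times.
    earliest: Dict[str, int] = {c.name: c.match_reserve(jobspec) for c in children}
    # Step 2: the latest of those times is the common start time T.
    t = max(earliest.values())
    for child in children:
        if earliest[child.name] == t:
            continue  # already reserved at T
        # Step 3: a child that cannot move to T makes the attempt fail.
        if not child.match_reserve_at(jobspec, t):
            return None
    return t
```

A real implementation would also need to release the reservations already made when step 3 fails, and possibly retry with a later T; the sketch leaves both out.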
I've confirmed that by manipulating the at time in dfu_traverser_t::run: https://github.com/flux-framework/flux-sched/blob/35d3c96bb7378ec4f6803a7e340043fd4040e8c1/resource/traversers/dfu.cpp#L277 we can achieve the desired behavior. Here I've simulated this by hardcoding at = 3600 in dfu_traverser_t::run and performing a match allocate:
resource-query> match allocate t/data/resource/jobspecs/basics/test001.yaml
---------------core35[1:x]
------------socket1[1:x]
---------node1[1:s]
------rack0[1:s]
---tiny0[1:s]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=RESERVED
INFO: SCHEDULED AT=3600
INFO: =============================
Of course, there will be a decent amount of development required to add new match_op_t cases and determine the best way to include the desired time in the jobspec.
Of course, there will be a decent amount of development required to add new match_op_t cases
Actually that is not complicated.
and determine the best way to include the desired time in the jobspec.
As discussed with @grondo during last week's team meeting, we still need to decide how to proceed with this part. The current state of PR #1013 uses the optional system key space to let users set the deferred time. To ensure the allocation doesn't get moved up (which is undesired) or moved back for each match allocate_orelse_reserve, I added code to use a base time (deferred_from in epoch seconds) which makes deferred_start a relative time: https://github.com/flux-framework/flux-sched/blob/90f822967c30ec6d3d692e85bd39241a516c2abd/resource/traversers/dfu_impl.hpp#L104
An example test jobspec looks like this:
version: 9999
resources:
  - type: cluster
    count: 1
    with:
      - type: rack
        count: 1
        with:
          - type: node
            count: 1
            with:
              - type: slot
                count: 1
                label: default
                with:
                  - type: socket
                    count: 1
                    with:
                      - type: core
                        count: 1
# a comment
attributes:
  system:
    duration: 3600
    # optional deferred keys
    deferred_start: 1800
    deferred_from: 0
tasks:
  - command: [ "app" ]
    slot: default
    count:
      per_slot: 1
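To illustrate why a base time keeps the allocation from drifting, here is a small Python sketch. It assumes that the absolute target is simply deferred_from + deferred_start (both in seconds); that reading follows the description above but is not taken from the PR code, and the function names are hypothetical.

```python
from typing import Mapping, Optional


def deferred_target(system: Mapping[str, int]) -> Optional[int]:
    """Absolute start time in epoch seconds, or None if the job is not deferred."""
    if "deferred_start" not in system:
        return None
    # Assumption: deferred_from is an epoch base, deferred_start an offset from it.
    return system["deferred_from"] + system["deferred_start"]


def remaining_delay(system: Mapping[str, int], now: int) -> Optional[int]:
    """Seconds left until the target, as seen by a scheduling pass at time `now`."""
    target = deferred_target(system)
    if target is None:
        return None
    # The target is fixed, so later passes see a shorter delay, never a later start.
    return max(0, target - now)
```

With the attributes from the jobspec above, every scheduling pass computes the same target no matter when it runs, which is the property the base time is meant to provide.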
My sense is that while this may work well for automated submission it will be hard for manual submission. @jameshcorbett and @ryanday36 might have good input here.
The problem is that you need to be able to define those attributes without writing a yaml file every time?
We are working on a shape spec for resources - https://github.com/flux-framework/rfc/pull/371 maybe we need the same for system attributes? Ping @trws
Would the submit time (called t_submit in qmanager) work as the deferred_from value?
The problem is that you need to be able to define those attributes without writing a yaml file every time?
There is already a facility for specifying system attributes on the command line of the submission commands (see the documentation of --setattr in e.g. flux-run(1)).
Would the submit time (called t_submit in qmanager) work as the deferred_from value?
That is a great idea. I was going to suggest something similar in that t_submit could be the default if deferred_from is not set (in case allowing a different deferred_from is useful in testing?)
I think that t_submit probably makes sense for a default deferred_from value. I'm not quite clear, does the current implementation allow the user to set an absolute time, or just a relative time? It seems like the best interface for users would allow them to say something like --setattr=deferred_start=3pm or --setattr=deferred_start=+2.1h (i.e. take the same datetime formats as the current --begin-time flag).
I was also thinking more about what keyword would make sense for this. I'm leaning toward something more like 'reserve_time' or 'reserve_start', or maybe 'require_start' since it will raise an exception on the job if it can't start at that time.
The --begin-time option uses a timestamp (absolute time) which is obtained by parsing the user's argument with our Python parse_datetime() function:
--begin-time=DATETIME
Convenience option for setting a begin-time dependency for a
job. The job is guaranteed to start after the specified date
and time. If DATETIME begins with a + character, then the
remainder is considered to be an offset in Flux standard duration
(RFC 23), otherwise, any datetime expression accepted by
the Python parsedatetime module is accepted, e.g. 2021-06-21
8am, in an hour, tomorrow morning, etc.
It would be nice to support something similar here.
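As a rough sketch of what "something similar" could look like, the Python below turns a user-supplied string into an absolute epoch time. It is not the parse_datetime() function mentioned above: `resolve_start` is a hypothetical helper, the duration grammar is a simplified reading of RFC 23 (a number with an optional ms, s, m, h, or d suffix), and free-form expressions such as 3pm are deliberately left to a real datetime parser.

```python
import re
import time
from typing import Optional

# Simplified duration units; assumed to mirror Flux standard duration (RFC 23).
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}
_OFFSET = re.compile(r"^\+(\d+(?:\.\d+)?)(ms|s|m|h|d)?$")


def resolve_start(spec: str, now: Optional[float] = None) -> float:
    """Return an absolute start time in epoch seconds (hypothetical helper)."""
    now = time.time() if now is None else now
    match = _OFFSET.match(spec)
    if match:
        value, unit = match.groups()
        # A leading "+" means an offset from now; no suffix means seconds.
        return now + float(value) * _UNITS[unit or "s"]
    # Fallback for this sketch only: accept a plain epoch timestamp.
    # Anything else (e.g. "3pm") raises ValueError here.
    return float(spec)
```

Because the offset is resolved once at submission, the jobspec would carry an absolute time, so users could type relative values without dealing with timestamps or timezones themselves.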
If we can add whatever option we call this to the jobspec RFC, then perhaps it would make sense to expose this as a similar option in the submission commands?
Or, would it be too kludgy to add some kind of sentinel to --begin-time to make it set this option in jobspec instead of a dependency? (e.g. --begin-time=force:3pm) Meh, just throwing that out there. Simple enough and probably clearer to add a --require-start=3pm option. Still, if we are exposing an option in the core submission commands, we should have the resulting jobspec properties documented in the RFC.
@grondo why should we require users to figure out timestamps / timezones? Isn't it easier (or minimally should be an option) to provide relative times? E.g., what if you are doing some kind of flux proxy to an instance in a different timezone and then you get it wrong (or minimally have to convert which is a hairball I don't think we want to dive into).
A suggestion - if begin time is already a thing (and indeed it's actually a time to begin) why not have a --start that provides the same but is relative? E.g., --start=60 (start in an hour) and then I don't have to think about actual times (thank goodness!)
Reference for time pain: https://gist.github.com/timvisee/fcda9bbdff88d45cc9061606b4b923ca :timer_clock: :scream:
I'm confused. As shown above, the interface does not require users to actually specify the timestamp. The begin time can be specified as an offset or absolute time or any other format supported by parsedatetime.
Oh I see, if you add + it is an offset? Sorry I'm just really stupid.
I'll just see myself out, I'm not really helping anyone.
I think I'm having one of those days myself FWIW.
I think that t_submit probably makes sense for a default deferred_from value. I'm not quite clear, does the current implementation allow the user to set an absolute time, or just a relative time?
I didn't know about t_submit and that does sound like the right default choice.
I just realized I obfuscated a crucial detail with deferred_from: 0 in my example jobspec above. That value is the epoch time in seconds. Here's how it's used in the PR currently: https://github.com/flux-framework/flux-sched/pull/1013/commits/90f822967c30ec6d3d692e85bd39241a516c2abd.
I could certainly implement what @grondo suggested from the --begin-time option.