flux-sched
flux-sched copied to clipboard
update handling of duration=0 to match RFC 14 specification
A pending PR against RFC 14 (flux-framework/rfc#315) updates the required behavior of schedulers when jobspec duration
is unset (duration=0.
).
This new specification requires that schedulers propagate the expiration set in R upon resource acquisition into Rv1 fragments allocated to jobs when duration is not set. That is, if the jobspec does not indicate a duration, then at a minimum the currently acquired resource expiration should be propagated to the job's time limit.
Ideally, jobs requesting an explicit duration that would definitely exceed the current instance expiration should be rejected by the scheduler feasibility
RPC, so that these jobs fail at submission time when the feasibility
validator is enabled.
Ideally, jobs requesting an explicit duration that would definitely exceed the current instance expiration should be rejected by the scheduler feasibility RPC, so that these jobs fail at submission time when the feasibility validator is enabled.
This no longer needs to be handled by the scheduler once flux-framework/flux-core#4224 is merged.
The example support for propagation of instance expiration in sched-simple can be seen in flux-framework/flux-core#4223
@dongahn - this is another issue we may want to address near-term, so any pointers on where to start an implementation would be much appreciated.
Given that flux-core now has a jobspec duration validator, the only thing left to implement in Fluxion is for the rv1 emitter to propagate any expiration
in resource.R
into emitted Rv1 when the jobspec has specified duration = 0
.
Yup. I will take a look and add some thoughts.
I will look at this more carefully later but per our discussion today. This is the configuration header that I used as a stand-in: https://github.com/flux-framework/flux-sched/blob/master/resource/config/system_defaults.hpp#L18
R will be acquired by these points within sched-fluxion-resource: https://github.com/flux-framework/flux-sched/blob/master/resource/modules/resource_match.cpp#L1230 https://github.com/flux-framework/flux-sched/blob/master/resource/modules/resource_match.cpp#L1202
Then, the duration in R should be propagated to the resource infrastructure and get to https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu_impl.hpp#L75
There are a few different ways you can pass the duration to that point. One way might be extend a public method with the traverser like run
: https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu.cpp#L277
But there might be a simpler way that we can achieve this w/o changing the public method like that.
You may need a std::map
somewhere keyed by the queue name, that holds the duration per each queue as well.
BTW, will there be a case where there is no instance duration set? (Probably the case of our system instance?)
BTW, will there be a case where there is no instance duration set? (Probably the case of our system instance?)
An instance may have unlimited duration in at least the following cases:
- system instance
- instance started as a job under Flux or a foreign RM where the job has unlimited runtime
- test instance started locally with flux-start
Thanks -- these sound all compatible. Maybe this sounds like a reasonable first task for @JaeseungYeom to get his feet wet.
Hi @dongahn I added flux_log() calls for debugging to the two code points you mentioned in sched-fluxion-resource, and those functions are called after I issue "flux start" but not when I submit a job via "flux mini alloc" or "flux mini run".
My set up process was: I built and installed flux-core and flux-sched with --prefix=$fubar, set PATH=$fubar/bin/:$PATH, ran flux start -s3, then "flux module list" to confirm fluxion is listed, then flux dmesg | grep "olaf" (a string in my log messages).
Is the problem my test process? Thanks!
@ofaaland, your results might be expected if you only instrumented the resource acquire process mentioned above (resource_match.cpp
). The process is roughly:
- flux-core
resource
module "discovers" resource from parent's R assigned to job - fluxion module(s) start up and the fluxion-resource module calls the scheduler
acquire
protocol. - flux-core
resource
module responds to the acquire request with the discovered R (which should have anexpiration
set for a job with a finite duration)
In step 3, the Fluxion resource module will need to store the instance expiration (somewhere, this is where I get lost) so that the Rv1 "writer" can use this value when the job itself does not have an assigned duration. I'm less sure of where Rv1 is generated for jobs, but perhaps @milroy could offer us a pointer.
- flux-core resource module "discovers" resource from parent's R assigned to job
- fluxion module(s) start up and the fluxion-resource module calls the scheduler acquire protocol.
- flux-core resource module responds to the acquire request with the discovered R (which should have an expiration set for a job with a finite duration)
OK, I improved my instrumentation a bit using flux_log_error
. The expiration received in the response to resource.acquire is reported on stdout.
invoked | see expiration | see rsmi_init() error |
---|---|---|
flux start | yes | yes |
flux mini alloc | yes | yes |
flux mini run | no | no |
The rsmi_init() error is because I'm running on nodes without GPUs, so it's expected. But I mention it since it relates to resource discovery, and correlates with what I see in update_resource
in response to the resource.acquire.
So flux mini run
apparently does not go through this resource.acquire code path, unlike the two other cases. I'm working on figuring out why, but if you already know please enlighten me, thanks!
Resources only need to be discovered when starting a new instance. Since flux mini run
is only running a parallel job within an existing instance, not creating a new instance, resource discovery is not done. Note that even if an existing R is found for a job that is running a new instance (flux mini alloc
, flux mini batch
), the flux-core resource
module still calls into libhwloc to verify resource availability.
Resources only need to be discovered when starting a new instance.
That makes sense, thanks
@grondo: I've made progress on this issue and think I'm close to posting a PR. I have a few questions before I can put it all together.
- Does a jobspec
duration
get converted to RV1starttime
andexpiration
? If so, where does that happen in flux-core? - This
resource.acquire
line: https://github.com/flux-framework/flux-sched/blob/69d4061cee71915f08093430dd823115d22b96e2/resource/modules/resource_match.cpp#L1202 looks like it only pertains to updating an existing Fluxion instance and resource graph. Do we want toupdate
a resource graph duration? If so, it looks like that RPC won't contain any information for updating the graph duration. - Is it better to create a function overload to perform a small additional task (unpacking
starttime
andexpiration
) thus creating a lot of duplicate code, or is it better to change the function signature? In this case changing the function signature will result in an output variable that's not used by some callers.
Does a jobspec duration get converted to RV1 starttime and expiration? If so, where does that happen in flux-core?
The scheduler is responsible for adding RV1 starttime and expiration. This is done by sched-simple
in flux-core here:
static char *Rstring_create (struct simple_sched *ss,
struct rlist *l,
double now,
double timelimit)
{
char *s = NULL;
json_t *R = NULL;
if (timelimit > 0.) {
l->starttime = now;
l->expiration = now + timelimit;
}
else if (ss->rlist->expiration > 0.) {
l->starttime = now;
l->expiration = ss->rlist->expiration;
}
if ((R = rlist_to_R (l))) {
s = json_dumps (R, JSON_COMPACT);
json_decref (R);
}
return s;
}
Do we want to update a resource graph duration? If so, it looks like that RPC won't contain any information for updating the graph duration.
Yes, updated expiration is not supported in RFC 28/Resource Acquisition Protocol. That can be left as a TODO item for when we add support for extending job timelimits.
In this case changing the function signature will result in an output variable that's not used by some callers.
That is probably a style question and up to you, but I would tend to avoid duplicated code almost every time.