flux-sched icon indicating copy to clipboard operation
flux-sched copied to clipboard

update handling of duration=0 to match RFC 14 specification

Open grondo opened this issue 2 years ago • 12 comments

A pending PR against RFC 14 (flux-framework/rfc#315) updates the required behavior of schedulers when jobspec duration is unset (duration=0.).

This new specification requires that schedulers propagate the expiration set in R upon resource acquisition into Rv1 fragments allocated to jobs when duration is not set. That is, if the jobspec does not indicate a duration, then at a minimum the currently acquired resource expiration should be propagated to the job's time limit.

Ideally, jobs requesting an explicit duration that would definitely exceed the current instance expiration should be rejected by the scheduler feasibility RPC, so that these jobs fail at submission time when the feasibility validator is enabled.

grondo avatar Mar 02 '22 18:03 grondo

Ideally, jobs requesting an explicit duration that would definitely exceed the current instance expiration should be rejected by the scheduler feasibility RPC, so that these jobs fail at submission time when the feasibility validator is enabled.

This no longer needs to be handled by the scheduler once flux-framework/flux-core#4224 is merged.

The example support for propagation of instance expiration in sched-simple can be seen in flux-framework/flux-core#4223

grondo avatar Mar 16 '22 03:03 grondo

@dongahn - this is another issue we may want to address near-term, so any pointers on where to start an implementation would be much appreciated.

Given that flux-core now has a jobspec duration validator, the only thing left to implement in Fluxion is for the rv1 emitter to propagate any expiration in resource.R into emitted Rv1 when the jobspec has specified duration = 0.

grondo avatar May 19 '22 17:05 grondo

Yup. I will take a look and add some thoughts.

dongahn avatar May 19 '22 17:05 dongahn

I will look at this more carefully later but per our discussion today. This is the configuration header that I used as a stand-in: https://github.com/flux-framework/flux-sched/blob/master/resource/config/system_defaults.hpp#L18

dongahn avatar May 19 '22 19:05 dongahn

R will be acquired by these points within sched-fluxion-resource: https://github.com/flux-framework/flux-sched/blob/master/resource/modules/resource_match.cpp#L1230 https://github.com/flux-framework/flux-sched/blob/master/resource/modules/resource_match.cpp#L1202

Then, the duration in R should be propagated to the resource infrastructure and get to https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu_impl.hpp#L75

There are a few different ways you can pass the duration to that point. One way might be extend a public method with the traverser like run: https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu.cpp#L277

But there might be a simpler way that we can achieve this w/o changing the public method like that.

You may need a std::map somewhere keyed by the queue name, that holds the duration per each queue as well.

BTW, will there be a case where there is no instance duration set? (Probably the case of our system instance?)

dongahn avatar May 20 '22 05:05 dongahn

BTW, will there be a case where there is no instance duration set? (Probably the case of our system instance?)

An instance may have unlimited duration in at least the following cases:

  • system instance
  • instance started as a job under Flux or a foreign RM where the job has unlimited runtime
  • test instance started locally with flux-start

grondo avatar May 20 '22 16:05 grondo

Thanks -- these sound all compatible. Maybe this sounds like a reasonable first task for @JaeseungYeom to get his feet wet.

dongahn avatar May 20 '22 16:05 dongahn

Hi @dongahn I added flux_log() calls for debugging to the two code points you mentioned in sched-fluxion-resource, and those functions are called after I issue "flux start" but not when I submit a job via "flux mini alloc" or "flux mini run".

My set up process was: I built and installed flux-core and flux-sched with --prefix=$fubar, set PATH=$fubar/bin/:$PATH, ran flux start -s3, then "flux module list" to confirm fluxion is listed, then flux dmesg | grep "olaf" (a string in my log messages).

Is the problem my test process? Thanks!

ofaaland avatar Aug 01 '22 22:08 ofaaland

@ofaaland, your results might be expected if you only instrumented the resource acquire process mentioned above (resource_match.cpp). The process is roughly:

  1. flux-core resource module "discovers" resource from parent's R assigned to job
  2. fluxion module(s) start up and the fluxion-resource module calls the scheduler acquire protocol.
  3. flux-core resource module responds to the acquire request with the discovered R (which should have an expiration set for a job with a finite duration)

In step 3, the Fluxion resource module will need to store the instance expiration (somewhere, this is where I get lost) so that the Rv1 "writer" can use this value when the job itself does not have an assigned duration. I'm less sure of where Rv1 is generated for jobs, but perhaps @milroy could offer us a pointer.

grondo avatar Aug 01 '22 22:08 grondo

  1. flux-core resource module "discovers" resource from parent's R assigned to job
  2. fluxion module(s) start up and the fluxion-resource module calls the scheduler acquire protocol.
  3. flux-core resource module responds to the acquire request with the discovered R (which should have an expiration set for a job with a finite duration)

OK, I improved my instrumentation a bit using flux_log_error. The expiration received in the response to resource.acquire is reported on stdout.

invoked see expiration see rsmi_init() error
flux start yes yes
flux mini alloc yes yes
flux mini run no no

The rsmi_init() error is because I'm running on nodes without GPUs, so it's expected. But I mention it since it relates to resource discovery, and correlates with what I see in update_resource in response to the resource.acquire.

So flux mini run apparently does not go through this resource.acquire code path, unlike the two other cases. I'm working on figuring out why, but if you already know please enlighten me, thanks!

ofaaland avatar Aug 02 '22 21:08 ofaaland

Resources only need to be discovered when starting a new instance. Since flux mini run is only running a parallel job within an existing instance, not creating a new instance, resource discovery is not done. Note that even if an existing R is found for a job that is running a new instance (flux mini alloc, flux mini batch), the flux-core resource module still calls into libhwloc to verify resource availability.

grondo avatar Aug 02 '22 21:08 grondo

Resources only need to be discovered when starting a new instance.

That makes sense, thanks

ofaaland avatar Aug 02 '22 21:08 ofaaland

@grondo: I've made progress on this issue and think I'm close to posting a PR. I have a few questions before I can put it all together.

  1. Does a jobspec duration get converted to RV1 starttime and expiration? If so, where does that happen in flux-core?
  2. This resource.acquire line: https://github.com/flux-framework/flux-sched/blob/69d4061cee71915f08093430dd823115d22b96e2/resource/modules/resource_match.cpp#L1202 looks like it only pertains to updating an existing Fluxion instance and resource graph. Do we want to update a resource graph duration? If so, it looks like that RPC won't contain any information for updating the graph duration.
  3. Is it better to create a function overload to perform a small additional task (unpacking starttime and expiration) thus creating a lot of duplicate code, or is it better to change the function signature? In this case changing the function signature will result in an output variable that's not used by some callers.

milroy avatar Sep 15 '22 17:09 milroy

Does a jobspec duration get converted to RV1 starttime and expiration? If so, where does that happen in flux-core?

The scheduler is responsible for adding RV1 starttime and expiration. This is done by sched-simple in flux-core here:

static char *Rstring_create (struct simple_sched *ss,
                             struct rlist *l,
                             double now,
                             double timelimit)
{
    char *s = NULL;
    json_t *R = NULL;
    if (timelimit > 0.) {
        l->starttime = now;
        l->expiration = now + timelimit;
    }
    else if (ss->rlist->expiration > 0.) {
        l->starttime = now;
        l->expiration = ss->rlist->expiration;
    }
    if ((R = rlist_to_R (l))) {
        s = json_dumps (R, JSON_COMPACT);
        json_decref (R);
    }
    return s;
}

Do we want to update a resource graph duration? If so, it looks like that RPC won't contain any information for updating the graph duration.

Yes, updated expiration is not supported in RFC 28/Resource Acquisition Protocol. That can be left as a TODO item for when we add support for extending job timelimits.

In this case changing the function signature will result in an output variable that's not used by some callers.

That is probably a style question and up to you, but I would tend to avoid duplicated code almost every time.

grondo avatar Sep 15 '22 17:09 grondo