flux-core

[discussion] grow support

Open dongahn opened this issue 5 years ago • 5 comments

Spins off from flux-framework/rfc#230. There is an idea of adding an additional R to the JOBID schema to support "grow" in the same way as "shrink".

A key question:

Then the complementary "grow" directory could contain chunks that are added. While we are here, maybe we can also hash this out a bit as this is what @milroy will soon need.

I don't think adding an additional R is difficult. What is currently difficult is how to do this under the original JOBID. In particular, flux job submit will always generate a new JOBID. Do you think there is an easy path to generate a new R under the same JOBID using the flux job submit|flux mini interface?

dongahn, Mar 01 '20 18:03

When we discussed this before, I think a new "grow" interface was proposed but maybe that is out of scope for RFC 20?

garlick, Mar 01 '20 21:03

Yeah, something like flux grow or similar could be a solution. I will move this to a flux-core issue.

dongahn, Mar 02 '20 16:03

Here's a summary of the coffee hour discussion yesterday with @dongahn, @garlick, @grondo, and @SteVwonder. Please add details I left out here.

While having the ability to unify an existing R and a new R' under the same JOBID is desirable, this can't be accomplished in the near term. @garlick noted that it will be particularly tricky to wire new subtrees into an existing TBON. Since the immediate application for the grow functionality is to test Kubernetes jobs, bypassing the broker network issue with upcoming broker-less functionality was discussed. That capability won't be ready in the short term. @dongahn proposed the idea of a grow operation as a checkpoint restart (which is needed anyway in flux-sched), where the current job is checkpointed and then restarted with additional resources (along the lines of https://github.com/flux-framework/flux-sched/issues/470). This may be useful for future grow investigation.

Two related interim solutions were proposed that satisfy grow through a new job submission (and a new JOBID). The first is to use the URI of the new job to exec Kubernetes commands on the new job's resources, and the other is simply to put those commands into a job script. With appropriate labels and scale ranges, Kubernetes should handle starting containers on the new resources. One item to keep in mind is that the new resources should have an end time equal to that of the original job. That will prevent users from having interminable Kubernetes jobs by packing the head of the flux-sched queue with grow requests.
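As a rough illustration of the end-time constraint, here is a minimal Python sketch that computes how long a grow job may run so that it finishes no later than the original job. The function name `grow_job_duration` and its parameters are hypothetical and not part of any Flux API; how the resulting value would be passed to the new submission is not specified in this discussion.

```python
import math


def grow_job_duration(orig_start: float, orig_duration: float, now: float) -> int:
    """Return the whole seconds a grow job may run without outliving the original job.

    All arguments are in seconds (hypothetical inputs): the original job's
    start time, its duration, and the time at which the grow job is submitted.
    """
    orig_end = orig_start + orig_duration
    remaining = orig_end - now
    if remaining <= 0:
        # The original job is already at or past its end time, so a grow
        # request must not be granted any additional run time.
        raise ValueError("original job has already reached its end time")
    # Round down so the grow job never ends after the original job.
    return math.floor(remaining)


# Original job started at t=1000 s with a 3600 s limit; grow requested at t=1600.5 s.
limit = grow_job_duration(1000.0, 3600.0, 1600.5)
```

Capping each grow request at the original job's remaining time is what keeps a chain of grow submissions from extending the overall allocation indefinitely.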

milroy, Mar 05 '20 07:03

This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 14 days. Thank you for your contributions.

stale[bot], Mar 05 '21 20:03

Is this still under discussion for our elasticity work? I remember more recent discussion about grow, but in the context of flux-sched. A linked issue (still open) for flux-core is https://github.com/flux-framework/flux-core/issues/2802. For flux-sched, for issues/PRs I'm finding:

  • https://github.com/flux-framework/flux-sched/pull/989 (this is the one I remember from this year)
  • https://github.com/flux-framework/flux-sched/issues/767

And then there are a bunch with an elasticity label, but I think they are a few years old: https://github.com/flux-framework/flux-sched/issues?q=label%3Aelasticity+. I'm not up to date on anything really (and apologies for that, but I know this is important), so let's make sure we sync all of these into some cohesive next step(s) if it's still important. I'd like to understand, for example, how the recent flux-sched 989 might help with #2802.

vsoch, Dec 18 '23 03:12