Gang Scheduling

Open vicaire opened this issue 7 years ago • 5 comments

Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST

Hi,

Does Argo plan to support Gang Scheduling? Let me detail what I mean by this.

Assume that someone would like to run a TensorFlow distributed training algorithm. This training algorithm needs 5 containers to be up and running before it can start.

Now imagine that various users would start 6 different instances of this workflow in parallel.

It is now possible that each workflow would start 3 containers, but none of the workflows would be able to start all 5 containers because the Kubernetes cluster would not have enough resources.

As a result, all the workflows are stuck.

Does Argo plan to provide a feature so that a workflow can start 5 containers if and only if it is able to get all five of them? Is there a way to circumvent this issue using Argo today? Maybe the workflow could release resources if it is not able to start all 5 containers within, say, 20 seconds. It would then retry later on using an exponential backoff strategy.
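To make the scenario concrete, here is a minimal sketch of the deadlock and of the all-or-nothing behaviour being asked for. It is a toy simulation, not Argo or Kubernetes code: the `Workflow` class, the two scheduling functions, and the cluster capacity of 18 container slots are assumptions chosen so that 6 workflows each end up with 3 of their 5 containers.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    needed: int       # containers required before training can start
    running: int = 0  # containers currently holding a slot

    @property
    def ready(self) -> bool:
        return self.running >= self.needed


def schedule_per_pod(workflows, capacity):
    """Hand out slots one container at a time, round-robin across workflows.

    This mimics pods being scheduled independently of each other.
    Returns the number of free slots left.
    """
    free = capacity
    progressed = True
    while free > 0 and progressed:
        progressed = False
        for wf in workflows:
            if free > 0 and wf.running < wf.needed:
                wf.running += 1
                free -= 1
                progressed = True
    return free


def schedule_gang(workflows, capacity):
    """Admit a workflow only if all of its containers fit at once.

    Returns the number of free slots left.
    """
    free = capacity
    for wf in workflows:
        if wf.running == 0 and wf.needed <= free:
            wf.running = wf.needed
            free -= wf.needed
    return free


def make_workflows():
    # Six identical training workflows, each needing 5 containers.
    return [Workflow(f"train-{i}", needed=5) for i in range(6)]


CAPACITY = 18  # hypothetical cluster size, in container slots

per_pod = make_workflows()
per_pod_free = schedule_per_pod(per_pod, CAPACITY)
per_pod_ready = sum(wf.ready for wf in per_pod)

gang = make_workflows()
gang_free = schedule_gang(gang, CAPACITY)
gang_ready = sum(wf.ready for wf in gang)

print(f"per-pod: {per_pod_ready} ready, {per_pod_free} slots free")
print(f"gang:    {gang_ready} ready, {gang_free} slots free")
```

With per-pod scheduling every workflow holds 3 slots and none can start, while the all-or-nothing variant lets 3 workflows run and leaves the other 3 waiting without holding anything.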

vicaire avatar Feb 12 '18 01:02 vicaire

There's been some discussion about building scheduling and admission control directly into the controller. I filed https://github.com/argoproj/argo/issues/740 to capture some of the previous discussion around this. Workflow scheduling is something that, up to this point, we have wanted to defer to either Kubernetes or a higher-level application, since it is a complicated feature and there can be many scheduling algorithms that suit different needs.

I think before we could even reach a point of gang scheduling, we would first need to support basic queuing and prioritization, which is being tracked in issue #740.

I also believe gang scheduling might be achievable indirectly (with minimal Argo support) by using the Kubernetes 1.9 feature for pod priority and preemption. If Argo were to support simple pass-through of the priorityClassName field from the workflow spec to the pod spec, then workflows might get gang-like scheduling for free.
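As a rough illustration of what such a pass-through could look like, the sketch below copies a `priorityClassName` value from a workflow spec into the pod spec it generates. The dict layout of the workflow spec, the `build_pod_spec` helper, and the `training-high` class name are hypothetical; only the `priorityClassName` field on the pod spec is a real Kubernetes field. This is not how the Argo controller is implemented.

```python
import copy


def build_pod_spec(workflow_spec: dict, template: dict) -> dict:
    """Build a pod spec for one template, passing the priority class through.

    `workflow_spec` and `template` are plain dicts in a hypothetical layout.
    """
    pod_spec = {
        "restartPolicy": "Never",
        "containers": [copy.deepcopy(template["container"])],
    }
    # Pass-through: only set the field when the workflow asked for it,
    # so pods otherwise fall back to the cluster's default priority.
    priority_class = workflow_spec.get("priorityClassName")
    if priority_class:
        pod_spec["priorityClassName"] = priority_class
    return pod_spec


worker_template = {
    "name": "worker",
    "container": {"name": "main", "image": "tensorflow/tensorflow"},
}
workflow_spec = {
    "priorityClassName": "training-high",  # hypothetical PriorityClass name
    "templates": [worker_template],
}

pod_spec = build_pod_spec(workflow_spec, worker_template)
default_pod_spec = build_pod_spec({"templates": [worker_template]}, worker_template)
print(pod_spec)
```

Every pod of the workflow would then carry the same priority class, so a higher-priority workflow could preempt pods of lower-priority ones. That is gang-like at best: preemption by itself does not guarantee that all pods of a workflow are placed together.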

But all of this is to say, we are a long way off from being able to implement something as advanced as gang scheduling.

jessesuen avatar Feb 14 '18 00:02 jessesuen

Got it. Thanks Jesse!

vicaire avatar Feb 14 '18 05:02 vicaire

@jessesuen are there any current plans / timelines for gang scheduling?

d4l3k avatar Jun 17 '21 17:06 d4l3k

Hey! Are there any plans for supporting gang scheduling?

valayDave avatar Dec 21 '23 08:12 valayDave

https://platformengineering.org/blog/kubernetes-1-35-10-new-alpha-features 🫢

Ryang20718 avatar Dec 05 '25 00:12 Ryang20718