Integration with Kueue
Summary
Argo Workflows needs to implement the necessary suspend mechanism to work with Kueue. See https://github.com/kubernetes-sigs/kueue/issues/74 for more details.
Message from the maintainers:
Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.
Also relevant: https://github.com/kubernetes/kubernetes/issues/121681
I read the above two issues and I'm not sure what the next step would be for Argo here.
The current Workflow `suspend` spec is very similar to Job's `suspend` spec, so if that suffices for an API, I'm not sure what other changes would be needed.
The `readinessGate` is listed as an "alternative" proposal and is still being refined upstream, so if that's not necessary, I don't quite see what's currently missing in the Argo spec. Can someone elaborate?
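To make the similarity concrete, here is a minimal sketch (not taken from either project's code) showing that the same merge patch on `spec.suspend` resumes both a Job and a Workflow. The manifests are trimmed to the relevant fields, and `merge_patch` is a simplified, hypothetical helper in the spirit of JSON merge patch, not a client library call.

```python
import copy


def merge_patch(target, patch):
    """Simplified JSON-merge-patch-style update; returns a new dict."""
    result = copy.deepcopy(target)
    for key, value in patch.items():
        if value is None:
            # A null value removes the key.
            result.pop(key, None)
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            # Nested objects are merged recursively.
            result[key] = merge_patch(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Trimmed manifests: both kinds expose a boolean spec.suspend field.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {"suspend": True, "parallelism": 2},
}
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "spec": {"suspend": True, "entrypoint": "main"},
}

# The same patch body resumes either object.
RESUME = {"spec": {"suspend": False}}

resumed_job = merge_patch(job, RESUME)
resumed_workflow = merge_patch(workflow, RESUME)
```

The patch only toggles the flag; what Kueue additionally needs (a resource estimate for the suspended unit) is not covered by this field.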
Maybe we can define a layer-level suspend mechanism (between the workflow and its steps) and estimate the total resources for the next layer. Once quota is reserved for that layer, we resume just that one layer.
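A rough sketch of that idea, assuming a "layer" means the set of steps at the same dependency depth in the DAG. The function names, the step names, and the resource units (CPU in millicores, memory in MiB) are all illustrative assumptions, not Argo or Kueue APIs.

```python
from collections import defaultdict


def group_into_layers(deps):
    """Group steps by dependency depth; deps maps step -> list of prerequisites."""
    depth = {}

    def visit(step, path=()):
        if step in path:
            raise ValueError(f"dependency cycle at {step!r}")
        if step not in depth:
            # Roots get depth 0; others sit one level below their deepest prerequisite.
            depth[step] = 1 + max(
                (visit(d, path + (step,)) for d in deps.get(step, [])),
                default=-1,
            )
        return depth[step]

    grouped = defaultdict(list)
    for step in deps:
        grouped[visit(step)].append(step)
    return [sorted(grouped[i]) for i in sorted(grouped)]


def layer_totals(deps, requests):
    """Sum the per-step resource requests for each layer."""
    totals = []
    for layer in group_into_layers(deps):
        total = defaultdict(int)
        for step in layer:
            for resource, amount in requests[step].items():
                total[resource] += amount
        totals.append(dict(total))
    return totals


def fits(quota, layer_total):
    """True if every resource the layer requests is within the available quota."""
    return all(amount <= quota.get(r, 0) for r, amount in layer_total.items())


# Hypothetical diamond-shaped workflow: a -> (b, c) -> d.
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
requests = {
    "a": {"cpu": 500, "memory": 256},
    "b": {"cpu": 1000, "memory": 512},
    "c": {"cpu": 2000, "memory": 1024},
    "d": {"cpu": 500, "memory": 256},
}
totals = layer_totals(deps, requests)
```

With these inputs the middle layer (`b` and `c`) is the largest request, so a controller following this scheme would hold the workflow suspended after `a` until that layer's total fits the reserved quota.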
I think the next step is to take a step back to understand the following:
- What are the possible integration points?
- What exactly should be suspended (the entire workflow, or a layer)?
- If layers, should we enqueue each layer only when it is ready to run, or should all layers be enqueued at the beginning but not start until their dependencies are met? This is where https://github.com/kubernetes/kubernetes/issues/121681 would help, if a layer is a k8s Job or has an equivalent representation.
Note that Kueue works best when there is a CRD that represents the unit of queueing.
> Note that Kueue works best when there is a CRD that represents the unit of queueing.
I am thinking about this too. If there is no CRD that represents the unit of queueing for every step, we may need to suspend the whole Argo workflow, which makes it hard to estimate the resources needed. I prefer to let users choose when to suspend the workflow by adding a suspend template, as they do now.
Maybe adding a property that indicates the podsets required to run after the suspend point is enough. That way, when the workflow is suspended, we can create a Workload from the required podsets recorded in the workflow status, and resume the workflow once the Workload is admitted.
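As a sketch of what that could look like: the helper below turns a list of required podsets into a Kueue `Workload` manifest and checks for admission. The `kueue.x-k8s.io/v1beta1` `Workload` kind, `spec.queueName`, `spec.podSets`, and the `Admitted` condition come from Kueue; the shape of `required_pod_sets`, the naming scheme, and the function names are assumptions made for illustration, since no such property exists in the Workflow status today.

```python
def build_workload(workflow_name, namespace, queue_name, required_pod_sets):
    """Build a Kueue Workload manifest (as a dict) for a suspended workflow.

    required_pod_sets is a hypothetical list of
    {"name": str, "count": int, "template": <PodTemplateSpec dict>} entries
    that the workflow controller would record when it hits a suspend point.
    """
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "Workload",
        "metadata": {
            # Illustrative naming scheme only.
            "name": f"workflow-{workflow_name}",
            "namespace": namespace,
        },
        "spec": {
            "queueName": queue_name,
            "podSets": [
                {"name": ps["name"], "count": ps["count"], "template": ps["template"]}
                for ps in required_pod_sets
            ],
        },
    }


def is_admitted(workload):
    """True once the Workload carries an Admitted=True condition."""
    conditions = workload.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Admitted" and c.get("status") == "True"
        for c in conditions
    )


# Hypothetical podset: three workers, each requesting 1 CPU and 1Gi of memory.
pod_sets = [
    {
        "name": "train",
        "count": 3,
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "main",
                        "image": "example/train:latest",
                        "resources": {"requests": {"cpu": "1", "memory": "1Gi"}},
                    }
                ],
                "restartPolicy": "Never",
            }
        },
    }
]
workload = build_workload("demo", "default", "user-queue", pod_sets)
```

A controller following this approach would create the Workload when the workflow suspends, watch it, and clear the suspend once `is_admitted` returns true.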
A proposal is available for review: https://github.com/kubernetes-sigs/kueue/pull/2976