Support kubeflow operator

Open xiaoxubeii opened this issue 2 years ago • 3 comments

What would you like to be added: Support kubeflow training operator.

Why is this needed: This issue tracks the status of Kueue's support for the Kubeflow training operator.

  • [ ] https://github.com/kubeflow/common/pull/196
  • [ ] #65

xiaoxubeii avatar Jul 14 '22 00:07 xiaoxubeii

Note that the latest version of MPIJob is not currently part of the training-operator: https://github.com/kubeflow/training-operator/issues/1479

alculquicondor avatar Jul 14 '22 13:07 alculquicondor

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 12 '22 13:10 k8s-triage-robot

/lifecycle frozen

alculquicondor avatar Oct 12 '22 15:10 alculquicondor

This is currently blocked on https://github.com/kubeflow/common/pull/196

alculquicondor avatar Mar 20 '23 13:03 alculquicondor

/assign

tenzen-y avatar Jul 05 '23 19:07 tenzen-y

As a first step, I opened a PR to add PyTorchJob support. After that, I will add support for the following frameworks:

  • TFJob
  • MXJob
  • XGBoostJob
  • PaddleJob

Also, I'm on the fence about whether we should support MPIJob v1, which is hosted only on kubeflow/training-operator (currently, MPIJob v2 is hosted only on kubeflow/mpi-operator).

Regarding MPIJob v1, wdyt? @alculquicondor @mimowo @kerthcet @trasc

tenzen-y avatar Jul 19 '23 07:07 tenzen-y

I'm ok leaving it out if it's not trivial to support 2 API versions. I think the CRD objects themselves are not compatible.

alculquicondor avatar Jul 19 '23 14:07 alculquicondor

> I think the CRD objects themselves are not compatible.

Right.

> I'm ok leaving it out if it's not trivial to support 2 API versions.

We cannot support both the v1 and v2 APIs with a single controller: https://github.com/kubernetes-sigs/kueue/tree/a103723023aa6c5a63cc8c1248fd38d8640d7003/pkg/controller/jobs/mpijob.

However, once we implement a separate controller for v1 like https://github.com/kubernetes-sigs/kueue/blob/3589969054023cb8b584a4639f4b9dec8c371a67/pkg/controller/jobs/kubeflow/jobs/pytorchjob/pytorchjob_controller.go, we can support v1.
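
To make the "one controller per API version" idea concrete, here is a minimal sketch. It is not Kueue's actual code: the `GVK`, `JobController`, and `Registry` types are hypothetical stand-ins, and the version strings `v1` and `v2beta1` are assumptions about how the two MPIJob APIs are versioned. It only shows that two incompatible versions of the same kind can coexist when each gets its own controller keyed by group, version, and kind.

```go
package main

import "fmt"

// GVK is a simplified, hypothetical stand-in for a Kubernetes GroupVersionKind.
type GVK struct {
	Group, Version, Kind string
}

// JobController is a hypothetical interface; real integrations are richer.
type JobController interface {
	Describe() string
}

// mpiJobV2Controller stands in for the existing MPIJob v2 integration.
type mpiJobV2Controller struct{}

func (mpiJobV2Controller) Describe() string { return "MPIJob v2 (kubeflow/mpi-operator)" }

// mpiJobV1Controller stands in for a possible separate v1 integration.
type mpiJobV1Controller struct{}

func (mpiJobV1Controller) Describe() string { return "MPIJob v1 (kubeflow/training-operator)" }

// Registry maps each API version of a kind to exactly one controller.
type Registry struct {
	controllers map[GVK]JobController
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{controllers: make(map[GVK]JobController)}
}

// Register adds a controller and rejects a second one for the same GVK.
func (r *Registry) Register(gvk GVK, c JobController) error {
	if _, exists := r.controllers[gvk]; exists {
		return fmt.Errorf("controller already registered for %v", gvk)
	}
	r.controllers[gvk] = c
	return nil
}

// Lookup returns the controller responsible for the given GVK, if any.
func (r *Registry) Lookup(gvk GVK) (JobController, bool) {
	c, ok := r.controllers[gvk]
	return c, ok
}

func main() {
	// Version strings are assumptions made for illustration.
	v1 := GVK{Group: "kubeflow.org", Version: "v1", Kind: "MPIJob"}
	v2 := GVK{Group: "kubeflow.org", Version: "v2beta1", Kind: "MPIJob"}

	r := NewRegistry()
	for gvk, c := range map[GVK]JobController{v1: mpiJobV1Controller{}, v2: mpiJobV2Controller{}} {
		if err := r.Register(gvk, c); err != nil {
			fmt.Println("register failed:", err)
		}
	}

	if c, ok := r.Lookup(v1); ok {
		fmt.Println(c.Describe())
	}
}
```

Keying on the full group/version/kind rather than the kind alone is what lets the two MPIJob versions be handled independently in this sketch.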

tenzen-y avatar Jul 19 '23 15:07 tenzen-y

Anyway, I think MPIJob v1 is a lower priority since we already support MPIJob v2.

tenzen-y avatar Jul 19 '23 15:07 tenzen-y

+1 to defer the work unless we receive strong demands.

kerthcet avatar Jul 20 '23 03:07 kerthcet

> +1 to defer the work unless we receive strong demands.

I agree.

tenzen-y avatar Jul 20 '23 05:07 tenzen-y

Tasks:

  • [x] PyTorchJob: https://github.com/kubernetes-sigs/kueue/pull/995
  • [x] TFJob: https://github.com/kubernetes-sigs/kueue/pull/1068
  • [x] XGBoostJob: https://github.com/kubernetes-sigs/kueue/pull/1114
  • [x] MXJob: https://github.com/kubernetes-sigs/kueue/pull/1183
  • [x] PaddleJob: https://github.com/kubernetes-sigs/kueue/pull/1142

tenzen-y avatar Aug 16 '23 20:08 tenzen-y