training-operator

Support richer volcano scheduling

Open shaoqingyang opened this issue 1 year ago • 6 comments

What you would like to be added?

The current training jobs, such as TFJob, cannot set Volcano queues and priorities. This could be supported through annotations or some other mechanism.

Why is this needed?

I need this to train on my data.

Love this feature?

Give it a 👍 We prioritize the features with the most 👍

shaoqingyang avatar Jul 24 '24 03:07 shaoqingyang

Thanks for creating this @shaoqingyang! I think we already have these APIs to specify the queue and priority for integration with the Volcano scheduler: https://github.com/kubeflow/training-operator/blob/master/pkg/apis/kubeflow.org/v1/common_types.go#L231

Won't that work for you?
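
For reference, a minimal sketch of how those fields might be set on a TFJob through runPolicy.schedulingPolicy, assuming the operator is running with Volcano as the gang scheduler. The job name, queue name, PriorityClass name, and image below are hypothetical placeholders, not values from this thread:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-volcano-example          # hypothetical job name
spec:
  runPolicy:
    schedulingPolicy:
      queue: training-queue            # hypothetical Volcano queue; assumed to exist already
      priorityClass: high-priority     # hypothetical PriorityClass name
      minAvailable: 2                  # gang size: schedule only when 2 pods can be placed
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.com/my-training-image:latest   # hypothetical image
```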

cc @lowang-bh

andreyvelich avatar Jul 30 '24 17:07 andreyvelich

I have a question about https://www.kubeflow.org/docs/components/training/user-guides/job-scheduling/

For Volcano and scheduler-plugins, we need to configure the training-operator with:

...
    spec:
      containers:
        - command:
            - /manager
+           - --gang-scheduler-name=volcano
          image: kubeflow/training-operator
          name: training-operator
...

But when it comes to Kueue, we only need to specify a label in the job's metadata, without modifying the training-operator configuration, which is simpler and more user-friendly:

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
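
In context, the label sits on the job object itself. A rough sketch of a full manifest, assuming a LocalQueue named user-queue exists in the job's namespace; the job name, image, and resource requests are hypothetical placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-kueue-example            # hypothetical job name
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # queue the job is submitted to
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.com/my-training-image:latest   # hypothetical image
              resources:
                requests:              # hypothetical values; Kueue admits jobs against quota based on requests
                  cpu: "1"
                  memory: 1Gi
```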

May I ask why we didn't implement a unified scheduling framework for these three schedulers? What prevents us from implementing such a unified scheduling framework?

Also, the current ManagedBy field in RunPolicy only accepts kubeflow.org/training-operator and kueue.x-k8s.io/multikueue. This may confuse users into thinking we only support Kueue (or is that just me?).

PTAL if you have time 👀 @kubeflow/wg-training-leads

Electronic-Waste avatar Sep 30 '24 03:09 Electronic-Waste

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Dec 29 '24 05:12 github-actions[bot]

/remove-lifecycle stale

Electronic-Waste avatar Dec 29 '24 15:12 Electronic-Waste

Actually, Kueue is a job and queue management system rather than a scheduler, and it is usually used in conjunction with kube-scheduler. So for the Kueue integration with Kubeflow, I think only the queue-related label is needed.

Monokaix avatar Feb 28 '25 08:02 Monokaix

@Monokaix Thanks for your clarification!

Electronic-Waste avatar Feb 28 '25 08:02 Electronic-Waste

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar May 29 '25 10:05 github-actions[bot]

Still needed.

Monokaix avatar May 29 '25 10:05 Monokaix

/remove-lifecycle stale

@Monokaix Will be supported in #2437

Electronic-Waste avatar May 29 '25 12:05 Electronic-Waste

Closing this one; let's track this feature in #2437.

/close

tenzen-y avatar May 29 '25 13:05 tenzen-y

@tenzen-y: Closing this issue.

google-oss-prow[bot] avatar May 29 '25 13:05 google-oss-prow[bot]