Support richer volcano scheduling
What you would like to be added?
The current training jobs, such as TFJob, do not let users set queues and priorities. This could be supported through annotations or some other mechanism.
Why is this needed?
I need this to train my data.
Love this feature?
Give it a 👍 We prioritize the features with the most 👍
Thanks for creating this @shaoqingyang! I think we already have these APIs for specifying the queue and priority when integrating with the Volcano scheduler: https://github.com/kubeflow/training-operator/blob/master/pkg/apis/kubeflow.org/v1/common_types.go#L231
Won't that work for you?
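For reference, a minimal sketch of how those fields could be set on a TFJob, assuming the operator is running with Volcano as the gang scheduler. The job name, queue name, PriorityClass name, and image are placeholders, not values from this issue:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-volcano-example          # hypothetical job name
spec:
  runPolicy:
    schedulingPolicy:
      queue: training-queue            # hypothetical Volcano queue; must already exist
      priorityClass: high-priority     # hypothetical PriorityClass; must already exist
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: example.com/my-training-image:latest   # placeholder image
```

The `queue` and `priorityClass` keys correspond to the `SchedulingPolicy` struct in the linked `common_types.go`; whether they cover the use case in this issue is for the reporter to confirm.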
cc @lowang-bh
I have a question about https://www.kubeflow.org/docs/components/training/user-guides/job-scheduling/
For Volcano and the scheduler plugins, we need to configure the training-operator with:
...
    spec:
      containers:
      - command:
        - /manager
+       - --gang-scheduler-name=volcano
        image: kubeflow/training-operator
        name: training-operator
...
But when it comes to Kueue, we only need to specify a label in the job metadata, without modifying the training-operator configuration, which is simpler and more user-friendly:
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
May I ask why we didn't implement a unified scheduling framework for these three schedulers? What prevents us from implementing such a unified scheduling framework?
Also, the current ManagedBy field in RunPolicy only supports kubeflow.org/training-operator and kueue.x-k8s.io/multikueue. This might confuse users into thinking we only support Kueue (or is that just me?).
PTAL if you have time @kubeflow/wg-training-leads
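To make the point concrete, here is a rough sketch of how that field would appear on a job, assuming the `managedBy` key under `runPolicy`. The job name and image are placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-managedby-example        # hypothetical job name
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  runPolicy:
    # Only two values are accepted today:
    #   kubeflow.org/training-operator (the operator reconciles the job itself)
    #   kueue.x-k8s.io/multikueue      (reconciliation is delegated to MultiKueue)
    managedBy: kueue.x-k8s.io/multikueue
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: example.com/my-training-image:latest   # placeholder image
```

There is no equivalent value for Volcano or the scheduler plugins here, which is what prompted the question.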
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
Actually, Kueue is a job and queue management system rather than a scheduler, and it is usually used in conjunction with kube-scheduler. So I think the Kueue integration with Kubeflow only needs a queue-related annotation.
@Monokaix Thanks for your clarification!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Still needed.
/remove-lifecycle stale
@Monokaix Will be supported in #2437
Closing this one; let's track this feature in #2437.
/close
@tenzen-y: Closing this issue.