common
Proposal: add priority and queue scheduling options for the common operator
Problem
1. Currently, kube-batch has a PodGroupSpec that includes scheduling policy fields such as MinAvailable, Queue, and PriorityClassName. However, the Kubeflow operators do not pass these parameters to kube-batch today.
2. MPI-operator and tf-operator don't use the common operator, and pytorch-operator and mxnet-operator use the tf-operator/pkg/common package.
Proposed Solution
1. Add these attributes to RunPolicy.SchedulingPolicy, so that when Kubeflow is used together with kube-batch, Kubeflow can pass the parameters through to kube-batch.
// SchedulingPolicy encapsulates various scheduling policies of the distributed training
// job, for example `minAvailable` for gang-scheduling.
type SchedulingPolicy struct {
	// MinAvailable is the minimum number of pods that must be available for gang-scheduling.
	MinAvailable *int32 `json:"minAvailable,omitempty"`
	// PriorityClassName refers to a Kubernetes PriorityClass resource (kubectl get priorityclass).
	PriorityClassName *string `json:"priorityClassName,omitempty"`
	// Queue is the kube-batch queue the job should be submitted to.
	Queue *string `json:"queue,omitempty"`
}
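To illustrate how the pointer fields and `omitempty` tags behave, here is a minimal, self-contained sketch; the `marshalPolicy` helper is hypothetical and only for illustration, but it shows that unset options are dropped from the serialized spec:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SchedulingPolicy mirrors the proposed struct: pointer fields plus
// `omitempty` keep unset options out of the serialized job spec.
type SchedulingPolicy struct {
	MinAvailable      *int32  `json:"minAvailable,omitempty"`
	PriorityClassName *string `json:"priorityClassName,omitempty"`
	Queue             *string `json:"queue,omitempty"`
}

// marshalPolicy is a hypothetical helper that serializes the policy the
// way the operator would embed it in a job spec.
func marshalPolicy(p SchedulingPolicy) string {
	out, err := json.Marshal(p)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	min := int32(2)
	queue := "default"
	// PriorityClassName is left nil, so it is omitted from the output.
	fmt.Println(marshalPolicy(SchedulingPolicy{MinAvailable: &min, Queue: &queue}))
}
```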
2. Make all operators use the common operator, since tf, pytorch, and mxnet are similar. The bad news is that mpi may need more changes.
Advantages
Unify runPolicy and the imported packages across all operators.
Frameworks Support
pytorch
mxnet
mpi
tensorflow
Rough API Spec (pytorch-operator)
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  priorityClassName: high
  queue: default
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources:
                limits:
                  nvidia.com/gpu: 1
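When the operator syncs such a job, it would translate the scheduling fields into a kube-batch PodGroup. The sketch below shows one possible mapping; the `podGroupSpec` type and `toPodGroupSpec` function are simplified stand-ins for illustration, not the real kube-batch API, and the default minimum member count is assumed to come from the replica counts:

```go
package main

import "fmt"

// SchedulingPolicy as proposed for RunPolicy.
type SchedulingPolicy struct {
	MinAvailable      *int32
	PriorityClassName *string
	Queue             *string
}

// podGroupSpec is a simplified stand-in for kube-batch's PodGroupSpec.
type podGroupSpec struct {
	MinMember         int32
	Queue             string
	PriorityClassName string
}

// toPodGroupSpec is a hypothetical mapping the operator could apply when
// creating a PodGroup; defaultMinMember would be derived from the job's
// replica counts (here: 1 Master + 1 Worker = 2).
func toPodGroupSpec(p SchedulingPolicy, defaultMinMember int32) podGroupSpec {
	spec := podGroupSpec{MinMember: defaultMinMember}
	if p.MinAvailable != nil {
		spec.MinMember = *p.MinAvailable
	}
	if p.Queue != nil {
		spec.Queue = *p.Queue
	}
	if p.PriorityClassName != nil {
		spec.PriorityClassName = *p.PriorityClassName
	}
	return spec
}

func main() {
	queue := "default"
	prio := "high"
	spec := toPodGroupSpec(SchedulingPolicy{Queue: &queue, PriorityClassName: &prio}, 2)
	fmt.Printf("%+v\n", spec)
}
```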
/cc @gaocegege /cc @k82cn
/cc @richardsliu @johnugeorge @hougangliu
Thanks for the proposal!
Are we going to inline SchedulingPolicy in PyTorchJobSpec? What's the suggestion for the other operators?
Maybe all operators should add SchedulingPolicy, so we can add SchedulingPolicy to this common package.
Yeah, we should add SchedulingPolicy to common. But pytorch-operator and tf-operator do not use common now. We should re-implement the logic in these operators, too.
Yes, we should implement the logic in MXNet-Operator too.
@johnugeorge @richardsliu
Do you have any suggestion?
Retire wg-machine-learning? That would be too bad.
Nope for now, I'll help to maintain the ML WG for a while; if there are still no working items, we'll retire it :)
Hi, is there any update?
hm... are we going to do this feature?
Yes it's part of our roadmap so contribution is welcomed.
I think we can close this; see https://github.com/kubeflow/common/blob/21f5ba8833a2e21df17601497a08396c9bae9ab2/pkg/apis/common/v1/types.go#L204-L209