common
Proposal: add priority and queue scheduling options for the common operator
Problem
1. Currently, kube-batch has a PodGroupSpec that includes scheduling policy fields such as MinAvailable, Queue, and PriorityClassName. However, the Kubeflow operators do not pass these parameters to kube-batch today.
2. MPI-operator and tf-operator don't use the common operator, and pytorch-operator and mxnet-operator use the tf-operator/pkg/common package.
Proposed Solution
1. Add these attributes to RunPolicy.SchedulingPolicy, so that when Kubeflow is used together with kube-batch, Kubeflow can pass the parameters through to kube-batch.
// SchedulingPolicy encapsulates various scheduling policies of the distributed training
// job, for example `minAvailable` for gang-scheduling.
type SchedulingPolicy struct {
	// MinAvailable is the minimum number of pods that must be available for gang-scheduling.
	MinAvailable *int32 `json:"minAvailable,omitempty"`
	// PriorityClassName refers to a Kubernetes PriorityClass resource (kubectl get priorityclass).
	PriorityClassName *string `json:"priorityClassName,omitempty"`
	// Queue is the kube-batch queue the job should be submitted to.
	Queue *string `json:"queue,omitempty"`
}
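To illustrate how the pointer fields and `omitempty` tags behave, here is a minimal, self-contained sketch; the `marshalPolicy` helper is hypothetical and only for illustration, but it shows that unset options are dropped from the serialized spec:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SchedulingPolicy mirrors the proposed struct: pointer fields plus
// `omitempty` keep unset options out of the serialized job spec.
type SchedulingPolicy struct {
	MinAvailable      *int32  `json:"minAvailable,omitempty"`
	PriorityClassName *string `json:"priorityClassName,omitempty"`
	Queue             *string `json:"queue,omitempty"`
}

// marshalPolicy is a hypothetical helper that serializes the policy the
// way the operator would embed it in a job spec.
func marshalPolicy(p SchedulingPolicy) string {
	out, err := json.Marshal(p)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	min := int32(2)
	queue := "default"
	// PriorityClassName is left nil, so it is omitted from the output.
	fmt.Println(marshalPolicy(SchedulingPolicy{MinAvailable: &min, Queue: &queue}))
}
```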
2. Make all operators use the common operator, since tf, pytorch, and mxnet are similar. The bad news is that mpi may need more changes.
Advantages
Unify runPolicy and the imported packages across all operators.
Frameworks Support
pytorch
mxnet
mpi
tensorflow
Rough API Spec (pytorch-operator)
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  priorityClassName: high
  queue: default
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources:
                limits:
                  nvidia.com/gpu: 1
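When the operator syncs such a job, it would translate the scheduling fields into a kube-batch PodGroup. The sketch below shows one possible mapping; the `podGroupSpec` type and `toPodGroupSpec` function are simplified stand-ins for illustration, not the real kube-batch API, and the default minimum member count is assumed to come from the replica counts:

```go
package main

import "fmt"

// SchedulingPolicy as proposed for RunPolicy.
type SchedulingPolicy struct {
	MinAvailable      *int32
	PriorityClassName *string
	Queue             *string
}

// podGroupSpec is a simplified stand-in for kube-batch's PodGroupSpec.
type podGroupSpec struct {
	MinMember         int32
	Queue             string
	PriorityClassName string
}

// toPodGroupSpec is a hypothetical mapping the operator could apply when
// creating a PodGroup; defaultMinMember would be derived from the job's
// replica counts (here: 1 Master + 1 Worker = 2).
func toPodGroupSpec(p SchedulingPolicy, defaultMinMember int32) podGroupSpec {
	spec := podGroupSpec{MinMember: defaultMinMember}
	if p.MinAvailable != nil {
		spec.MinMember = *p.MinAvailable
	}
	if p.Queue != nil {
		spec.Queue = *p.Queue
	}
	if p.PriorityClassName != nil {
		spec.PriorityClassName = *p.PriorityClassName
	}
	return spec
}

func main() {
	queue := "default"
	prio := "high"
	spec := toPodGroupSpec(SchedulingPolicy{Queue: &queue, PriorityClassName: &prio}, 2)
	fmt.Printf("%+v\n", spec)
}
```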
/cc @gaocegege /cc @k82cn
/cc @richardsliu @johnugeorge @hougangliu
Thanks for the proposal!
Are we going to inline SchedulingPolicy in PyTorchJobSpec? What's the suggestion for the other operators?
Maybe all operators should add SchedulingPolicy, so we can add SchedulingPolicy to this common package.
Yeah, we should add SchedulingPolicy to common. But pytorch-operator and tf-operator do not use common now. We should re-implement the logic in these operators, too.
Yes, we should implement the logic in MXNet-Operator too.
@johnugeorge @richardsliu
Do you have any suggestion?
Retire wg-machine-learning? That would be too bad.
Nope for now, I'll help to maintain the ML WG for a while; if there are still no working items, we'll retire it :)
Hi, is there any update?
hm... are we going to do this feature?
Yes it's part of our roadmap so contribution is welcomed.
I think we can close this; see https://github.com/kubeflow/common/blob/21f5ba8833a2e21df17601497a08396c9bae9ab2/pkg/apis/common/v1/types.go#L204-L209