
Proposal: add priority and queue in scheduling for the common operator

Open YesterdayxD opened this issue 6 years ago • 14 comments

Problem

1. Currently, kube-batch has a PodGroupSpec that includes scheduling-policy fields, for example MinAvailable, Queue, and PriorityClassName. But the Kubeflow operators do not provide these parameters to kube-batch yet.

2. MPI-operator and tf-operator do not use the common operator, while pytorch-operator and mxnet-operator use the tf-operator/pkg/common package.

Proposed Solution

1. Add these attributes to the type RunPolicy.SchedulingPolicy. When Kubeflow is used together with kube-batch, Kubeflow can then pass the parameters to kube-batch.

// SchedulingPolicy encapsulates various scheduling policies of the distributed training
// job, for example `minAvailable` for gang-scheduling.
type SchedulingPolicy struct {
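    // MinAvailable is the minimum number of pods needed for gang-scheduling.
    MinAvailable *int32 `json:"minAvailable,omitempty"`

    // PriorityClassName is the name of a Kubernetes PriorityClass resource
    // (see `kubectl get priorityclass`).
    PriorityClassName *string `json:"priorityClassName,omitempty"`

    // Queue is the name of the queue passed to kube-batch.
    Queue *string `json:"queue,omitempty"`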
}

2. Have all operators use the common operator, since the TensorFlow, PyTorch, and MXNet operators are similar. The bad news is that the MPI operator may need more changes.
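As a minimal sketch of item 1, the following shows how an operator might copy the optional SchedulingPolicy fields into a pod group spec before handing it to kube-batch. `PodGroupSpec` here is a local stand-in that only uses the field names mentioned in this issue, not the real kube-batch type, and `toPodGroupSpec` together with its fallback to the total replica count are hypothetical choices made for illustration.

```go
package main

import "fmt"

// SchedulingPolicy repeats the struct proposed in this issue.
type SchedulingPolicy struct {
	MinAvailable      *int32  `json:"minAvailable,omitempty"`
	PriorityClassName *string `json:"priorityClassName,omitempty"`
	Queue             *string `json:"queue,omitempty"`
}

// PodGroupSpec is a local stand-in (assumption), NOT the real kube-batch type.
// It carries only the fields this issue names.
type PodGroupSpec struct {
	MinAvailable      int32
	Queue             string
	PriorityClassName string
}

// toPodGroupSpec is a hypothetical helper. It copies every field the user set
// and, as an assumption, falls back to totalReplicas when MinAvailable is unset.
func toPodGroupSpec(policy *SchedulingPolicy, totalReplicas int32) PodGroupSpec {
	spec := PodGroupSpec{MinAvailable: totalReplicas}
	if policy == nil {
		return spec
	}
	if policy.MinAvailable != nil {
		spec.MinAvailable = *policy.MinAvailable
	}
	if policy.Queue != nil {
		spec.Queue = *policy.Queue
	}
	if policy.PriorityClassName != nil {
		spec.PriorityClassName = *policy.PriorityClassName
	}
	return spec
}

func main() {
	queue, priority := "default", "high"
	policy := &SchedulingPolicy{Queue: &queue, PriorityClassName: &priority}
	fmt.Printf("%+v\n", toPodGroupSpec(policy, 2))
}
```

Pointer fields let the helper distinguish "not set" from an empty value, so unset fields leave the scheduler's defaults untouched.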

Advantages

Unify runPolicy and the imported packages across all operators.

Frameworks Support

pytorch

mxnet

mpi

tensorflow

Rough API Spec(pytorch-operator)

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  priorityClassName: high
  queue: default
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers: 
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1
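The rough spec above places priorityClassName and queue directly under spec. One way to get that shape, sketched here under assumptions, is to embed SchedulingPolicy in the job spec so that Go's encoding/json flattens its fields. `JobSpec` and `decodeSpec` are hypothetical names, the replica specs are omitted, and JSON is used instead of YAML only to stay within the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SchedulingPolicy repeats the struct proposed in this issue.
type SchedulingPolicy struct {
	MinAvailable      *int32  `json:"minAvailable,omitempty"`
	PriorityClassName *string `json:"priorityClassName,omitempty"`
	Queue             *string `json:"queue,omitempty"`
}

// JobSpec is a hypothetical, trimmed-down job spec (assumption).
// Embedding without a JSON tag makes encoding/json treat the inner
// fields as if they were declared directly on JobSpec.
type JobSpec struct {
	SchedulingPolicy
}

// decodeSpec is a hypothetical helper that parses the JSON form of a spec.
func decodeSpec(data []byte) (JobSpec, error) {
	var spec JobSpec
	err := json.Unmarshal(data, &spec)
	return spec, err
}

func main() {
	spec, err := decodeSpec([]byte(`{"priorityClassName": "high", "queue": "default"}`))
	if err != nil {
		panic(err)
	}
	// MinAvailable was not provided, so it stays nil.
	fmt.Println(*spec.PriorityClassName, *spec.Queue, spec.MinAvailable == nil)
}
```

The alternative is a named field such as a nested schedulingPolicy key, which keeps the scheduling options grouped but changes the YAML layout shown above.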

YesterdayxD avatar Aug 26 '19 01:08 YesterdayxD

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.91. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

issue-label-bot[bot] avatar Aug 26 '19 01:08 issue-label-bot[bot]

/cc @gaocegege /cc @k82cn

YesterdayxD avatar Aug 27 '19 07:08 YesterdayxD

/cc @richardsliu @johnugeorge @hougangliu

Thanks for the proposal!

gaocegege avatar Aug 28 '19 01:08 gaocegege

Are we going to inline SchedulingPolicy in PyTorchJobSpec? What's the suggestion for the other operators?

k82cn avatar Aug 30 '19 01:08 k82cn

Maybe all operators should add SchedulingPolicy, so we can add SchedulingPolicy to this common package.

davidstack avatar Sep 02 '19 08:09 davidstack

Yeah, we should add SchedulingPolicy to common. But pytorch-operator and tf-operator do not use common now. We should re-implement the logic in these operators, too.

gaocegege avatar Sep 02 '19 08:09 gaocegege

Yes, we should implement the logic in MXNet-Operator too.

4everming avatar Sep 04 '19 06:09 4everming

@johnugeorge @richardsliu

Do you have any suggestions?

gaocegege avatar Sep 09 '19 10:09 gaocegege

retire wg-machine-learning? It is so bad.

YesterdayxD avatar Oct 21 '19 01:10 YesterdayxD

retire wg-machine-learning? It is so bad.

Nope, not for now. I'll help maintain the ML-WG for a while; if there are still no working items, we'll retire it :)

k82cn avatar Oct 21 '19 07:10 k82cn

Hi, is there any update?

gaocegege avatar Oct 29 '19 09:10 gaocegege

hm... are we going to do this feature?

k82cn avatar Apr 15 '20 06:04 k82cn

Yes, it's part of our roadmap, so contributions are welcome.

terrytangyuan avatar Apr 15 '20 13:04 terrytangyuan

I think we can close this; refer to https://github.com/kubeflow/common/blob/21f5ba8833a2e21df17601497a08396c9bae9ab2/pkg/apis/common/v1/types.go#L204-L209

kerthcet avatar Jun 14 '22 09:06 kerthcet