
Introduce batch/v1 Job with Indexed completion mode

Open tenzen-y opened this issue 2 years ago • 25 comments

/kind discussion

We often implement features similar to batch/v1 Job (e.g., https://github.com/kubeflow/common/pull/196) because the training operator creates a set of Pods + Services for each rank, rather than a batch/v1 Job + Service, once a custom Job resource (e.g., TFJob) is created.

IIUC, the training operator was designed this way because its core architecture was built before the Indexed Job and Pod failure policy features were released.

So I would like to propose an architecture in which the training operator creates a batch/v1 Job with Indexed completion mode + Service instead of Pods + Services.

Introducing batch/v1 Job eliminates the need to implement and maintain features similar to batch/v1 Job, and makes it easy to adopt new batch/v1 Job features.
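To make the proposed shape concrete, here is a rough sketch (the names, counts, and image are hypothetical, not a finalized design): one Indexed Job per replica type, paired with a headless Service so each completion index gets a stable DNS name:

```yaml
# Hypothetical sketch of the proposed shape: an Indexed Job replaces the
# per-rank Pods, and a headless Service provides stable per-index DNS names
# (e.g., tfjob-example-worker-0.tfjob-example-worker).
apiVersion: batch/v1
kind: Job
metadata:
  name: tfjob-example-worker        # hypothetical name
spec:
  completionMode: Indexed           # each Pod gets a stable JOB_COMPLETION_INDEX
  completions: 4
  parallelism: 4
  template:
    spec:
      subdomain: tfjob-example-worker   # must match the Service name below
      containers:
      - name: tensorflow
        image: example.com/training-image:latest   # hypothetical image
      restartPolicy: Never
---
apiVersion: v1
kind: Service
metadata:
  name: tfjob-example-worker
spec:
  clusterIP: None                   # headless: DNS records only, no load balancing
  selector:
    job-name: tfjob-example-worker  # label set automatically by the Job controller
```

This is the standard "Job with Pod-to-Pod Communication" pattern from the Kubernetes docs, shown here only to illustrate what the operator would generate.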

/cc @kubeflow/wg-training-leads

tenzen-y avatar Jan 10 '23 19:01 tenzen-y

Interesting. This would need a major rewrite?

/cc @gaocegege @zw0610 @terrytangyuan

johnugeorge avatar Jan 12 '23 06:01 johnugeorge

> This would need a major rewrite?

Probably, yes. By replacing Pods with batch/v1 Indexed Jobs, we can use the Job with Pod-to-Pod Communication logic.

Also, we could probably remove most of the kubeflow/common code and use the batch/v1 Job logic instead.

For example, we can replace ReplicaSpec.RestartPolicy with PodFailurePolicy:

https://github.com/kubeflow/common/blob/34276e9d2ffa39f5922479bff87dc5ed5ed94cfb/pkg/apis/common/v1/types.go#L79-L83

https://github.com/kubernetes/kubernetes/blob/c9ed04762f94a319d7b1fb718dc345491a32bea6/pkg/apis/batch/types.go#L220-L229

This means that by replacing it, we no longer need to maintain the logic that decides whether to restart Pods.
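For illustration only (the exit codes and rules below are hypothetical, not a proposed mapping), ExitCode-style restart semantics could translate into a podFailurePolicy along these lines:

```yaml
# Hypothetical sketch: ExitCode-style restart semantics expressed as a
# batch/v1 Job podFailurePolicy instead of operator-side restart logic.
apiVersion: batch/v1
kind: Job
metadata:
  name: tfjob-example-worker       # hypothetical name
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    # Exit codes outside the hypothetical retryable set fail the whole Job.
    - action: FailJob
      onExitCodes:
        operator: NotIn
        values: [1, 2, 130]        # hypothetical "retryable" exit codes
    # Disruptions (preemption, node drain) do not count against backoffLimit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never         # required when podFailurePolicy is set
      containers:
      - name: tensorflow
        image: example.com/training-image:latest   # hypothetical image
```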

tenzen-y avatar Jan 12 '23 07:01 tenzen-y

/cc @alculquicondor

gaocegege avatar Jan 12 '23 07:01 gaocegege

IIUC, the Indexed Job feature targets tf-operator, mpi-operator, and more.

https://github.com/kubernetes/kubernetes/issues/99497#issuecomment-786776056

tenzen-y avatar Jan 12 '23 07:01 tenzen-y

> Introducing batch/v1 Job eliminates the need to implement and maintain features similar to batch/v1 Job, and makes it easy to adopt new batch/v1 Job features.

Since the major benefit is reducing the duplicated development workload for some features, do we have a list of features that are not implemented in training-operator but would be covered once batch/v1 Job is adopted?

zw0610 avatar Jan 12 '23 07:01 zw0610

> Introducing batch/v1 Job eliminates the need to implement and maintain features similar to batch/v1 Job, and makes it easy to adopt new batch/v1 Job features.

> Since the major benefit is reducing the duplicated development workload for some features, do we have a list of features that are not implemented in training-operator but would be covered once batch/v1 Job is adopted?

Good point. I haven't created such a list yet. So far, I have found suspending Jobs and Pod disruption conditions.

I'll create a list and share it with you in this issue.

tenzen-y avatar Jan 12 '23 08:01 tenzen-y

+1 This has been discussed before in #1303

If you could provide a list of any missing functionality in the Job API, we could add those items to the roadmap. We made a lot of progress with failure policies, but IIUC, there's also a need for some form of success policy?

Also @ahg-g is working on a proposal for a multi-pod-template API, that he's going to present in the Batch working group meeting on Feb 2nd https://docs.google.com/document/d/1XOeUN-K0aKmJJNq7H07r74n-mGgSFyiEDQ3ecwsGhec/edit#heading=h.ukbaidczvy3r

alculquicondor avatar Jan 12 '23 14:01 alculquicondor

+1

The benefit is not just deduping code, but also helping to defragment the ecosystem. While I do understand the benefit of having dedicated APIs for MPI, TF training etc., it is important that they build on a common API that we can use for job-level scheduling and autoscaling.

As @alculquicondor mentioned, I am working on a proposal that I will make public next week and will be discussed in Kubernetes batch working group. I am also happy to schedule a time and discuss it with the kubeflow community, can you please let me know how/where I can put this topic on the meeting agenda?

ahg-g avatar Jan 12 '23 14:01 ahg-g

> If you could provide a list of any missing functionality in the Job API, we could add those items to the roadmap. We made a lot of progress with failure policies, but IIUC, there's also a need for some form of success policy?

@alculquicondor I think introducing a success policy would be useful for the training operator, since we currently maintain that logic ourselves in the tensorflow-controller.

https://github.com/kubeflow/training-operator/blob/ddf372c8fe0d179674f3dc5e61fbd71246c32924/pkg/controller.v1/tensorflow/tfjob_controller.go#L505-L534

Perhaps the tensorflow-controller or training-operator could be one of the use cases for introducing a success policy to batch/v1 Job.
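As a sketch of what this could look like (speculative at the time of this comment; Kubernetes later added a `successPolicy` field for Indexed Jobs along these lines), the "Job succeeds when the chief succeeds" rule might be declared as:

```yaml
# Speculative sketch: declare the Job successful once completion index 0
# (the chief/master replica) succeeds, instead of the operator inspecting
# Pod phases itself.
apiVersion: batch/v1
kind: Job
metadata:
  name: tfjob-example            # hypothetical name
spec:
  completionMode: Indexed
  completions: 4
  parallelism: 4
  successPolicy:
    rules:
    - succeededIndexes: "0"      # Job succeeds when index 0 completes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: tensorflow
        image: example.com/training-image:latest   # hypothetical image
```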

> Also @ahg-g is working on a proposal for a multi-pod-template API, that he's going to present in the Batch working group meeting on Feb 2nd https://docs.google.com/document/d/1XOeUN-K0aKmJJNq7H07r74n-mGgSFyiEDQ3ecwsGhec/edit#heading=h.ukbaidczvy3r

Thanks for sharing. I'm interested in a multi-pod-template API, since we could consider using it after we introduce batch/v1 Job to the training operator.

Are there KEPs for a multi-pod-template API in k/enhancements?

tenzen-y avatar Jan 12 '23 17:01 tenzen-y

> The benefit is not just deduping code, but also helping to defragment the ecosystem. While I do understand the benefit of having dedicated APIs for MPI, TF training etc., it is important that they build on a common API that we can use for job-level scheduling and autoscaling.

@ahg-g Yes, exactly. I think so too.

> As @alculquicondor mentioned, I am working on a proposal that I will make public next week and will be discussed in Kubernetes batch working group. I am also happy to schedule a time and discuss it with the kubeflow community, can you please let me know how/where I can put this topic on the meeting agenda?

We have bi-weekly community meetings for WG Training, and there is a meeting note in https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#.

I rarely attend the meetings myself, but you can share the multi-pod-template API with the WG Training leads.

tenzen-y avatar Jan 12 '23 17:01 tenzen-y

> Are there KEPs for a multi-pod-template API in k/enhancements?

Not yet. As I mentioned above, I will share a Google doc next week; it is easier to discuss such a significant proposal in a Google doc before we move to a KEP. Note that the plan is to initially host the API under the Kueue project to iterate quickly on the API, with the goal of eventually upstreaming it.

ahg-g avatar Jan 12 '23 17:01 ahg-g

> Are there KEPs for a multi-pod-template API in k/enhancements?

> Not yet. As I mentioned above, I will share a Google doc next week; it is easier to discuss such a significant proposal in a Google doc before we move to a KEP. Note that the plan is to initially host the API under the Kueue project to iterate quickly on the API, with the goal of eventually upstreaming it.

I see. Thanks for letting me know.

tenzen-y avatar Jan 12 '23 17:01 tenzen-y

I will work on this issue after the Kubeflow v1.7 feature freeze, since that date is coming up. Then I will share a table in this issue mapping batch/v1 Job features to training-operator Job features.

If this issue turns out to require significant API changes, I will submit a proposal to this repository.

Also, I will start on the actual implementation after #1714 is done.

tenzen-y avatar Jan 13 '23 03:01 tenzen-y

/cc @richardsliu

ahg-g avatar Jan 13 '23 19:01 ahg-g

We may need to wait for the Indexed Jobs with unset completions parameter feature (Elastic Indexed Jobs) to support Elastic PytorchJob.

ref: https://github.com/kubernetes/enhancements/issues/3715
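For context, KEP-3715 (Elastic Indexed Jobs) makes `completions` and `parallelism` mutable together on an Indexed Job, which is what elastic scaling would need. A hypothetical elastic worker Job might look like:

```yaml
# Hypothetical elastic worker Job: under KEP-3715, parallelism and
# completions become mutable together on an Indexed Job, so the operator
# could resize this Job (e.g., patch both fields from 4 to 6) instead of
# creating or deleting individual per-rank Pods.
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-elastic-worker   # hypothetical name
spec:
  completionMode: Indexed
  completions: 4                 # scale up/down by patching both fields...
  parallelism: 4                 # ...to the same new value
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pytorch
        image: example.com/elastic-training:latest   # hypothetical image
```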

tenzen-y avatar Jan 17 '23 18:01 tenzen-y

Is elastic Pytorch the only training job that supports resizing?

Does it matter which workers get removed?

alculquicondor avatar Jan 17 '23 18:01 alculquicondor

> Is elastic Pytorch the only training job that supports resizing?

IIUC, we support only one master replica for PytorchJob. So yes.

https://github.com/kubeflow/training-operator/blob/b87c6fa0be186ecf8bc901c74cdc52a529ddd536/pkg/apis/kubeflow.org/v1/pytorch_validation.go#L62-L66

> Does it matter which workers get removed?

It probably does not matter which worker is deleted, since Elastic PytorchJob uses a local elastic agent.

@gaocegege @zw0610 If I misunderstand the Elastic PytorchJob, can you correct me?

tenzen-y avatar Jan 17 '23 19:01 tenzen-y

Also, we may be able to use the Indexed Jobs with unset completions parameter feature for MPIJob v1 with Horovod.

tenzen-y avatar Jan 17 '23 19:01 tenzen-y

Elastic Indexed Jobs are supposed to graduate to beta in K8s 1.27, so we can work on this once we stop supporting K8s 1.26 (maybe next year?).

tenzen-y avatar Jan 26 '23 04:01 tenzen-y

I agree

alculquicondor avatar Jan 26 '23 13:01 alculquicondor

We may be able to introduce JobSet instead of batch/v1 Job, although I think we need to wait for the JobSet API to reach beta.

https://github.com/kubernetes-sigs/jobset
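For comparison, a hedged sketch of the JobSet shape (the API was alpha at the time, so the group/version and field names may differ; all names here are hypothetical): a JobSet wraps one replicated batch/v1 Job per replica type:

```yaml
# Hypothetical sketch of a JobSet-based layout: one replicated Job per
# replica type (master/worker), with JobSet managing the Jobs and a shared
# headless Service for Pod-to-Pod communication.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch-example          # hypothetical name
spec:
  replicatedJobs:
  - name: master
    replicas: 1
    template:                    # a batch/v1 JobTemplateSpec
      spec:
        completions: 1
        parallelism: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: pytorch
              image: example.com/training-image:latest   # hypothetical
  - name: worker
    replicas: 1
    template:
      spec:
        completions: 3
        parallelism: 3
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: pytorch
              image: example.com/training-image:latest   # hypothetical
```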

tenzen-y avatar May 04 '23 11:05 tenzen-y

As a first step, migrating to batch/v1 Job might be better; after that, we can migrate to JobSet. Migrating directly to JobSet would have too much impact on the training operator.

tenzen-y avatar May 07 '23 22:05 tenzen-y

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 24 '23 00:08 github-actions[bot]

/lifecycle frozen

tenzen-y avatar Aug 24 '23 13:08 tenzen-y

/assign

We're starting the discussion based on my enhancement proposal.

tenzen-y avatar May 15 '24 07:05 tenzen-y