
Training Operator WG and Kubeflow 1.5 release


@kubeflow/wg-training-leads let's use this tracking issue to coordinate the integration of Training Operator with the Kubeflow 1.5 release.

First off, a heads up that the feature freeze phase will start on Wednesday (26th January). Before then, I'd like to update this repo with the manifests from the kubeflow/training-operator repo, in order to be able to cut the first RC tag in this repo.

So what I'd like to ask as a first step before the feature freeze is:

1. What version of Training Operator would you like to include for the 1.5 release? Could you provide me with a branch/tag for this version? It doesn't have to be final; the branch/tag provided can keep getting fixes throughout the release process, but not new features.
2. Are there any open issues or work in progress that you will be working on for your version as the KF release process progresses?
3. What will the supported K8s versions be for kubeflow/training-operator?

DomFleischmann avatar Jan 19 '22 10:01 DomFleischmann

The RC release tag v1.4.0-rc.0 has been created: https://github.com/kubeflow/training-operator/tree/v1.4.0-rc.0

@kimwnasptd In the last release, I see that the manifests structure in this repo is different from the training operator repo. Just wondering how this happened?

Can you sync the RC-tagged manifests from the training operator repo with this repo? https://github.com/kubeflow/training-operator/tree/v1.4.0-rc.0/manifests

The MPI operator is bundled in the training operator, so this folder (https://github.com/kubeflow/manifests/tree/master/apps/mpi-job/upstream) can be deleted.
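For context, the switch roughly amounts to dropping the standalone mpi-job entry from this repo's example kustomization and keeping only the training-operator one. A minimal sketch, assuming components are referenced by relative path as below (the exact overlay paths in kubeflow/manifests may differ):

```yaml
# example/kustomization.yaml (sketch only; verify paths against the actual repo)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Training Operator, which now also serves MPIJob:
  - ../apps/training-operator/upstream/overlays/kubeflow
  # Standalone MPI operator entry that would be removed:
  # - ../apps/mpi-job/upstream/overlays/kubeflow
```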

/cc @terrytangyuan

johnugeorge avatar Jan 26 '22 20:01 johnugeorge

Thanks for the update @johnugeorge @terrytangyuan!

@kimwnasptd In the last release, I see that the manifests structure in this repo is different from the training operator repo. Just wondering how this happened?

Hmmm, not sure. It seems that the last PR that updated the Operator's manifests was for copying over the RC2 manifests: https://github.com/kubeflow/manifests/pull/2032. Looking at the stable 1.3.0 version of the Operator, I see the crds folder: https://github.com/kubeflow/training-operator/tree/v1.3.0/manifests/base. So the issue was that we never updated the manifests from RC2 to the final release. Will keep this in mind for this release.

I'll create a PR to update the manifests from Training Operator now; I have an automated script for this.

kimwnasptd avatar Jan 26 '22 21:01 kimwnasptd

Great. Thanks!

terrytangyuan avatar Jan 26 '22 21:01 terrytangyuan

@terrytangyuan @johnugeorge does the training operator v1.4.0-rc.0 include elastic training? https://github.com/kubeflow/community/pull/522

jbottum avatar Jan 26 '22 21:01 jbottum

@terrytangyuan @johnugeorge does the training operator v1.4.0-rc.0 include elastic training? kubeflow/community#522

Yes

terrytangyuan avatar Jan 26 '22 21:01 terrytangyuan

@terrytangyuan @johnugeorge does the training operator v1.4.0-rc.0 include elastic training? kubeflow/community#522

Elastic PyTorch training is supported through PyTorchJob in the new release, and elastic Horovod training is supported through MPIJob:

https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/echo/echo.yaml

https://github.com/kubeflow/mpi-operator/blob/master/examples/horovod/tensorflow-mnist-elastic.yaml
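For readers unfamiliar with the new API, here is a minimal sketch of an elastic PyTorchJob using the elasticPolicy field introduced in v1.4.0. The job name, image, command, and replica counts are placeholders for illustration, not taken from the linked echo.yaml:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-example          # placeholder name
spec:
  elasticPolicy:
    rdzvBackend: c10d            # rendezvous backend for torch elastic
    minReplicas: 1               # job keeps running with as few as 1 worker
    maxReplicas: 3               # can scale out to at most 3 workers
    maxRestarts: 100
  pytorchReplicaSpecs:
    Worker:
      replicas: 2                # initial worker count
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/elastic-train:latest   # placeholder image
              command:
                - python
                - -m
                - torch.distributed.run
                - train.py                              # placeholder script
```

The linked examples above are the actual, tested manifests; this sketch only shows where the elastic settings live in the spec.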

johnugeorge avatar Jan 27 '22 07:01 johnugeorge

Hi @kubeflow/wg-training-leads, before the manifest testing on Wednesday, Feb 9th, the release team is planning to cut another RC to use for the testing.

Based on a previous communication, the release team will be using Training Operator version v1.4.0-rc.0. If the Training WG has identified any issues since the feature freeze and would like to update the Training Operator version before the manifest testing, let us know before Feb 9th. Thank you!

@johnugeorge

DomFleischmann avatar Feb 08 '22 09:02 DomFleischmann

After syncing in today's AutoML/Training meeting, we will keep using the v1.4.0-rc.0 tag for RC1 of the manifests. A newer RC might be cut for the kubeflow/training-operator repo later on, in case more issues arise.

One more note: the @kubeflow/wg-automl-leads will update the kubeflow/katib e2e tests to use the v1.5-branch of the manifests. This means the e2e tests will be using the latest training operator, so we'll be keeping an eye on any issues that might arise.

kimwnasptd avatar Feb 09 '22 15:02 kimwnasptd

@kubeflow/wg-training-leads I'm working on finalizing the manifests for the release, as we are getting closer to the release date of March 9th.

Regarding the kubeflow/training-operator repo, when are you planning to cut the final v1.4.0 tag? Could you do it within this week so that we can get the manifests closer to their final state?

kimwnasptd avatar Mar 01 '22 07:03 kimwnasptd

@kimwnasptd Yes, we will cut it this week.

johnugeorge avatar Mar 01 '22 11:03 johnugeorge

Just saw it's ready. Congrats on the release 🎉

kimwnasptd avatar Mar 04 '22 17:03 kimwnasptd

Hey, folks. Are there docs changes required as a result of this work? If so, please create an issue and mention it on this docs tracking issue: https://github.com/kubeflow/website/issues/3130

shannonbradshaw avatar Mar 07 '22 23:03 shannonbradshaw

/close

There has been no activity for a long time. Please reopen if necessary.

juliusvonkohout avatar Aug 24 '23 16:08 juliusvonkohout

@juliusvonkohout: Closing this issue.

In response to this:

/close

There has been no activity for a long time. Please reopen if necessary.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Aug 24 '23 16:08 google-oss-prow[bot]