
AutoML WG and Kubeflow 1.5 release

Open DnPlas opened this issue 3 years ago • 12 comments

@kubeflow/wg-automl-leads let's use this tracking issue to coordinate the integration of AutoML with the Kubeflow 1.5 release.

First off, a heads up: the feature freeze phase will start on Tuesday (25th January). Before then, I'd like to update this repo with the manifests from the kubeflow/katib repo so that we can cut the first RC tag in this repo.

So what I'd like to ask as a first step before the feature freeze is:

  1. What version of Katib would you like to include for the 1.5 release?
  2. Could you provide me with a branch/tag for this version? It doesn't have to be final. The branch/tag provided can keep getting fixes throughout the release process, but not new features.
  3. Are there any open issues/work in progress that you will be working on for your version while the KF release process is progressing?
  4. What will the K8s supported versions be for kubeflow/katib?

DnPlas avatar Jan 19 '22 17:01 DnPlas

From the versioning issue we know we are targeting 0.13: https://github.com/kubeflow/manifests/issues/2098#issuecomment-1011157180. @kubeflow/wg-automl-leads, let's use this issue for further updates, new tags, progressing issues, etc.

kimwnasptd avatar Jan 24 '22 16:01 kimwnasptd

Hi @kubeflow/wg-automl-leads, before the manifest testing on Wednesday, Feb 9th, the release team is planning to cut another RC to use for the testing.

Based on a previous communication, the release team will be using AutoML version 0.13rc0. If the AutoML WG have identified any issues since the feature freeze and would like to update the AutoML version before the manifest testing, let us know before Feb. 9th. Thank you!

@andreyvelich

DomFleischmann avatar Feb 08 '22 08:02 DomFleischmann

After syncing in today's AutoML meeting, we will keep using the 0.13-rc0 tag for the RC1 of the Manifests. A newer RC might be cut for the kubeflow/katib repo later on, in case more issues arise.

One more note: the @kubeflow/wg-automl-leads will update the kubeflow/katib e2e tests to use the v1.5-branch branch of the manifests. This means the e2e tests will be using the latest training operators, so we'll keep an eye on any issues that arise.

kimwnasptd avatar Feb 09 '22 15:02 kimwnasptd

I deployed Kubeflow from v1.5-branch and ran this example: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/kubeflow-pipelines/kubeflow-e2e-mnist.ipynb and encountered this issue: https://github.com/kubeflow/katib/issues/1795

I found the metric collector is not injected into the trial pod:

mnist-e2e-jxnc28x2-chief-0                                        0/1     Completed 
mnist-e2e-jxnc28x2-worker-0                                       0/1     Completed

Does anyone have the same issue? I'm not sure if this is the right place to discuss/report this.

BTW, the early-stop sample works well, and I do see that the metric collector container was injected:

median-stop-new2-nxh6jbn7-h7h48                                   0/2     Completed 
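The difference shows up in the READY column: 0/1 means the pod has a single container, while 0/2 suggests a second (sidecar) container is present. A minimal sketch for checking this from a saved pod listing, assuming the usual `NAME READY STATUS` column order of `kubectl get pods`; the function names and the `expected_containers` parameter are hypothetical, not part of Katib:

```python
def container_counts(pod_listing):
    """Map pod name -> total container count, read from the READY column."""
    counts = {}
    for line in pod_listing.splitlines():
        fields = line.split()
        # Skip blank lines, headers, and anything without a "ready/total" field.
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        _ready, total = fields[1].split("/")
        counts[fields[0]] = int(total)
    return counts


def pods_below(pod_listing, expected_containers=2):
    """Return pods that have fewer containers than expected (assumed: 2)."""
    return [
        name
        for name, total in container_counts(pod_listing).items()
        if total < expected_containers
    ]


listing = """\
mnist-e2e-jxnc28x2-chief-0      0/1  Completed
mnist-e2e-jxnc28x2-worker-0     0/1  Completed
median-stop-new2-nxh6jbn7-h7h48 0/2  Completed
"""
flagged = pods_below(listing)
print(flagged)
```

Note that this flags every single-container pod, including workers; as far as I understand, only the primary pod of a trial is expected to carry the metric collector, so the result still needs to be filtered by role.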

yhwang avatar Feb 09 '22 21:02 yhwang

Thanks for raising this @yhwang! I also bumped into this when writing the e2e tests.

The fix for this should be to use training.kubeflow.org/job-role: master as the PrimaryPodLabel. Here's how I did it in the codified version of the above notebook: https://github.com/kubeflow/manifests/pull/2128/files#diff-ba317d8735e3ac6c584fe8dc196fddb304ad5e548b94599c35eeb59bcfa8e89eR159
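As a rough illustration of that change, here is a sketch of the relevant trial template fragment as a plain Python dict. It assumes the `primaryPodLabels` and `primaryContainerName` field names of the Katib v1beta1 trial template; the `build_trial_template` helper, the `tensorflow` container name, and the stub trial spec are hypothetical placeholders, and the linked PR remains the authoritative version:

```python
# Label mentioned above as the fix for the newer Training Operators.
PRIMARY_POD_LABELS = {"training.kubeflow.org/job-role": "master"}


def build_trial_template(trial_spec, container_name="tensorflow"):
    """Build a trialTemplate fragment (sketch only, not a full Experiment).

    trial_spec: the job manifest (for example a TFJob) as a dict.
    container_name: assumed name of the training container in that job.
    """
    return {
        "primaryContainerName": container_name,
        # Tells Katib which pod of the job is the primary one.
        "primaryPodLabels": dict(PRIMARY_POD_LABELS),
        "trialSpec": trial_spec,
    }


# Stub trial spec; a real one would carry the full TFJob definition.
trial_template = build_trial_template(
    {"apiVersion": "kubeflow.org/v1", "kind": "TFJob"}
)
print(trial_template["primaryPodLabels"])
```

The dict would then be placed under `spec.trialTemplate` of the Experiment, alongside the trial parameters that a real experiment needs.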

We also discussed this in this week's AutoML meeting, and we'll expose the full list of annotations/changes users need to keep in mind for the new 1.4 version of the Training Operators.

kimwnasptd avatar Feb 10 '22 09:02 kimwnasptd

Thanks @kimwnasptd. I tried training.kubeflow.org/job-role: master and the metric collector is injected. However, it only finished the first trial, and no subsequent trials were scheduled. The experiment is still in the running state but makes no further progress. Do you have the same issue?

yhwang avatar Feb 10 '22 18:02 yhwang

I haven't bumped into this. In my case, with a KinD 1.20 cluster, all the trials reached the Succeeded state after running the test https://github.com/kubeflow/manifests/blob/master/tests/e2e/runner.sh.

Can you open a separate issue in the kubeflow/katib repo so that we can dig deeper into it?

I'll also start using Prow for the e2e tests with AWS clusters in the manifests repo, and I'll give a heads up if I bump into this.

kimwnasptd avatar Feb 11 '22 10:02 kimwnasptd

Forgot to update you on my latest Katib status. The problem seems to be that a TFJob from a previous run got stuck in a weird state. After I removed that job, Katib works well. Thanks for the script and the hint.

yhwang avatar Feb 15 '22 17:02 yhwang

@andreyvelich @johnugeorge @gaocegege I'm working on finalizing the manifests for the release, as we are getting closer to the release date of March 9th.

Regarding the kubeflow/katib repo, when are you planning to cut the final v0.13 tag? Could you do it within this week so that we can get the manifests closer to their final state?

kimwnasptd avatar Mar 01 '22 07:03 kimwnasptd

@kimwnasptd, we will do it this week.

johnugeorge avatar Mar 01 '22 11:03 johnugeorge

Just saw it's ready. Congrats on the release 🎉

kimwnasptd avatar Mar 04 '22 17:03 kimwnasptd

Hey folks, any docs changes required as a result of this work? Please create an issue and mention it on this tracking issue. https://github.com/kubeflow/website/issues/3130

shannonbradshaw avatar Mar 07 '22 23:03 shannonbradshaw

This effort has been finalised.

DnPlas avatar Apr 25 '23 13:04 DnPlas