manifests
AutoML WG and Kubeflow 1.5 release
@kubeflow/wg-automl-leads let's use this tracking issue to coordinate the integration of AutoML with the Kubeflow 1.5 release.
First off, a heads up that the feature freeze phase will start on Tuesday (25th January). Before then I'd like to update this repo with the manifests from the kubeflow/katib repo, so that we can cut the first RC tag in this repo.
So what I'd like to ask as a first step before the feature freeze is:
- What version of Katib would you like to include for the 1.5 release?
- Could you provide me with a branch/tag for this version? It doesn't have to be final. The branch/tag provided can keep getting fixes throughout the release process, but not new features.
- Are there any open issues/work in progress that you will be working on for your version as the KF release process will be progressing?
- What will the K8s supported versions be for kubeflow/katib?
From the versioning issue we had we know we are targeting 0.13 https://github.com/kubeflow/manifests/issues/2098#issuecomment-1011157180. @kubeflow/wg-automl-leads let's use this issue for further updates, new tags, progressing issues etc.
Hi @kubeflow/wg-automl-leads, before the manifest testing on Wednesday, Feb 9th, the release team is planning on cutting another RC to use for the testing.
Based on a previous communication, the release team will be using AutoML version 0.13rc0. If the AutoML WG have identified any issues since the feature freeze and would like to update the AutoML version before the manifest testing, let us know before Feb. 9th. Thank you!
@andreyvelich
After syncing in today's AutoML we will keep on using the 0.13-rc0 tag, for the RC1 of the Manifests. A newer RC might be cut for the kubeflow/katib repo later on, in case more issues arise.
Another note: the @kubeflow/wg-automl-leads will update the kubeflow/katib e2e tests to use the v1.5-branch branch of the manifests. This means that the e2e tests will be using the latest training operators, so we'll be keeping an eye on issues that might arise.
I deployed Kubeflow from v1.5-branch and ran this example: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/kubeflow-pipelines/kubeflow-e2e-mnist.ipynb
I encountered this issue: https://github.com/kubeflow/katib/issues/1795
I found the metric collector is not injected into the trial pod:
mnist-e2e-jxnc28x2-chief-0 0/1 Completed
mnist-e2e-jxnc28x2-worker-0 0/1 Completed
Does anyone have the same issue? I'm not sure if this is the right place to discuss/report this.
BTW, the early-stop sample works well and I do see that the metric collector container was injected:
median-stop-new2-nxh6jbn7-h7h48 0/2 Completed
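The difference between the two cases above shows up in the READY column: `0/1` means the pod has a single container, while `0/2` means a second container is present. A minimal sketch of that check follows. It assumes that the extra container is the metrics collector sidecar and that no other sidecar (a service mesh proxy, for example) is injected into the pod; the helper name `sidecar_injected` is made up for illustration.

```python
def sidecar_injected(pod_line: str) -> bool:
    """Return True if the pod line reports more than one container.

    Expects the shape "<pod-name> <ready>/<total> <status>", as in the
    listings above. A total above 1 is taken as a sign that a sidecar
    was added, which is an assumption, not a guarantee.
    """
    fields = pod_line.split()
    _ready, total = fields[1].split("/")
    return int(total) > 1


pod_lines = [
    "mnist-e2e-jxnc28x2-chief-0 0/1 Completed",
    "mnist-e2e-jxnc28x2-worker-0 0/1 Completed",
    "median-stop-new2-nxh6jbn7-h7h48 0/2 Completed",
]

for line in pod_lines:
    pod_name = line.split()[0]
    print(pod_name, "sidecar present:", sidecar_injected(line))
```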
Thanks for raising this @yhwang! I also bumped into this when writing the e2e tests
The fix for this should be to use training.kubeflow.org/job-role: master as the PrimaryPodLabel. Here's how I did it in the codified version of the above notebook:
https://github.com/kubeflow/manifests/pull/2128/files#diff-ba317d8735e3ac6c584fe8dc196fddb304ad5e548b94599c35eeb59bcfa8e89eR159
We also discussed this in this week's AutoML meeting, and we'll expose the full list of annotations/changes users need to keep in mind for the new 1.4 version of the Training Operators.
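As a rough illustration of the label change discussed above, the sketch below patches a trial template held as a plain Python dict. It assumes the Katib v1beta1 trial template field is named `primaryPodLabels`; the helper `with_primary_pod_labels`, the `tensorflow` container name, and the stub `trialSpec` are placeholders for illustration and are not taken from the linked PR.

```python
import copy

# Label named in this thread as the value to use for the primary pod.
PRIMARY_POD_LABELS = {"training.kubeflow.org/job-role": "master"}


def with_primary_pod_labels(trial_template: dict) -> dict:
    """Return a copy of the trial template with primaryPodLabels set.

    The input dict is left untouched so the original template can still
    be reused or compared against the patched one.
    """
    patched = copy.deepcopy(trial_template)
    patched["primaryPodLabels"] = dict(PRIMARY_POD_LABELS)
    return patched


# Hypothetical, heavily trimmed trial template (not a complete spec).
template = {
    "primaryContainerName": "tensorflow",
    "trialSpec": {"apiVersion": "kubeflow.org/v1", "kind": "TFJob"},
}

patched = with_primary_pod_labels(template)
print(patched["primaryPodLabels"])
```

The same idea applies when building the template with the Katib SDK or in YAML; only the way the field is set differs.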
Thanks @kimwnasptd, I tried training.kubeflow.org/job-role: master and the metric collector is injected. However, only the first trial finished, and no further trials were scheduled. The experiment is still in the Running state but makes no more progress. Do you have the same issue?
Haven't bumped into this. In my case, with a KinD 1.20 cluster, all the trials got to the Succeeded state after running the test https://github.com/kubeflow/manifests/blob/master/tests/e2e/runner.sh.
Can you open a separate issue in kubeflow/katib so that we can dig deeper into it?
I'll also start using Prow for the e2e tests with AWS clusters in the manifests repo, and I'll give a heads up if I bump into this.
Forgot to update you on my latest Katib status. The problem seems to have been a TFJob from a previous run that got stuck in a weird state. After I removed that job, Katib works well. Thanks for the script and the hint.
@andreyvelich @johnugeorge @gaocegege I'm working on finalizing the manifests for the release, as we are getting closer to the release date of March 9th.
Regarding the kubeflow/katib repo, when are you planning to cut the final v0.13 tag? Could you do it within this week so that we can get the manifests closer to their final state?
@kimwnasptd, we will do it this week.
Just saw it's ready. Congrats on the release 🎉
Hey folks, any docs changes required as a result of this work? Please create an issue and mention it on this tracking issue. https://github.com/kubeflow/website/issues/3130
This effort has been finalised.