Unified training operator work in progress
@zw0610 and I presented the all-in-one training operator proposal at last month's community meeting.
WG-Training leads have already agreed to move forward. This issue was created to track implementation progress. The target for the alpha release of this new unified operator is Kubeflow 1.4.
Configuration and deployment
| Description | Category | Status | Issue |
|---|---|---|---|
| Kustomize package | Required | Done | |
| Application CR | Required | Not Done | |
| Images listed in kustomization.yaml | Required | Not Done | |
| Upgradeability | Required | Not Done | |
| Separate cluster scoped and namespace scoped resources | Recommended | Not Done | N/A |
| Kustomize package should be deployable on its own | Recommended | Done | Need to coordinate with 1.4 release |
Custom Resources
| Description | Category | Status | Issue |
|---|---|---|---|
| Version stability | Required | Not Done | |
| Backward compatibility | Required | Not Done | |
| Supports status subresource | Required | Done | All jobs have status to reflect the real status |
| CRD schema validation | Required | Not Done | |
| Training operators follow kubeflow/common conventions | Required | Done | https://github.com/kubeflow/tf-operator/pull/1296 https://github.com/kubeflow/tf-operator/pull/1295 https://github.com/kubeflow/tf-operator/pull/1294 https://github.com/kubeflow/tf-operator/pull/1293 |
Observability
| Description | Category | Status | Issue |
|---|---|---|---|
| Liveness/Readiness signals | Required | Not Done | |
| Prometheus metrics and Graphs | Required | Not Done | |
| Job Events | Required | Not Done | |
| JSON logging | Recommended | Not Done | |
CI/CD
| Description | Category | Status | Issue |
|---|---|---|---|
| E2E tests | Required | Not Done | |
| Scalability / load testing | Required | Not Done | |
| Continuous building of docker images | Recommended | Not Done | https://github.com/kubeflow/testing/pull/951 |
| Continuous updating of Kustomize manifests | Recommended | Not Done | This is not valid anymore - kubeflow/manifests will fetch repo's kustomize manifest |
Docs
| Description | Category | Status | Issue |
|---|---|---|---|
| API Reference docs | Required | Not Done | |
| Application docs | Required | Not Done | |
Owners/Maintenance
| Description | Category | Explanation | Status | Issue |
|---|---|---|---|---|
| Healthy number of committers and commits | Required | Committers are listed as approvers in owners files. Number to be determined by TOC based on size and scope of application | Not Done | |
| At least 2 different organizations are committers | Required | | Not Done | |
Adoption
| Description | Category | Explanation | Status | Issue |
|---|---|---|---|---|
| List of users running the application | Recommended | Suggest listing adopters willing to be identified publicly in ADOPTERS.md | Not Done | |
Things to figure out.
- code repo process, project name -> confirm with Bobby.
- tech stack? Kubebuilder version, Kubernetes version etc
- integration environments - Prow or Github Actions, Where to hold the images? Andrey
- API version management & clientset generation
- Development cycle
An update on the above items. @zw0610 @kubeflow/wg-training-leads
- code repo process, project name -> confirm with Boggy.
  Reuse tf-operator and rename it to kubeflow/training-operator, pending confirmation with Boggy. All issues, commits, followers, and stars will be transferred to the new repo.
- tech stack? Kubebuilder version, Kubernetes version etc
  Kubernetes 1.19.x, Kubebuilder 3.0.0, controller-runtime v0.7.2.
- integration environments - Prow or Github Actions, Where to hold the images?
  Reuse our Prow test jobs in the all-in-one-operator branch. Use AWS public images and CD for the short term.
- API version management & clientset generation
  Start from the v1 API, since we plan to reuse most of the existing specs in phase 1. Client generation will be postponed until we see other repos that want to leverage it.
- Development cycle
  Use a separate tf-operator development branch (July 16) -> when all features are ready, merge back to master (2 weeks of review by training leads) -> clean up the code base (1 week) -> rename the repo (1 month, in time for the 1.4 release).

We plan to have an alpha RC release by the Training & AutoML summit (July 16).
Thank you for driving this @Jeffwan!
> kubernetes 1.19.x kubebuilder 3.0.0 controller-runtime v0.7.2

Is there any limitation why we need to use Kubernetes 1.19? Can we just jump to 1.20 or even to the latest 1.21 version?
> Client generation will be postponed until we see other repos that want to leverage it.

Does this mean that we also drop SDK support? Or are we only talking about clientsets, listers, and informers?
> Is there any limitation why we need to use Kubernetes 1.19? Can we just jump to 1.20 or even to the latest 1.21 version?

Yeah, this is flexible. The current repo uses a lower version, so we plan to start with 1.19 and then jump to 1.21 once we merge back to master. That way, if someone uses a lower version, we can have a tag or release for those users.
> Does this mean that we also drop SDK support? Or are we only talking about clientsets, listers, and informers?

Yeah, you are right. The Python SDK will be supported. I meant clientsets. The controller itself uses a higher-level client and doesn't need clientsets. BTW, does Katib use them?
Sounds good @Jeffwan.
> Yeah, you are right. The Python SDK will be supported. I meant clientsets. The controller itself uses a higher-level client and doesn't need clientsets. BTW, does Katib use them?

No, we only use the TFJob API types (https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/experiment/validator/validator.go#L28) to validate TFJob, etc. But this can also be omitted on our side since it's not necessary. cc @kubeflow/wg-automl-leads
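To illustrate the point above, that validating a TFJob needs only its schema and no generated clientset, here is a minimal sketch in Python (chosen for brevity; Katib's validator is written in Go). The function name `validate_tfjob`, the rule set, and the list of accepted replica types are assumptions made for illustration. This is not Katib's actual validation logic.

```python
from typing import Any, Dict, List

# Assumed set of replica type names, for illustration only.
KNOWN_REPLICA_TYPES = {"Chief", "Master", "PS", "Worker", "Evaluator"}


def validate_tfjob(manifest: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the manifest passed.

    Hypothetical helper: it inspects the manifest structure only and never
    talks to the API server, so no clientset is required.
    """
    errors: List[str] = []

    if manifest.get("apiVersion") != "kubeflow.org/v1":
        errors.append("apiVersion must be kubeflow.org/v1")
    if manifest.get("kind") != "TFJob":
        errors.append("kind must be TFJob")

    spec = manifest.get("spec") or {}
    replica_specs = spec.get("tfReplicaSpecs") or {}
    if not replica_specs:
        errors.append("spec.tfReplicaSpecs must define at least one replica type")

    for name, replica_spec in replica_specs.items():
        if name not in KNOWN_REPLICA_TYPES:
            errors.append(f"unknown replica type: {name}")
        # Treat a missing replica count as 1 (an assumption of this sketch).
        replicas = (replica_spec or {}).get("replicas", 1)
        if not isinstance(replicas, int) or replicas < 1:
            errors.append(f"{name}.replicas must be a positive integer")

    return errors
```

A caller would pass the parsed manifest as a plain dictionary and reject the job if the returned list is non-empty.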
@Jeffwan Great. Can we merge the code in phases so that review is easier?
@johnugeorge Sure. I will cc all training leads on PRs coming into the feature branch.