common: Unified training operator implementation progress

@zw0610 and I presented the all-in-one training operator proposal at last month's community meeting.

The WG-Training leads have already agreed to move forward. This issue tracks implementation progress. The target for the alpha release of the new unified operator is Kubeflow 1.4.

Configuration and deployment

Description	Category	Status	Issue
Kustomize package	Required	Done
Application CR	Required	Not Done
Images listed in kustomization.yaml	Required	Not Done
Upgradeability	Required	Not Done
Separate cluster scoped and namespace scoped resources	Recommended	Not Done	N/A
Kustomize package should be deployable on its own	Recommended	Done	Need to coordinate with 1.4 release

Custom Resources

Description	Category	Status	Issue
Version stability	Required	Not Done
Backward compatibility	Required	Not Done
Supports status subresource	Required	Done	All jobs have status to reflect the real status
CRD schema validation	Required	Not Done
Training operators follow kubeflow/common conventions	Required	Done	https://github.com/kubeflow/tf-operator/pull/1296 https://github.com/kubeflow/tf-operator/pull/1295 https://github.com/kubeflow/tf-operator/pull/1294 https://github.com/kubeflow/tf-operator/pull/1293
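
For the status subresource row above, here is a minimal sketch of how a client might read a job's current state from its status. It assumes the status carries a `conditions` list in the style of kubeflow/common's JobCondition (types such as Created, Running, Succeeded, Failed, each with a string `status` of "True" or "False"), appended in chronological order. The helper `current_condition` and the sample object are hypothetical and only for illustration.

```python
from typing import Optional


def current_condition(job: dict) -> Optional[str]:
    """Return the type of the last condition whose status is "True".

    Assumption: conditions are appended in chronological order, so the
    last active entry reflects the job's current state. Returns None when
    the job has no status or no active condition yet.
    """
    status = job.get("status") or {}
    conditions = status.get("conditions") or []
    for condition in reversed(conditions):
        if condition.get("status") == "True":
            return condition.get("type")
    return None


# Hypothetical TFJob-like object; names and values are made up.
sample_job = {
    "kind": "TFJob",
    "metadata": {"name": "mnist"},
    "status": {
        "conditions": [
            {"type": "Created", "status": "True"},
            {"type": "Running", "status": "False"},
            {"type": "Succeeded", "status": "True"},
        ]
    },
}

print(current_condition(sample_job))
```

Because every job kind would expose the same status shape under this assumption, one helper like this could serve TFJob, PyTorchJob, and the other job types alike.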

Observability

Description	Category	Status
Liveness/Readiness signals	Required	Not Done
Prometheus metrics and Graphs	Required	Not Done
Job Events	Required	Not Done
JSON logging	Recommended	Not Done

CI/CD

Description	Category	Status	Issue
E2E tests	Required	Not Done
Scalability / load testing	Required	Not Done
Continuous building of Docker images	Recommended	Not Done	https://github.com/kubeflow/testing/pull/951
Continuous updating of Kustomize manifests	Recommended	Not Done	No longer applicable: kubeflow/manifests will fetch the repo's Kustomize manifests

Docs

Description	Category	Status	Issue
API Reference docs	Required	Not Done
Application docs	Required	Not Done

Owners/Maintenance

Description	Category	Explanation	Status	Issue
Healthy number of committers and commits	Required	Committers are listed as approvers in OWNERS files. Number to be determined by TOC based on size and scope of application	Not Done
At least 2 different organizations are committers	Required		Not Done

Adoption

Description	Category	Explanation	Status	Issue
List of users running the application	Recommended	Suggest listing adopters willing to be identified publicly in ADOPTERS.md	Not Done

Jun 26 '21 23:06 Jeffwan

Things to figure out.

Code repo process and project name -> confirm with Bobby.
Tech stack: Kubebuilder version, Kubernetes version, etc.
Integration environments: Prow or GitHub Actions? Where to host the images? (Andrey)
API version management & clientset generation
Development cycle

Jun 27 '21 00:06 Jeffwan

An update on the items above. @zw0610 @kubeflow/wg-training-leads

> code repo process, project name -> confirm with Boggy.

Reuse tf-operator and rename it to kubeflow/training-operator, pending confirmation with Boggy.

All issues, commits, followers, and stars will be transferred to the new repo.

> tech stack? Kubebuilder version, Kubernetes version etc

Kubernetes 1.19.x, Kubebuilder 3.0.0, controller-runtime v0.7.2

> integration environments - Prow or GitHub Actions, where to hold the images?

Reuse our Prow test jobs in the all-in-one-operator branch. Use AWS public images and CD in the short term.

> API version management & clientset generation

Start from the v1 API, since we plan to reuse most of the existing specs in phase 1. Client generation will be postponed until we see that other repos want to leverage it.

> Development cycle

Use a separate develop branch in tf-operator (July 16) -> once all features are ready, merge back to master (2 weeks of review by the training leads) -> clean up the code base (1 week) -> rename the repo (1 month, and catch the 1.4 release)

We plan to have an alpha RC release by the Training & AutoML summit (July 16).

Jul 05 '21 18:07 Jeffwan

Thank you for driving this @Jeffwan!

> Kubernetes 1.19.x, Kubebuilder 3.0.0, controller-runtime v0.7.2

Is there any limitation that requires Kubernetes 1.19? Can we just jump to 1.20, or even to the latest 1.21?

> Client generation will be postponed until we see that other repos want to leverage it.

Does this mean we are also dropping SDK support? Or are we talking only about the clientset, listers, and informers?

Jul 05 '21 19:07 andreyvelich

> Is there any limitation that requires Kubernetes 1.19? Can we just jump to 1.20, or even to the latest 1.21?

Yeah, this is flexible. Since the current repo uses a lower version, we plan to start with 1.19 and then jump to 1.21 once we merge back to master. That way, if some users are on a lower version, we can still provide a tag or release for them.

> Does this mean we are also dropping SDK support? Or are we talking only about the clientset, listers, and informers?

Yeah, you are right. The Python SDK will be supported. I meant clientsets. The controller itself uses a higher-level client and doesn't need clientsets. BTW, does Katib use them?

Jul 06 '21 04:07 Jeffwan

Sounds good @Jeffwan.

> Yeah, you are right. The Python SDK will be supported. I meant clientsets. The controller itself uses a higher-level client and doesn't need clientsets. BTW, does Katib use them?

No, we only use the APIs from TFJob (https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/experiment/validator/validator.go#L28) to validate TFJob, etc. This can also be dropped on our side, since it's not necessary. cc @kubeflow/wg-automl-leads

Jul 06 '21 13:07 andreyvelich

@Jeffwan Great. Can we merge the code in phases so that review is easier?

Jul 06 '21 18:07 johnugeorge

@johnugeorge Sure. I will cc all training leads on PRs coming into the feature branch.

Jul 07 '21 00:07 Jeffwan