training-operator icon indicating copy to clipboard operation
training-operator copied to clipboard

The whole architecture is not clean

Open HeGaoYuan opened this issue 2 years ago • 3 comments

The whole kubeflow/training-operator relies on kubeflow/common heavily. But the kubeflow/common was designed for old Controller architecture like the k8s's Job/Deployment Controller. Now we are using controller-runtime as our architecture to implement Controller, so the mixed using kubeflow/common and controller-runtime makes plenty of redundant codes. It goes without saying that this architecture brings many disadvantages.

Some direct evidences: the following codes come JobController in kubeflow/common.

https://github.com/kubeflow/common/blob/21910a93c4ed8d8338d9d7414067f888801dd0bc/pkg/controller.v1/common/job_controller.go#L104-L119

PodLister corelisters.PodLister

ServiceLister corelisters.ServiceLister

PriorityClassLister schedulinglisters.PriorityClassLister

PodInformerSynced cache.InformerSynced

ServiceInformerSynced cache.InformerSynced

PriorityClassInformerSynced cache.InformerSynced

WorkQueue workqueue.RateLimitingInterface

And following codes are NewReconciler of MPIJobReconciler.

https://github.com/kubeflow/training-operator/blob/e13d262fe4fabd548ded944e5356c99f30a33b00/pkg/controller.v1/mpi/mpijob_controller.go#L87-L99

In MPIJobReconciler struct:

The WorkQueue is useless, it is set by &util.FakeWorkQueue{}.

The PodInformerSynced and ServiceInformerSynced are useless, they are not initialized at all. The PriorityClassInformerSynced is also useless, but it has been initialized.

What I demonstrated here is just a part of this problem.

Referring to point2 of https://github.com/kubeflow/training-operator/issues/1703

HeGaoYuan avatar Dec 27 '22 09:12 HeGaoYuan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 24 '23 00:08 github-actions[bot]

cc @kuizhiqing @tenzen-y @johnugeorge

andreyvelich avatar Aug 24 '23 13:08 andreyvelich

Yes, introducing the controller-runtime pattern to the common codes would be better. Actually, I mentioned about that: https://github.com/kubeflow/training-operator/issues/1816#issuecomment-1663721566

/lifecycle frozen

tenzen-y avatar Aug 24 '23 13:08 tenzen-y