training-operator
training-operator copied to clipboard
The whole architecture is not clean
The whole kubeflow/training-operator relies on kubeflow/common heavily. But the kubeflow/common was designed for old Controller architecture like the k8s's Job/Deployment Controller. Now we are using controller-runtime as our architecture to implement Controller, so the mixed using kubeflow/common and controller-runtime makes plenty of redundant codes. It goes without saying that this architecture brings many disadvantages.
Some direct evidences: the following codes come JobController in kubeflow/common.
https://github.com/kubeflow/common/blob/21910a93c4ed8d8338d9d7414067f888801dd0bc/pkg/controller.v1/common/job_controller.go#L104-L119
PodLister corelisters.PodLister
ServiceLister corelisters.ServiceLister
PriorityClassLister schedulinglisters.PriorityClassLister
PodInformerSynced cache.InformerSynced
ServiceInformerSynced cache.InformerSynced
PriorityClassInformerSynced cache.InformerSynced
WorkQueue workqueue.RateLimitingInterface
And following codes are NewReconciler of MPIJobReconciler.
https://github.com/kubeflow/training-operator/blob/e13d262fe4fabd548ded944e5356c99f30a33b00/pkg/controller.v1/mpi/mpijob_controller.go#L87-L99
In MPIJobReconciler struct:
The WorkQueue is useless, it is set by &util.FakeWorkQueue{}.
The PodInformerSynced and ServiceInformerSynced are useless, they are not initialized at all. The PriorityClassInformerSynced is also useless, but it has been initialized.
What I demonstrated here is just a part of this problem.
Referring to point2 of https://github.com/kubeflow/training-operator/issues/1703
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
cc @kuizhiqing @tenzen-y @johnugeorge
Yes, introducing the controller-runtime pattern to the common codes would be better. Actually, I mentioned about that: https://github.com/kubeflow/training-operator/issues/1816#issuecomment-1663721566
/lifecycle frozen