Dynamic roles that can technically support any framework
Every framework's implementation is quite similar, and I think we don't actually need that many controllers/operators. If we can support custom roles, most popular frameworks can adapt to them.
The major challenge is letting the controller know how to construct the cluster-spec environment. If there's a way to represent it in an annotation/label etc., that might be a feasible approach. I am also open to other options.
I think it's a great idea to support custom roles. My personal experience tells me there are many situations where we need to further extend the definition of roles. Moreover, we should not limit the customization to the pod environment. Instead, it might be a good idea to let users 'decorate' the pod template for each customized role.
Without changing the architecture of the current kubeflow operator design too much, I would suggest the following approach:

- Add a `DecoratePod(template *corev1.PodTemplate, rtype commonv1.ReplicaType)` method to `PodReconcilerInterface` and let `ConstructPod` (`ReconcilePod` -> `CreatePod` -> `ConstructPod`) call `DecoratePod` just before returning.
- When launching the manager, the user can specify a customization server address per `ReplicaType`, e.g. `/opt/kubeflow/tf-operator.v1 --decorator CWorker,10.1.2.9:8080,PSX,/var/psx.sock`, and this info will be registered in the manager.
- In the implementation of `BasePodReconciler` (which implements the base functionality of `PodReconcilerInterface`), do nothing to the `template *corev1.PodTemplate` if the user did not specify a decorator for the corresponding `ReplicaType`; otherwise call the registered decorator server to update the pod template.
- If developers prefer to modify the source code and re-compile & re-deploy the operator, they can simply override the implementation of `DecoratePod` in `DerivedPodReconciler` or `XXXJobPodReconciler` so it switches decoration strategies based on the `ReplicaType`.
- If developers prefer not to modify the existing operator code (e.g., to add a customized role to tf-operator without re-compiling tf-operator), they can just deploy the corresponding decorator server, expose it, and specify its address in the launch args.
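The decorator flow above can be sketched as follows. The `PodReconcilerInterface`/`BasePodReconciler` names mirror the discussion, but the stub `PodTemplateSpec` and `ReplicaType` types stand in for `corev1`/`commonv1` so the snippet is self-contained; the exact signatures in kubeflow/common may differ:

```go
package main

import "fmt"

// Stub types standing in for corev1.PodTemplateSpec and commonv1.ReplicaType;
// in the real operator these come from k8s.io/api and kubeflow/common.
type PodTemplateSpec struct {
	Env map[string]string
}
type ReplicaType string

// Decorator mutates a pod template for one replica type. A remote decorator
// server would be wrapped behind this same function type.
type Decorator func(tmpl *PodTemplateSpec, rtype ReplicaType) error

// BasePodReconciler keeps a registry of decorators keyed by ReplicaType,
// populated from the --decorator launch args.
type BasePodReconciler struct {
	decorators map[ReplicaType]Decorator
}

// DecoratePod is a no-op when no decorator is registered for rtype,
// mirroring the proposed BasePodReconciler behaviour.
func (r *BasePodReconciler) DecoratePod(tmpl *PodTemplateSpec, rtype ReplicaType) error {
	d, ok := r.decorators[rtype]
	if !ok {
		return nil
	}
	return d(tmpl, rtype)
}

// ConstructPod builds the template and calls DecoratePod just before
// returning, matching the ReconcilePod -> CreatePod -> ConstructPod chain.
func (r *BasePodReconciler) ConstructPod(rtype ReplicaType) (*PodTemplateSpec, error) {
	tmpl := &PodTemplateSpec{Env: map[string]string{}}
	if err := r.DecoratePod(tmpl, rtype); err != nil {
		return nil, err
	}
	return tmpl, nil
}

func main() {
	r := &BasePodReconciler{decorators: map[ReplicaType]Decorator{
		"CWorker": func(t *PodTemplateSpec, _ ReplicaType) error {
			t.Env["CLUSTER_SPEC"] = `{"worker":["worker0.example.com:2222"]}`
			return nil
		},
	}}
	tmpl, _ := r.ConstructPod("CWorker")
	fmt.Println(tmpl.Env["CLUSTER_SPEC"])
}
```

A derived reconciler that overrides `DecoratePod` slots into the same call chain without touching `ConstructPod`.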
Yeah. I am thinking about how we can inject the "clusterSpec" environment for different frameworks:
```json
{
  "worker": ["worker0.example.com:2222", "worker1.example.com:2222", "worker2.example.com:2222"],
  "ps": ["ps0.example.com:2222", "ps1.example.com:2222"]
}
```
Different frameworks have different settings for this part. The easiest way is to have some predefined templates in the code.
```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
  labels:
    # CustomJob can leverage this label to determine how it injects the environment.
    # We could even put the topology format here to further simplify the
    # controller's work, but that would be error-prone.
    framework: tensorflow
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      ....
```
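Putting the pieces together, the controller could dispatch on the `framework` label to pick an environment template; `injectors` and `injectEnv` are hypothetical names sketching the idea, and both renderers are deliberately simplified:

```go
package main

import (
	"fmt"
	"strings"
)

// ClusterSpec is the generic role -> addresses map from the discussion.
type ClusterSpec map[string][]string

// EnvInjector turns a ClusterSpec into framework-specific env vars.
type EnvInjector func(spec ClusterSpec) map[string]string

// injectors is keyed by the value of the `framework` label on the job.
// Both entries are simplified illustrations, not the operators' real code.
var injectors = map[string]EnvInjector{
	"tensorflow": func(spec ClusterSpec) map[string]string {
		// Real code would render the full per-replica TF_CONFIG JSON.
		return map[string]string{"WORKER_HOSTS": strings.Join(spec["worker"], ",")}
	},
	"pytorch": func(spec ClusterSpec) map[string]string {
		// PyTorch rendezvous conventionally uses MASTER_ADDR/MASTER_PORT.
		hostPort := strings.Split(spec["worker"][0], ":")
		return map[string]string{"MASTER_ADDR": hostPort[0], "MASTER_PORT": hostPort[1]}
	},
}

// injectEnv looks up the injector by the framework label; unknown values
// fall back to no injection, leaving the pod template untouched.
func injectEnv(labels map[string]string, spec ClusterSpec) map[string]string {
	if inj, ok := injectors[labels["framework"]]; ok {
		return inj(spec)
	}
	return nil
}

func main() {
	spec := ClusterSpec{"worker": {"worker0.example.com:2222"}}
	fmt.Println(injectEnv(map[string]string{"framework": "pytorch"}, spec))
}
```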