training-operator
MPIJobs and Istio
I tried to run some MPIJobs with Istio enabled in the user namespaces and ran into a couple of issues. I'll use this issue to describe the problems that occurred along with proposed solutions, although we might need to break this into smaller issues.
I used the tensorflow-benchmarks example, so this will be my point of reference.
The problems we've observed are the following:
- The main container will need to wait for the sidecar to start
  - We can use Istio's `proxy.istio.io/config: '{"holdApplicationUntilProxyStarts": true}'` annotation on the pods
- Workers communicate with the Launcher Pod via Pod IPs, which goes through Istio's PassthroughCluster. This could be a problem in stricter security environments where mTLS mode is STRICT (direct pod-to-pod traffic is not mTLS, so requests will be blocked)
  - https://istio.io/latest/docs/ops/configuration/traffic-management/traffic-routing/#headless-services
  - https://istio.io/latest/docs/ops/configuration/traffic-management/traffic-routing/#unmatched-traffic
- The Launcher will use `kubectl exec`, which can end up targeting the sidecar container instead of the main one
  - We could extend the controller to set the `kubectl.kubernetes.io/default-container` annotation in the worker pods
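
To make the two annotation-based proposals above concrete, here is a minimal sketch of the patching logic, written in Python over plain dicts purely for illustration. It is not the controller's actual code; the helper name `add_istio_annotations`, the container name `worker`, and the image name are all hypothetical. Only the two annotation keys and the `holdApplicationUntilProxyStarts` value come from the list above.

```python
import copy
import json

# Annotation keys quoted in the list above.
HOLD_APP_ANNOTATION = "proxy.istio.io/config"
DEFAULT_CONTAINER_ANNOTATION = "kubectl.kubernetes.io/default-container"


def add_istio_annotations(pod_template, main_container):
    """Return a copy of a pod template dict with both annotations set.

    Hypothetical helper: annotations already present on the template are
    left untouched, so user-supplied values take precedence.
    """
    patched = copy.deepcopy(pod_template)
    annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
    # Ask Istio to hold the application container until the proxy is ready.
    annotations.setdefault(
        HOLD_APP_ANNOTATION,
        json.dumps({"holdApplicationUntilProxyStarts": True}),
    )
    # Make `kubectl exec` default to the main container, not the sidecar.
    annotations.setdefault(DEFAULT_CONTAINER_ANNOTATION, main_container)
    return patched


# Hypothetical worker pod template; names are placeholders.
worker_template = {
    "spec": {"containers": [{"name": "worker", "image": "example/image"}]}
}
patched = add_istio_annotations(worker_template, "worker")
print(json.dumps(patched["metadata"]["annotations"], indent=2))
```

Using `setdefault` keeps any annotation a user has already set on the template, which seems like the safer default if the controller were to inject these automatically. This does not address the PassthroughCluster / STRICT mTLS point, which would need a change in how workers and the launcher address each other.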
Apologies if there are already issues covering Istio and MPIJob. Are there any plans for mitigating some of these issues and ensuring MPIJob can work with Istio?
Also if I need to open this issue in kubeflow/training-operator please tell me and I'll open a new one.
cc @terrytangyuan @johnugeorge @alculquicondor @gaocegege
/cc @zw0610
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen