
MPIJobs and Istio

Open kimwnasptd opened this issue 2 years ago • 4 comments

I tried to run some MPIJobs with Istio enabled in the user namespaces and bumped into a couple of issues. I'll use this issue to describe the bugs that occurred as well as proposed solutions, although we might need to break it into smaller issues.

I used the tensorflow-benchmarks example, so this will be my point of reference.

The problems we've observed are the following:

  1. The main container needs to wait for the Istio sidecar to start
  2. Workers communicate with the Launcher Pod via Pod IPs, so the traffic goes through Istio's PassthroughCluster. This could be a problem in stricter security environments where the mTLS mode is STRICT (direct pod-to-pod traffic is not mTLS, so the requests will be blocked)
    • https://istio.io/latest/docs/ops/configuration/traffic-management/traffic-routing/#headless-services
    • https://istio.io/latest/docs/ops/configuration/traffic-management/traffic-routing/#unmatched-traffic
  3. The Launcher uses kubectl exec, which can use the sidecar
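
To make items 1 and 2 more concrete, here is a minimal sketch in Python that builds the relevant manifests as plain dictionaries. It is an illustration under assumptions, not something the operator does today. The helper names (`with_hold_annotation`, `strict_peer_authentication`), the container name, the image, and the namespace are hypothetical. The `proxy.istio.io/config` annotation with `holdApplicationUntilProxyStarts` is one possible way to address item 1, and the `PeerAuthentication` object shows the kind of STRICT policy under which item 2 would surface.

```python
import json

# Assumption: Istio's per-pod proxy config annotation is used to ask the
# injected sidecar to hold the application container until the proxy is ready.
HOLD_ANNOTATION = {
    "proxy.istio.io/config": json.dumps({"holdApplicationUntilProxyStarts": True})
}


def with_hold_annotation(pod_template: dict) -> dict:
    """Return a copy of a pod template with the hold annotation merged in.

    Hypothetical helper: existing annotations are preserved.
    """
    metadata = dict(pod_template.get("metadata", {}))
    annotations = dict(metadata.get("annotations", {}))
    annotations.update(HOLD_ANNOTATION)
    metadata["annotations"] = annotations
    return {**pod_template, "metadata": metadata}


def strict_peer_authentication(namespace: str) -> dict:
    """Build a namespace-wide PeerAuthentication with mTLS mode STRICT.

    Hypothetical helper: this is the kind of policy under which plain
    pod-IP traffic between workers and the launcher would be rejected.
    """
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


if __name__ == "__main__":
    # Placeholder pod template; the container name and image are made up.
    template = {"spec": {"containers": [{"name": "worker", "image": "example/image"}]}}
    print(json.dumps(with_hold_annotation(template), indent=2))
    print(json.dumps(strict_peer_authentication("user-namespace"), indent=2))
```

The printed JSON could be placed in the Launcher and Worker pod templates of an MPIJob and applied with kubectl. Whether holding the application container is enough for the Launcher's startup sequence is something that would still need to be verified.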

Apologies if there are already duplicate issues for Istio and MPIJob. Are there any plans to mitigate some of these issues and ensure MPIJob can work with Istio?

Also, if I need to open this issue in kubeflow/training-operator instead, please tell me and I'll open a new one.

kimwnasptd, Nov 02 '22 14:11

cc @terrytangyuan @johnugeorge @alculquicondor @gaocegege

andreyvelich, Nov 08 '22 12:11

/cc @zw0610

gaocegege, Nov 08 '22 13:11

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot], Aug 24 '23 15:08

/lifecycle frozen

tenzen-y, Aug 28 '23 06:08