mpi-operator
When the worker count is higher, the launcher fails with "kubectl not found"
We have been troubled by a weird problem. When the worker count is low, everything is OK. But when the worker count is higher, in our case 70, the launcher fails with "kubectl not found". The detailed logs are below:

We consulted experts and got some suggestions. In the end we put kubectl and the kube config into the right directories of our image in advance, and everything now works. We are still confused about the root cause. Discussion is welcome.
Also found the same issue in our cluster...
It's possible that deliver_kubectl.sh is not being executed successfully (or finished) on each worker so that /opt/kube/kubectl may not exist yet. If you copy kubectl to /opt/kube/kubectl in advance in this Dockerfile (similar to my offline suggestion to you earlier), this problem should be fixed.
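For reference, a minimal Dockerfile sketch of that workaround could look like the following; the base image and kubectl version are placeholders for whatever your training image actually uses:

```dockerfile
# Sketch only: bake kubectl into the training image so the launcher no longer
# depends on deliver_kubectl.sh having finished on every worker.
# Base image and kubectl version are placeholders.
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates \
 && mkdir -p /opt/kube \
 && curl -fsSL -o /opt/kube/kubectl https://dl.k8s.io/release/v1.21.0/bin/linux/amd64/kubectl \
 && chmod +x /opt/kube/kubectl
```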
We found that all of the mount paths, including /opt/kube and /root/.kube, are deleted while the cluster spec is being sent to the workers. After that, the launcher uses its own /opt/kube/kubectl. The same applies to the kube config.
What is deleting the mount paths? You mentioned the cluster spec; is this an issue with the TensorFlow version you are using? It's weird that this only happens to some of your workers.
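A quick way to narrow this down (a sketch; the MPIJob name "my-job" and the <job>-worker-<i> pod naming are assumptions about your setup) would be to check each worker pod for the binary and config before the launcher starts:

```sh
# Hypothetical MPIJob named "my-job" with 70 workers; adjust names to your cluster.
for i in $(seq 0 69); do
  kubectl exec "my-job-worker-$i" -- ls -l /opt/kube/kubectl /root/.kube/config \
    || echo "worker $i is missing /opt/kube/kubectl or the kube config"
done
```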