mpi-operator

When the worker count is higher, the launcher fails with "kubectl not found"

Open antshuwen opened this issue 5 years ago • 4 comments

We have been troubled by a weird problem. When the number of workers is low, everything works. But when it is higher (70 in our case), the launcher fails with "kubectl not found". The detailed logs are below:

[image: screenshot of the launcher logs]

We consulted experts and got some suggestions. In the end, we put kubectl and its config into the right directories of our image in advance, and everything now works. We are still confused about the root cause. Discussion is welcome.

antshuwen avatar Dec 11 '19 11:12 antshuwen

Also found the same issue in our cluster...

ywskycn avatar Dec 11 '19 16:12 ywskycn

It's possible that deliver_kubectl.sh is not being executed successfully (or finished) on each worker so that /opt/kube/kubectl may not exist yet. If you copy kubectl to /opt/kube/kubectl in advance in this Dockerfile (similar to my offline suggestion to you earlier), this problem should be fixed.
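To check the hypothesis above, one option is a small pre-flight check that waits for the binary to appear before anything invokes it. The following is only a minimal sketch: `wait_for_executable`, its timeout, and its polling interval are hypothetical names and values chosen for illustration, not part of mpi-operator, and it assumes Python is available in the image.

```python
import os
import time


def wait_for_executable(path, timeout_s=60.0, poll_s=0.5):
    """Poll until `path` is an executable regular file.

    Hypothetical helper for illustration only; not part of mpi-operator.
    Returns True as soon as the file exists and is executable,
    or False if `timeout_s` seconds pass first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # The file must both exist and carry an executable bit.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling `wait_for_executable("/opt/kube/kubectl")` before the launch step and logging a `False` result would show whether the binary is merely late or is missing altogether, which helps separate a delivery race from a deleted mount path.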

terrytangyuan avatar Dec 11 '19 17:12 terrytangyuan

> It's possible that deliver_kubectl.sh is not being executed successfully (or finished) on each worker so that /opt/kube/kubectl may not exist yet. If you copy kubectl to /opt/kube/kubectl in advance in this Dockerfile (similar to my offline suggestion to you earlier), this problem should be fixed.

We found that all the mount paths, including /opt/kube and /root/.kube, were deleted while the clusterspec was being sent to the workers. After that, the launcher uses its own /opt/kube/kubectl. The same goes for the kube config.

antshuwen avatar Dec 12 '19 04:12 antshuwen

What is deleting the mount paths? You mentioned cluster spec, is this an issue with the TensorFlow version you are using? It’s weird that this only happens to some of your workers.

terrytangyuan avatar Dec 12 '19 09:12 terrytangyuan