
MPI backend examples launch processes independently in each pod

Open jwwandy opened this issue 7 years ago • 9 comments

https://github.com/kubeflow/pytorch-operator/blob/master/examples/ddp/mnist/gpu/v1alpha2/job_mnist_DDP_GPU.yaml

When launching the MPI backend example above with ENTRYPOINT ["mpirun", "-n", "4", "--allow-run-as-root", "python", "-u", "/opt/pytorch_dist_mnist/mnist_ddp_gpu.py"] in the Dockerfile, I expected it to do distributed training, launching 1 process on each pod (4 in total, with 1 master and 3 workers).

However, it seems like it launched 4 processes on each pod, and they trained independently. Is there anything I misunderstood about this example?

jwwandy avatar Oct 24 '18 10:10 jwwandy

@Akado2009

johnugeorge avatar Oct 24 '18 10:10 johnugeorge

@jwwandy, greetings. For now, as far as I know, the -n option in mpirun specifies the number of copies of the process to run, not the number of containers/pods.

Akado2009 avatar Nov 20 '18 06:11 Akado2009

@Akado2009 Nice to hear from you. Exactly as you mentioned, the -n option in mpirun specifies the number of copies of the process to run, and it should schedule them on different MPI nodes (no matter how many there are).

What confuses me is that the current examples seem to have no mechanism for Open MPI to discover the pods as a single MPI cluster and launch processes across them. Each pod acts as an independent MPI cluster launching its own group of processes.

If the examples are only meant to launch multiple processes independently on each pod, rather than do distributed training across pods, then the current example works well. However, I assume part of the purpose of Kubeflow is to do distributed training across multiple pods, which for now I can only achieve by adding SSH keys after the pods are created, as described in the Open MPI rsh/ssh FAQ: https://www.open-mpi.org/faq/?category=rsh
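To make the difference concrete, here is a rough sketch of the two launch modes; the hostfile path and hostnames below are placeholders I made up, not anything the operator actually creates:

```bash
# With no hostfile, Open MPI places all ranks on localhost, so every pod
# independently runs 4 copies of the training script:
mpirun -n 4 --allow-run-as-root \
    python -u /opt/pytorch_dist_mnist/mnist_ddp_gpu.py

# A cross-pod launch would need the pods to be reachable as MPI hosts,
# e.g. via a hostfile listing one slot per pod, such as:
#   pytorch-master-0 slots=1
#   pytorch-worker-0 slots=1
#   pytorch-worker-1 slots=1
#   pytorch-worker-2 slots=1
mpirun -n 4 --allow-run-as-root --hostfile /etc/mpi/hostfile \
    python -u /opt/pytorch_dist_mnist/mnist_ddp_gpu.py
```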

jwwandy avatar Nov 20 '18 07:11 jwwandy

@jwwandy Sorry for the late response, I was busy working. But yeah, you're right, this example treats each pod as a separate Open MPI cluster.

I was thinking about making an upgraded version of this example that treats your k8s cluster as an Open MPI cluster; then your job would be truly distributed.

Akado2009 avatar Nov 20 '18 07:11 Akado2009

@Akado2009 Thanks for making that clear.

Although my current workaround is quite dirty, using a shell script and the downward API to set up all the SSH machinery after pod creation, I think these steps could (and should) be done by the controller (a rough sketch of the script follows below):

  1. Generate a private SSH key for each pod and broadcast the public key to all pods as authorized keys
  2. Add SSH known_hosts entries (or disable strict host key checking in ssh_config)
  3. Provide a hostfile (with the hostnames from the YAML) for the mpirun --hostfile option

Hope these short steps help.
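For reference, a minimal sketch of that startup script, assuming a shared volume mounted at /ssh-keys for exchanging public keys; the paths and hostnames are placeholders, not anything the operator provides:

```bash
#!/bin/bash
# Rough sketch of the workaround; /ssh-keys is a hypothetical shared
# volume and the hostnames are placeholders taken from the job YAML.

# 1. Generate a per-pod key pair and publish the public key so that
#    every pod can add every other pod to its authorized_keys.
mkdir -p /root/.ssh
ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa
cp /root/.ssh/id_rsa.pub /ssh-keys/"$(hostname)".pub
# (in practice, wait here until all pods have published their keys)
cat /ssh-keys/*.pub >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys

# 2. Skip host key verification instead of maintaining known_hosts.
cat >> /etc/ssh/ssh_config <<'EOF'
Host *
    StrictHostKeyChecking no
EOF

# 3. Build a hostfile from the pod hostnames declared in the YAML.
mkdir -p /etc/mpi
cat > /etc/mpi/hostfile <<'EOF'
pytorch-master-0 slots=1
pytorch-worker-0 slots=1
pytorch-worker-1 slots=1
pytorch-worker-2 slots=1
EOF

# Only the master runs mpirun; workers just run sshd and wait.
if [[ "$(hostname)" == *master* ]]; then
    mpirun -n 4 --allow-run-as-root --hostfile /etc/mpi/hostfile \
        python -u /opt/pytorch_dist_mnist/mnist_ddp_gpu.py
else
    /usr/sbin/sshd -D
fi
```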

jwwandy avatar Nov 20 '18 08:11 jwwandy

@jwwandy Yes, I agree that it should be done by the controller. Thank you for the workaround, I am going to try to implement this logic inside the controller :)

Akado2009 avatar Nov 20 '18 16:11 Akado2009

Any news about this issue?

ilchemla avatar Jul 08 '19 08:07 ilchemla

Can mpi-operator solve your issue? What is your use case?

johnugeorge avatar Jul 08 '19 08:07 johnugeorge

/area operator
/kind feature
/priority p2

jtfogarty avatar Jan 14 '20 21:01 jtfogarty