
MPI backend examples launch processes independently in each pod

Open jwwandy opened this issue 7 years ago • 9 comments

https://github.com/kubeflow/pytorch-operator/blob/master/examples/ddp/mnist/gpu/v1alpha2/job_mnist_DDP_GPU.yaml

When launching the MPI backend example above with ENTRYPOINT ["mpirun", "-n", "4", "--allow-run-as-root", "python", "-u", "/opt/pytorch_dist_mnist/mnist_ddp_gpu.py"] in the Dockerfile, I expected it to do distributed training, launching 1 process on each pod (4 in total, with 1 master and 3 workers).

However, it seems like it launched 4 processes on each pod, and they trained independently. Is there anything I misunderstood about this example?

jwwandy avatar Oct 24 '18 10:10 jwwandy

@Akado2009

johnugeorge avatar Oct 24 '18 10:10 johnugeorge

@jwwandy, greetings. For now, as far as I know, the -n option in mpirun specifies the number of copies of the process to run, not the number of containers/pods.

Akado2009 avatar Nov 20 '18 06:11 Akado2009

@Akado2009 Nice to hear from you. Exactly as you mentioned, the -n option in mpirun specifies the number of copies of the process to run, and it should schedule them on different MPI nodes (no matter how many there are).

What confuses me is that the current examples seem to have no mechanism for Open MPI to discover the pods as a single MPI cluster and launch processes across them. Each pod acts as an independent MPI cluster launching its own group of processes.

If the examples are only meant to launch multiple processes independently on each pod, rather than do distributed training across pods, then the current example works well. However, I assume part of the purpose of Kubeflow is to do distributed training across multiple pods, which for now I can only achieve by adding SSH keys after the pods are created, as described in the Open MPI rsh/ssh FAQ: https://www.open-mpi.org/faq/?category=rsh
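To make the difference concrete, here is a rough sketch of the two launch modes; the hostfile path and hostnames below are placeholders I made up, not anything the operator actually creates:

```bash
# With no hostfile, Open MPI places all ranks on localhost, so every pod
# independently runs 4 copies of the training script:
mpirun -n 4 --allow-run-as-root \
    python -u /opt/pytorch_dist_mnist/mnist_ddp_gpu.py

# A cross-pod launch would need the pods to be reachable as MPI hosts,
# e.g. via a hostfile listing one slot per pod, such as:
#   pytorch-master-0 slots=1
#   pytorch-worker-0 slots=1
#   pytorch-worker-1 slots=1
#   pytorch-worker-2 slots=1
mpirun -n 4 --allow-run-as-root --hostfile /etc/mpi/hostfile \
    python -u /opt/pytorch_dist_mnist/mnist_ddp_gpu.py
```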

jwwandy avatar Nov 20 '18 07:11 jwwandy

@jwwandy Sorry for the late response, I was busy working. But yeah, you're right, this example treats each pod as a separate Open MPI cluster.

I was thinking about making an upgraded version of this example that treats your k8s cluster as an Open MPI cluster; then your job would be truly distributed.

Akado2009 avatar Nov 20 '18 07:11 Akado2009

@Akado2009 Thanks for making that clear.

Although my current workaround is quite dirty, using a shell script and the downward API to set up all the SSH machinery after pod creation, I think these steps could (and should) be done by the controller (a rough sketch of the script follows below):

  1. Generate a private SSH key for each pod and broadcast the public key to all pods as authorized keys
  2. Add SSH known_hosts entries (or disable strict host key checking in ssh_config)
  3. Provide a hostfile (with the hostnames from the YAML) for the mpirun --hostfile option

Hope these short steps help.
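For reference, a minimal sketch of that startup script, assuming a shared volume mounted at /ssh-keys for exchanging public keys; the paths and hostnames are placeholders, not anything the operator provides:

```bash
#!/bin/bash
# Rough sketch of the workaround; /ssh-keys is a hypothetical shared
# volume and the hostnames are placeholders taken from the job YAML.

# 1. Generate a per-pod key pair and publish the public key so that
#    every pod can add every other pod to its authorized_keys.
mkdir -p /root/.ssh
ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa
cp /root/.ssh/id_rsa.pub /ssh-keys/"$(hostname)".pub
# (in practice, wait here until all pods have published their keys)
cat /ssh-keys/*.pub >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys

# 2. Skip host key verification instead of maintaining known_hosts.
cat >> /etc/ssh/ssh_config <<'EOF'
Host *
    StrictHostKeyChecking no
EOF

# 3. Build a hostfile from the pod hostnames declared in the YAML.
mkdir -p /etc/mpi
cat > /etc/mpi/hostfile <<'EOF'
pytorch-master-0 slots=1
pytorch-worker-0 slots=1
pytorch-worker-1 slots=1
pytorch-worker-2 slots=1
EOF

# Only the master runs mpirun; workers just run sshd and wait.
if [[ "$(hostname)" == *master* ]]; then
    mpirun -n 4 --allow-run-as-root --hostfile /etc/mpi/hostfile \
        python -u /opt/pytorch_dist_mnist/mnist_ddp_gpu.py
else
    /usr/sbin/sshd -D
fi
```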

jwwandy avatar Nov 20 '18 08:11 jwwandy

@jwwandy Yes, I agree that it should be done by the controller. Thank you for the workaround, I am going to try to implement this logic inside the controller :)

Akado2009 avatar Nov 20 '18 16:11 Akado2009

Any news about this issue?

ilchemla avatar Jul 08 '19 08:07 ilchemla

Can mpi-operator solve your issue? What is your use case?

johnugeorge avatar Jul 08 '19 08:07 johnugeorge

/area operator
/kind feature
/priority p2

jtfogarty avatar Jan 14 '20 21:01 jtfogarty