Get the IP/Hostname of rank 0 to all the worker replicas

Open amrragab8080 opened this issue 3 years ago • 1 comments

Hi, I have an app that I am running with the mpi-operator. One of its requirements is that the replicas need to know the IP/hostname of rank 0 (the master). I checked the environment of a Kubeflow v1alpha2 mpi-operator job, but the individual rank environments do not contain the IP/hostname of rank 0.

Any ideas on how to get the IP/hostname of rank 0 to all the other ranks? The closest I can see is this:

[1,0]<stdout>:PMIX_HOSTNAME=kubeflow2-env-worker-0
[1,0]<stdout>:PMIX_VERSION=3.2.2
[1,0]<stdout>:OMPI_COMM_WORLD_RANK=0
[1,0]<stdout>:OMPI_COMM_WORLD_LOCAL_RANK=0
[1,0]<stdout>:OMPI_COMM_WORLD_NODE_RANK=0
[1,0]<stdout>:OMPI_MCA_orte_ess_node_rank=0

So that assumes the hostname of rank 0 will always be <metadata-name>-worker-0.

amrragab8080 Mar 19 '21 19:03

I believe the rank is assigned after mpirun is called. So there is nothing mpi-operator can do to predict which pod will be rank 1.

For a generic solution, you can add a piece of code to your script that has the rank 0 process broadcast its IP/hostname to the remaining ranks once the processes have been launched by mpirun.
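A minimal sketch of that broadcast, assuming the application is written in Python and mpi4py is installed in the worker image (neither is stated in the question). The helper name `share_rank0_host` is made up for illustration; `Get_rank` and `bcast` are the standard mpi4py communicator methods.

```python
import socket
import sys


def share_rank0_host(comm):
    """Return the hostname of rank 0 on every rank of `comm`.

    `comm` is expected to behave like an mpi4py communicator
    (it must provide Get_rank() and bcast(obj, root=...)).
    """
    # Only rank 0 looks up its own hostname; the other ranks pass a placeholder.
    hostname = socket.gethostname() if comm.Get_rank() == 0 else None
    # bcast is collective: every rank must call it, and every rank
    # receives the value supplied by the root.
    return comm.bcast(hostname, root=0)


def main():
    try:
        # Assumption: mpi4py is available; imported here so the helper
        # above stays importable on machines without MPI.
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not installed; nothing to do", file=sys.stderr)
        return
    comm = MPI.COMM_WORLD
    master = share_rank0_host(comm)
    print(f"rank {comm.Get_rank()}: rank 0 runs on {master}")


if __name__ == "__main__":
    main()
```

Launched through mpirun, every rank prints the same hostname, whichever pod ends up hosting rank 0. If an IP address is needed instead, rank 0 could broadcast `socket.gethostbyname(socket.gethostname())`, provided the pod hostname resolves inside the cluster.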

If you are using Open MPI, it appears to support a rankfile, in which you can define the rank assignment. However, you might then need to modify mpi-operator to deliver the rankfile via a ConfigMap (or you could run a script that converts /etc/mpi/hostfile into a rankfile).
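The hostfile-to-rankfile conversion could look roughly like the sketch below. It assumes the hostfile uses Open MPI's `<host> slots=<n>` line format and emits Open MPI rankfile lines of the form `rank <N>=<host> slot=<S>`. The function name and the choice to assign one slot index per rank are illustrative assumptions, not something mpi-operator provides; check the rankfile syntax and the mpirun option for your Open MPI version before relying on it.

```python
def hostfile_to_rankfile(hostfile_text):
    """Convert Open MPI hostfile text into rankfile text.

    Ranks are numbered in hostfile order, so the first host listed
    receives rank 0.
    """
    lines = []
    rank = 0
    for raw in hostfile_text.splitlines():
        # Drop comments and surrounding whitespace, skip empty lines.
        entry = raw.split("#", 1)[0].split()
        if not entry:
            continue
        host = entry[0]
        slots = 1  # assumption: a host without "slots=" contributes one rank
        for token in entry[1:]:
            if token.startswith("slots="):
                slots = int(token.split("=", 1)[1])
        # Assumption: one rank per slot, bound to slot indices 0..slots-1.
        for slot in range(slots):
            lines.append(f"rank {rank}={host} slot={slot}")
            rank += 1
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    example = "job-worker-0 slots=2\njob-worker-1 slots=1\n"  # made-up hosts
    print(hostfile_to_rankfile(example), end="")
```

With a rankfile like this, rank 0 is pinned to the first host listed, so the other ranks can derive its hostname up front instead of learning it at runtime.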

zw0610 Mar 21 '21 03:03