mpi-operator
Get the IP/Hostname of rank 0 to all the worker replicas
Hi, I have an app that I am running with the mpi-operator. One of its requirements is that the replicas need to know the IP/hostname of rank 0. I checked the environment of a Kubeflow v1alpha2 mpi-operator job, but the individual rank environments do not contain the IP/host of rank 0.
Any ideas on how to get the IP/host of rank 0 to all the other ranks? The closest I can see is this:
[1,0]<stdout>:PMIX_HOSTNAME=kubeflow2-env-worker-0
[1,0]<stdout>:PMIX_VERSION=3.2.2
[1,0]<stdout>:OMPI_COMM_WORLD_RANK=0
[1,0]<stdout>:OMPI_COMM_WORLD_LOCAL_RANK=0
[1,0]<stdout>:OMPI_COMM_WORLD_NODE_RANK=0
[1,0]<stdout>:OMPI_MCA_orte_ess_node_rank=0
Relying on that assumes the hostname of rank 0 will always be <metadata-name>-worker-0.
I believe ranks are assigned after mpirun is called, so there is nothing mpi-operator can do to predict which pod will get a particular rank.
For a generic solution, you can add a piece of code to your script that lets the rank 0 process broadcast its IP/hostname to the rest of the ranks once the processes have been launched by mpirun.
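A minimal sketch of that broadcast, assuming the app is written in Python and mpi4py is installed in the image (the thread does not say which language the app uses, so adapt as needed). The function name rank0_hostname is made up for illustration. The guard on OMPI_COMM_WORLD_RANK assumes Open MPI, which sets that variable as shown in the environment dump in the question.

```python
import os
import socket


def rank0_hostname(comm):
    """Return rank 0's hostname on every rank of `comm`.

    `comm` is expected to behave like an mpi4py communicator
    (Get_rank() and bcast()); the function name is hypothetical.
    """
    # Only rank 0 looks up its own hostname; the other ranks pass a placeholder.
    hostname = socket.gethostname() if comm.Get_rank() == 0 else None
    # bcast is collective: every rank must call it, and every rank
    # receives the value supplied by the root (rank 0).
    return comm.bcast(hostname, root=0)


# Assumption: Open MPI's mpirun sets OMPI_COMM_WORLD_RANK, so the MPI part
# only runs when the script is actually launched by mpirun.
if __name__ == "__main__" and "OMPI_COMM_WORLD_RANK" in os.environ:
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    master = rank0_hostname(comm)
    print(f"rank {comm.Get_rank()}: rank 0 runs on {master}")
```

If the app needs an IP rather than a hostname, rank 0 could resolve its own hostname (for example with socket.gethostbyname) before the broadcast; whether that returns the pod IP depends on how DNS is set up in the cluster.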
If you are using Open MPI, it seems to support a rankfile, which lets you define the rank assignment in a file. However, that way you might need to modify mpi-operator to deliver such a rankfile via a ConfigMap (or you could run a script that converts /etc/mpi/hostfile into a rankfile).
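A rough sketch of such a conversion script, under a few assumptions that are not confirmed in this thread: each hostfile line looks like "<hostname> slots=<n>" (slots defaults to 1 if missing), and the rankfile uses Open MPI's "rank <N>=<host> slot=<id>" line format. The function name hostfile_to_rankfile is hypothetical. Ranks are numbered in hostfile order, so the first host listed gets rank 0.

```python
import sys


def hostfile_to_rankfile(hostfile_text):
    """Convert hostfile text into rankfile text (hypothetical helper).

    Assumes hostfile lines of the form '<hostname> slots=<n>' and emits
    one 'rank <N>=<hostname> slot=<i>' line per slot, in hostfile order.
    """
    lines = []
    rank = 0
    for raw in hostfile_text.splitlines():
        # Drop comments and surrounding whitespace; skip empty lines.
        entry = raw.split("#", 1)[0].strip()
        if not entry:
            continue
        fields = entry.split()
        host = fields[0]
        slots = 1  # assumed default when no slots=<n> field is present
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        # One rank per slot; slot=<i> pins the rank to that slot on the host.
        for slot in range(slots):
            lines.append(f"rank {rank}={host} slot={slot}")
            rank += 1
    return "\n".join(lines) + "\n"


# Usage: python convert.py /etc/mpi/hostfile /tmp/rankfile
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as src:
        rankfile_text = hostfile_to_rankfile(src.read())
    with open(sys.argv[2], "w") as dst:
        dst.write(rankfile_text)
```

The generated file would then be passed to mpirun through its rankfile option; the exact flag differs between Open MPI versions, so check the mpirun man page for the version in your image.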