Unable to use mpirun over IPv6 with Horovod & NCCL
I am trying to run multi-node neural network training over IPv6. Our training scripts used to use a Docker Swarm overlay IPv4 network that was unreliable and hard to set up, so I wanted to try a macvlan bridge network instead. IPv6 itself works without problems: I can ping6 a container running on one node from a container running on another node, and I can SSH from one container into another (running on the same or a different node) without a password. After solving a problem with using IPv6 addresses with mpirun (https://github.com/open-mpi/ompi/issues/6656), I am now hitting another issue. When I run a simple program or a Bash script like this:
#!/bin/bash
host=$(hostname)
while true; do
    echo "Hello from ${host}!"
    sleep 1s
done
everything works fine and I get the expected output:
...
0: [1,10]<stdout>:Hello from 88bdc6a70714!
0: [1,11]<stdout>:Hello from 88bdc6a70714!
0: [1,4]<stdout>:Hello from bee713853e8b!
0: [1,5]<stdout>:Hello from bee713853e8b!
0: [1,12]<stdout>:Hello from 88bdc6a70714!
0: [1,6]<stdout>:Hello from bee713853e8b!
0: [1,7]<stdout>:Hello from bee713853e8b!
0: [1,13]<stdout>:Hello from 88bdc6a70714!
0: [1,14]<stdout>:Hello from 88bdc6a70714!
0: [1,15]<stdout>:Hello from 88bdc6a70714!
0: [1,0]<stdout>:Hello from bee713853e8b!
...
But when I try to run a training script that uses Horovod and NVIDIA NCCL, it fails with an error:
0: --------------------------------------------------------------------------
0: WARNING: Open MPI failed to TCP connect to a peer MPI process. This
0: should not happen.
0:
0: Your Open MPI job may now hang or fail.
0:
0: Local host: 3c5ccf2b5461
0: PID: 58
0: Message: connect() to 172.25.0.7:1025 failed
0: Error: Operation now in progress (115)
0: --------------------------------------------------------------------------
172.25.0.7 is the IPv4 address of the container that called mpirun, and 3c5ccf2b5461 is the hostname of a peer container running on a different node. So it looks like, for some reason, Open MPI running on the peer container is trying to use IPv4 for communication instead of IPv6.
The command line includes --mca btl self,tcp
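Roughly, the launch looks like this (the host names, slot counts, and training script below are placeholders, not my exact command):

mpirun -np 16 -H node1:8,node2:8 \
    --mca btl self,tcp \
    python train.py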
If I remove "tcp" from the btl list, I get this error:
0: --------------------------------------------------------------------------
0: At least one pair of MPI processes are unable to reach each other for
0: MPI communications. This means that no Open MPI device has indicated
0: that it can be used to communicate between these processes. This is
0: an error; Open MPI requires that all MPI processes be able to reach
0: each other. This error can sometimes be the result of forgetting to
0: specify the "self" BTL.
0:
0: Process 1 ([[11282,1],4]) is on host: 919243f56ed6
0: Process 2 ([[11282,1],0]) is on host: 919243f56ed6
0: BTLs attempted: self
0:
0: Your MPI job is now going to abort; sorry.
0: --------------------------------------------------------------------------
0: --------------------------------------------------------------------------
0: MPI_INIT has failed because at least one MPI process is unreachable
0: from another. This *usually* means that an underlying communication
0: plugin -- such as a BTL or an MTL -- has either not loaded or not
0: allowed itself to be used. Your MPI job will now abort.
0:
0: You may wish to try to narrow down the problem;
0:
0: * Check the output of ompi_info to see which BTL/MTL plugins are
0: available.
0: * Run your application with MPI_THREAD_SINGLE.
0: * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
0: if using MTL-based communications) to see exactly which
0: communication plugins were considered and/or discarded.
0: --------------------------------------------------------------------------
0: [919243f56ed6:00037] [[11282,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 513
0: [919243f56ed6:00055] *** An error occurred in MPI_Init_thread
0: [919243f56ed6:00055] *** reported by process [739377153,4]
0: [919243f56ed6:00055] *** on a NULL communicator
0: [919243f56ed6:00055] *** Unknown error
0: [919243f56ed6:00055] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
0: [919243f56ed6:00055] *** and potentially your MPI job)
0: [2ad8f93ebf8e:00051] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
0: [2ad8f93ebf8e:00051] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
0: [2ad8f93ebf8e:00051] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
0: [2ad8f93ebf8e:00051] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
0: [919243f56ed6:00037] 15 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
0: [919243f56ed6:00037] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
0: [919243f56ed6:00037] 14 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:pml-add-procs-fail
0: [919243f56ed6:00037] 15 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
If I instead specify --mca btl ^tcp, I get the following error:
0: --------------------------------------------------------------------------
0: By default, for Open MPI 4.0 and later, infiniband ports on a device
0: are not used by default. The intent is to use UCX for these devices.
0: You can override this policy by setting the btl_openib_allow_ib MCA parameter
0: to true.
0:
0: Local host: ea58eb377e68
0: Local adapter: mlx5_3
0: Local port: 1
0:
0: --------------------------------------------------------------------------
0: --------------------------------------------------------------------------
0: WARNING: There was an error initializing an OpenFabrics device.
0:
0: Local host: ea58eb377e68
0: Local device: mlx5_3
0: --------------------------------------------------------------------------
0: --------------------------------------------------------------------------
0: At least one pair of MPI processes are unable to reach each other for
0: MPI communications. This means that no Open MPI device has indicated
0: that it can be used to communicate between these processes. This is
0: an error; Open MPI requires that all MPI processes be able to reach
0: each other. This error can sometimes be the result of forgetting to
0: specify the "self" BTL.
0:
0: Process 1 ([[29231,1],8]) is on host: 5fc2796b124d
0: Process 2 ([[29231,1],0]) is on host: ea58eb377e68
0: BTLs attempted: self vader smcuda
0:
0: Your MPI job is now going to abort; sorry.
0: --------------------------------------------------------------------------
I also tried adding
--mca btl_tcp_if_exclude 172.0.0.0/8
to the command, but then everything just hangs.
What am I doing wrong? Why does this happen only when using Horovod & NCCL? Should I pass any additional command-line arguments? I am using Open MPI 4.0.1 in Docker containers with Ubuntu 16.04, running on NVIDIA DGX servers.
There are many things to answer here, but first: if you want to use TCP, you should definitely not use --mca btl ^tcp, or you will be unable to communicate across nodes in AWS (unless you run on a partition with IB).
You need to make sure the version of Open MPI you are using is IPv6-ready. You can check whether this is the case with ompi_info | grep IPv6. If not, you need to compile your own Open MPI, passing --enable-ipv6 during the configure step.
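For example, the check and (if needed) the rebuild might look like this; the install prefix is just a placeholder:

# Check whether the installed Open MPI was built with IPv6 support
ompi_info | grep IPv6

# If the line is missing, rebuild from source with IPv6 enabled
./configure --enable-ipv6 --prefix=/opt/openmpi
make -j && make install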
Finally, you need to have your AWS instances in the same security group and add a rule that allows your instances to reach one another. In the Security Group's Inbound tab (where you already have the SSH port 22 rule), add a rule with port range 1024 - 65525 and source sg-0253183abf48a80e1.
The bash script doesn't initialize MPI at all, so an intermediate step would be to run an MPI hello world program (to take the DL framework, Horovod, and NCCL out of the picture). You can find one in the examples/ directory of Open MPI: https://github.com/open-mpi/ompi/blob/master/examples/hello_c.c
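Building and launching that example across your two containers might look something like this (the container host names are placeholders):

# Compile the stock hello world with the Open MPI compiler wrapper
mpicc hello_c.c -o hello_c

# Launch one process on each container, over TCP only
mpirun -np 2 -H container1,container2 --mca btl self,tcp ./hello_c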
Aside from that, and as a general comment, SSH working is a good first step, but the next one is to make sure a process in one container can open a port, pass it to the other container, and have the other side connect to it. One thing I usually do is use nc to confirm this works. For IPv6, that means launching nc -6 -l 12345 on one side and nc -6 ipv6-hostname 12345 on the other side. If you cannot connect, or don't see what you type on one side appear on the other, then you have a firewall/port-filtering issue.
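Concretely (the hostname and port are placeholders):

# On the first container: listen on an unprivileged port over IPv6
nc -6 -l 12345

# On the second container: connect to that port via the first container's IPv6 hostname/address
nc -6 ipv6-hostname 12345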
Finally, indeed, you need to configure Open MPI to use the IPv6 address of your interface, not the IPv4 one.
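One way to do that (just a sketch; eth1 and the training command below are placeholders for your actual IPv6 interface and application) is to pin both the TCP BTL and the out-of-band layer to the desired interface:

mpirun --mca btl self,tcp \
    --mca btl_tcp_if_include eth1 \
    --mca oob_tcp_if_include eth1 \
    python train.py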
Thanks for the comments. I will try all the suggestions. The Open MPI installed in the containers was built with IPv6 support.
@ted-kapustin Were you able to get Open MPI to work in your environment? May we close this issue?
It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.
Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.
I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!