TCP BTL fails to collect all interface addresses (when interfaces are on different subnets)
This issue is related to: #5818
I am encountering
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: nid002292
Local PID: 1273838
Peer hostname: nid002293 ([[9279,0],1])
Source IP of socket: 10.249.13.210
Known IPs of peer:
10.100.20.22
128.55.69.127
10.249.13.209
10.249.36.5
10.249.34.5
when running the CP2K container (https://catalog.ngc.nvidia.com/orgs/hpc/containers/cp2k) on NERSC's Perlmutter system (https://docs.nersc.gov/systems/perlmutter/architecture/). We've tried OpenMPI v4.1.2rc2 and v4.1.5.
Background
Perlmutter's GPU nodes have 4 NICs, each with a private IP address, and one NIC (the one corresponding to the hsn0 interface) has an additional public IP address -- therefore each node has one NIC with two addresses, and these addresses are in different subnets. E.g.:
blaschke@nid200257:~> ip -4 -f inet addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: nmn0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
altname enp195s0
inet 10.100.108.32/22 brd 10.100.111.255 scope global nmn0
valid_lft forever preferred_lft forever
3: hsn0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp194s0
inet 10.249.41.248/16 brd 10.249.255.255 scope global hsn0
valid_lft forever preferred_lft forever
inet 128.55.84.128/19 brd 128.55.95.255 scope global hsn0:chn
valid_lft forever preferred_lft forever
4: hsn1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp129s0
inet 10.249.41.232/16 brd 10.249.255.255 scope global hsn1
valid_lft forever preferred_lft forever
5: hsn2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp66s0
inet 10.249.41.231/16 brd 10.249.255.255 scope global hsn2
valid_lft forever preferred_lft forever
6: hsn3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp1s0
altname ens3
inet 10.249.41.247/16 brd 10.249.255.255 scope global hsn3
valid_lft forever preferred_lft forever
(the example above also shows the node management network nmn0 interface -- but MPI shouldn't be talking to that anyway)
I think the error must be caused by hsn0's two IP addresses on two subnets.
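For illustration, here is a minimal standalone sketch (my own, not OMPI code) of what any program that enumerates IPv4 interfaces with getifaddrs() sees on such a node: hsn0 shows up twice, once per address, so whatever records "the" address of an interface has to pick between the two entries.

```c
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    struct ifaddrs *ifap, *ifa;
    if (getifaddrs(&ifap) != 0) {
        perror("getifaddrs");
        return 1;
    }
    /* Each IPv4 address is its own entry, so an interface like hsn0 with
     * both 10.249.41.248/16 and 128.55.84.128/19 is listed twice under
     * the same interface name. */
    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue;
        char buf[INET_ADDRSTRLEN];
        struct sockaddr_in *sin = (struct sockaddr_in *) ifa->ifa_addr;
        inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf));
        printf("%-6s %s\n", ifa->ifa_name, buf);
    }
    freeifaddrs(ifap);
    return 0;
}
```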
I'm a little confused about what you are trying to do.
Your description of the problem lists 6 private IP addresses and 1 public IP address:
- lo: 127.0.0.1/8
- nmn0: 10.100.108.32/22
- hsn0: 10.249.41.248/16 and 128.55.85.128/19
- hsn1: 10.249.41.232/16
- hsn2: 10.249.41.231/16
- hsn3: 10.249.41.247/16
A few questions:
- What interface / IP address do you want Open MPI to use?
- Why are all the hsnX interfaces on the same 10.249.x.y/16 subnet? That is... confusing, and may not be what you really want.
- How does the container networking come into play here? I.e., what are the interfaces and IP addresses of the container(s) involved? Overlay networks add another whole level of complexity if that networking is different than the host networking.
@jsquyres thanks for getting back to me
- What interface / IP address do you want Open MPI to use?
I am not familiar with the logic that OpenMPI's TCP BTL uses to bind to network interfaces. Since each node has one NIC/GPU and NUMA node, the ideal case would be for OpenMPI to use the "nearest" NIC with respect to the rank's NUMA domain. That implies an optimal choice of interface for each rank. It doesn't matter what address it uses.
Why are all the hsnX interfaces on the same 10.249.x.y/16 subnet? That is... confusing, and may not be what you really want.
I don't see why we would want 4 subnets. We have 4 NICs per node (and therefore 4 hsnX interfaces) because there are 4 GPUs per node. This way we can do things like GPU-Direct MPI traffic between GPUs on different nodes without traffic from different GPUs interfering. This is not helped by NICs being on separate subnets.
More broadly speaking: you can think of Perlmutter's high-speed network as a single private network, and we've given each node a "line to the outside world" by piggy-backing off hsn0's NIC and giving it a public IP.
Based on the error message:
Local host: nid002292
Local PID: 1273838
Peer hostname: nid002293 ([[9279,0],1])
Source IP of socket: 10.249.13.210
Known IPs of peer:
10.100.20.22
128.55.69.127
10.249.13.209
10.249.36.5
10.249.34.5
I think what's happening is that OpenMPI is unaware that 10.249.13.210 is also a valid IP address for nid002293. This is speculation now: the fact that I see a 128.55.X.X address in this list (and not a 10.249.X.X one in its place) suggests that OpenMPI recorded the 128.55.X.X address as "the" address of hsn0. Yet whichever rank on nid002293 tried to initiate the connection must have used the 10.249.X.X address. I don't know the logic that OpenMPI uses, but maybe it's prioritizing different networks when collecting valid IPs compared to when it is initiating connections.
Perhaps the best solution is to record all IPs (even if they belong to different subnets)?
- How does the container networking come into play here? I.e., what are the interfaces and IP addresses of the container(s) involved? Overlay networks add another whole level of complexity if that networking is different than the host networking.
There is no overlay network here -- this is not Kubernetes. We're running Podman-HPC (https://github.com/NERSC/podman-hpc) using Slurm. The container sees the host network.
Your description of the problem lists 6 private IP addresses and 1 public IP address
Oh, one more thing: of the interfaces listed in my example, OpenMPI definitely should not use nmn0 -- that network is for node management only (communicating over it will be slow, and can cause problems for other users and the sysadmins).
Since each node has one NIC/GPU and NUMA node, the ideal case would be for OpenMPI to use the "nearest" NIC with respect to the rank's NUMA domain.
I don't know how to parse this sentence; it seems to contradict itself. The first part of the sentence says that there's one NIC/GPU and NUMA node, but then the second part of the sentence implies that the Open MPI process should use the nearest NIC according to its NUMA domain.
Later in the text, you specifically mention that there are 4 NICs and 4 GPUs; I infer that this means that there are 4 NUMA domains, too.
I don't see why we would want 4 subnets.
I think you should check into how the Linux kernel handles IP traffic from multiple interfaces that are all on the same subnet. It's been a little while since I've poked around in that area, but it used to be true that all outgoing traffic to that subnet would go out the "first" interface on that subnet. I.e., all your outgoing traffic -- for that subnet -- would go out a single NIC. Perhaps the Linux kernel IP stack has gotten smarter about this over time, but I think this kind of use case is not (or at least: has not historically been) well represented in the Linux kernel IP space. Multiple interfaces on the same subnet were more intended to be used for bonding and the like, not NUMA-friendly transfers.
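If it helps, here is a small standalone sketch (not OMPI code; the destination addresses are placeholders taken from the report) that asks the kernel which local source address it would pick for a given destination, by connect()ing an unbound UDP socket and reading the result back with getsockname(). No packets are sent; it only exercises the kernel's route/source-address selection.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Print the local source address the kernel would select for 'dst'. */
static void print_source_for(const char *dst)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer, local;
    socklen_t len = sizeof(local);

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(9); /* any port will do; nothing is sent */
    inet_pton(AF_INET, dst, &peer.sin_addr);

    if (connect(fd, (struct sockaddr *) &peer, sizeof(peer)) == 0 &&
        getsockname(fd, (struct sockaddr *) &local, &len) == 0) {
        char buf[INET_ADDRSTRLEN];
        inet_ntop(AF_INET, &local.sin_addr, buf, sizeof(buf));
        printf("to %-15s the kernel picks source %s\n", dst, buf);
    }
    close(fd);
}

int main(void)
{
    /* Placeholder peer addresses on the two subnets discussed above. */
    print_source_for("10.249.13.209");
    print_source_for("128.55.69.127");
    return 0;
}
```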
Also, this begs the larger question: why are you using TCP? Assuming you're using NVIDIA NICs and GPUs, shouldn't you be using UCX? UCX will handle all the interface selection and pinning, etc. It will also handle all the RDMA and GPUDirect stuff (which TCP won't).
UCX will handle all the interface selection and pinning, etc.
Except for the OOB subsystem, of course.
Later in the text, you specifically mention that there are 4 NICs and 4 GPUs; I infer that this means that there are 4 NUMA domains, too.
Oops, my response got mangled in edits. Apologies. Each node on Perlmutter has 1 CPU, 4 GPUs, and 4 NICs. Each NIC/GPU pair is on its own NUMA node. (https://docs.nersc.gov/systems/perlmutter/architecture/) So yes, you have 4 NUMA domains, each with their own GPU and NIC.
Here's lstopo:
I think you should check into how the Linux kernel handles IP traffic from multiple interfaces that are all on the same subnet. It's been a little while since I've poked around in that area, but it used to be true that all outgoing traffic to that subnet would go out the "first" interface on that subnet. I.e., all your outgoing traffic -- for that subnet -- would go out a single NIC. Perhaps the Linux kernel IP stack has gotten smarter about this over time, but I think this kind of use case is not (or at least: has not historically been) well represented in the Linux kernel IP space. Multiple interfaces on the same subnet were more intended to be used for bonding and the like, not NUMA-friendly transfers.
Clearly, you can bind to multiple interfaces, and send traffic via those ... I've done that in plenty of applications. So I suspect that the kernel's IP stack has gotten smarter.
Anyway, you might be right. But HPE won't change the network architecture of their flagship high-speed network over this. So I think speculating over how to arrange interfaces and subnets is moot -- seeing that we won't be able to change that.
Assuming you're using NVIDIA NICs and GPUs, shouldn't you be using UCX? UCX will handle all the interface selection and pinning, etc. It will also handle all the RDMA and GPUDirect stuff (which TCP won't).
It's Nvidia GPUs and HPE NICs -- again, here's the link: https://docs.nersc.gov/systems/perlmutter/architecture/. Also, since HPE bought Cray, I don't know how well supported UCX is right now with respect to Cassini NICs. (FTR: I like UCX, and would prefer to use it too.)
HPE provides libfabric, built with their CXI provider (https://docs.open-mpi.org/en/main/tuning-apps/networking/ofi.html#what-are-the-libfabric-ofi-components-in-open-mpi), and their own MPI implementation.
That's all the official support from the vendor that I am aware of...
Also, this begs the larger question: why are you using TCP?
Ah! Fair point. We are straying very far from the original point of this issue though. But for the sake of completeness, I will round out this discussion:
My goal is to help users be productive on our systems. Almost always that means minimizing the time it takes to have meaningful scientific data. Depending on the user's problem that can involve everything from performance tuning at scale, or just getting a pre-built executable to work at all.
So, if raw performance was my goal, right now I would be using the vendor's recommended transport library (which in this case would be Cray MPICH) -- even if I'm partial to OpenMPI 😉 Often though, some users do not have the wherewithal to compile a large application using a system-specific toolchain.
In an ideal world, ABIs and standards would be mature enough that things like CXI could be resolved dynamically, and MPI would "just work" -- in that case, the vendor could just provide something like an OFI provider, we would provide sensible configurations, and an application built against any MPI implementation would pick those up.
For now, that reality isn't here (yet). In its absence, TCP has established itself as the least common denominator that usually "just works". For many users, that is enough (or at least the performance gains are not worth the effort of rebuilding their applications).
We are working with the CP2K developers, as well as Nvidia, to build a CP2K container that uses HPE CXI. I am happy to discuss this -- and other HPE-related topics -- further. Eventually though, I would like to return to the original point of this issue: that OpenMPI's TCP-based BTL is not detecting all interface addresses.
Yeah, Slingshot NICs mean you cannot use UCX -- you must use libfabric. However, I believe OMPI does support Slingshot operations - in fact, last I checked OMPI v5 is running on Perlmutter.
I'm curious, though - is mpirun finding the right interface for its OOB subsystem? Or is that also a problem?
However, I believe OMPI does support Slingshot operations - in fact, last I checked OMPI v5 is running on Perlmutter.
Correct! But that requires OpenMPI to be built with libfabric support (and AFAIK, it needs to be the right version of libfabric -- I might be wrong though). These have to be configured at compile time, and are missing from the Nvidia container. We are working with Nvidia (who built https://catalog.ngc.nvidia.com/orgs/hpc/containers/cp2k ) to see if they can update their image (and upgrade to OMPI v5 while they're at it).
I'm curious, though - is mpirun finding the right interface for its OOB subsystem? Or is that also a problem?
We're running the application with Slurm and PMI2: srun --mpi pmi2 .... My knowledge is limited in this area; I always assumed that this means OMPI is using PMI2 to wire up the comms, and is not using its OOB subsystem?
We're running the application with Slurm and PMI2: srun --mpi pmi2 .... My knowledge is limited in this area; I always assumed that this means OMPI is using PMI2 to wire up the comms, and is not using its OOB subsystem?
Correct. However, you have to use PMIx with OMPI v5 (no pmi2 support any more).
see if they can update their image (and upgrade to OMPI v5 while they're at it)
Those folks at NVIDIA are rather fixated on UCX, so that could be a problem convincing them. Truly wish you luck on it! 😄
Correct. However, you have to use PMIx with OMPI v5 (no pmi2 support any more).
Good point, another reason to bump the container's OMPI version up to v5
Those folks at NVIDIA are rather fixated on UCX, so that could be a problem convincing them. Truly wish you luck on it! 😄
Thanks!!! Vendors who each prefer their favorite solutions?! Who would have thought? 😉 Seriously though, I've always found some folks that will listen, so this is not a lost cause 🤞
Occurs to me: @hppritcha Would it make sense to post a Docker container with OMPI v5 built against libfabric to the Docker registry? Not sure what else is in the NVIDIA offering, but might help people get around the UCX-only issue.
That would be much appreciated.
@JBlaschke the IPs you posted do not match the error message, so this is not ideal for troubleshooting.
Can you confirm you start an Open MPI 4 application with srun (instead of mpirun)?
Does the error appear with only two ranks on two nodes?
If so, could you please post the output of ip -4 -f inet addr show on these two nodes along with the output of srun ...?
Do you pass some btl/tcp settings either with the environment variable (e.g. OMPI_MCA_btl_tcp_xyz=abc) or the openmpi-mca-params.conf file (e.g. btl_tcp_xyz = abc)?
@rhc54 thanks for the Docker container suggestion, but it isn't a substitute for understanding and hopefully addressing this TCP issue. @JBlaschke do you know if this is reproducible outside of the container? I will try to reproduce myself on PM, but outside of a container.
I guess this will wait as PM is undergoing maintenance today.
I guess this will wait as PM is undergoing maintenance today.
@hppritcha Our test system is up -- it has the same hardware configuration as Perlmutter. I'll run @ggouaillardet 's test now and post it here
Do you know if this is reproducible outside of the container? I will try to reproduce myself on PM, but outside of a container.
@hppritcha I don't know -- I haven't built CP2K. Last time I looked at the build script, it was rather involved. If you have a working build (or even if you have a way to get one quickly), then maybe it might be best for you to test that. Were you thinking of using the openmpi/5.0.0 module? How do you configure a pre-installed OMPI to use tcp (even if libfabric is available)? My test script (that uses shifter, but you can easily adapt that to your own build) is here: /global/cfs/cdirs/nstaff/servicenow/INC0214313
@JBlaschke i pinged you on NERSC slack for some info.
Can you confirm you start an Open MPI 4 application with srun (instead of mpirun)? Does the error appear with only two ranks on two nodes?
@ggouaillardet Confirming that this is an OMPI v4 application, with srun (not mpirun). For reference, I posted the Slurm jobscript below. I have tried the following: 2 ranks on a single node; 2 ranks on 2 nodes. I have tried nothing else. When running 2 ranks on a single node, I encounter a segfault in UCX -- but that might be because the CP2K container can't run in that way (I am not a CP2K expert).
The IPs you posted do not match the error message, so this is not ideal for troubleshooting. [...] could you please post the output of ip -4 -f inet addr show on these two nodes along with the output of srun ...?
@ggouaillardet Good point, here is the output of ip -4 -f inet addr show for each node, prepended to the OMPI error (see the Slurm script below for the precise commands):
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: nmn0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
altname enp195s0
inet 10.100.0.13/22 brd 10.100.3.255 scope global nmn0
valid_lft forever preferred_lft forever
3: hsn0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp194s0
inet 10.250.0.153/16 brd 10.250.255.255 scope global hsn0
valid_lft forever preferred_lft forever
inet 128.55.173.29/24 brd 128.55.173.255 scope global hsn0:chn
valid_lft forever preferred_lft forever
4: hsn1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp129s0
inet 10.250.0.154/16 brd 10.250.255.255 scope global hsn1
valid_lft forever preferred_lft forever
5: hsn2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp66s0
inet 10.250.1.119/16 brd 10.250.255.255 scope global hsn2
valid_lft forever preferred_lft forever
6: hsn3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp1s0
altname ens3
inet 10.250.1.128/16 brd 10.250.255.255 scope global hsn3
valid_lft forever preferred_lft forever
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: nmn0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
altname enp195s0
inet 10.100.0.36/22 brd 10.100.3.255 scope global nmn0
valid_lft forever preferred_lft forever
3: hsn0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp194s0
inet 10.250.0.39/16 brd 10.250.255.255 scope global hsn0
valid_lft forever preferred_lft forever
inet 128.55.173.30/24 brd 128.55.173.255 scope global hsn0:chn
valid_lft forever preferred_lft forever
4: hsn1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp129s0
inet 10.250.0.40/16 brd 10.250.255.255 scope global hsn1
valid_lft forever preferred_lft forever
5: hsn2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp66s0
inet 10.250.1.102/16 brd 10.250.255.255 scope global hsn2
valid_lft forever preferred_lft forever
6: hsn3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp1s0
altname ens3
inet 10.250.1.111/16 brd 10.250.255.255 scope global hsn3
valid_lft forever preferred_lft forever
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: nid001033
Local PID: 2203573
Peer hostname: nid001036 ([[55550,1],1])
Source IP of socket: 10.250.0.39
Known IPs of peer:
10.100.0.36
128.55.173.30
10.250.0.40
10.250.1.102
10.250.1.111
--------------------------------------------------------------------------
It looks like this interface is the problem:
3: hsn0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 10000
altname enp194s0
inet 10.250.0.39/16 brd 10.250.255.255 scope global hsn0
valid_lft forever preferred_lft forever
inet 128.55.173.30/24 brd 128.55.173.255 scope global hsn0:chn
valid_lft forever preferred_lft forever
Do you pass some btl/tcp settings either with the environment variable (e.g. OMPI_MCA_btl_tcp_xyz=abc) or the openmpi-mca-params.conf file (e.g. btl_tcp_xyz = abc)?
@ggouaillardet I do not set any OMPI_MCA_btl_tcp* env vars. This is the full jobscript:
#!/usr/bin/env bash
#!/bin/bash
#SBATCH --image docker:nvcr.io/hpc/cp2k:v2023.1
#SBATCH --nodes 2
#SBATCH --cpus-per-task 128
#SBATCH --gpus-per-task 4
#SBATCH --ntasks-per-node 1
#SBATCH --constraint gpu
#SBATCH --qos debug
#SBATCH -t 00:20:00
#SBATCH -A nstaff
#SBATCH -J cp2k
export OMP_NUM_THREADS=1
srun -n 2 ip -4 -f inet addr show
srun -n 2 --cpu-bind cores --mpi pmi2 shifter --module gpu --entrypoint cp2k -i initial_2.inp -o initial_13.out
I suspect the problem is that one proc uses the 10.250.0.39 interface, and the other proc uses the 128.55 address - there is no reason for either of them to prefer one address over the other. What happens if you put OMPI_MCA_btl_tcp_if_include=10.250.0.0/24 in the environment?
EDIT: fixed the environment variable name
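For reference, here is a minimal sketch (my own, not Open MPI's actual interface-matching code) of how a CIDR spec such as the 10.250.0.0/24 suggested above matches candidate addresses; the addresses below are taken from the ip output posted earlier.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>

/* Return 1 if 'addr' falls inside the IPv4 CIDR block 'net'/'prefix'. */
static int in_cidr(const char *addr, const char *net, int prefix)
{
    struct in_addr a, n;
    inet_pton(AF_INET, addr, &a);
    inet_pton(AF_INET, net, &n);
    uint32_t mask = prefix ? htonl(~0u << (32 - prefix)) : 0;
    return (a.s_addr & mask) == (n.s_addr & mask);
}

int main(void)
{
    /* Addresses from the 'ip addr' output above; the spec is the one
     * suggested in the comment this illustrates. */
    const char *addrs[] = { "10.250.0.39", "10.250.0.40",
                            "10.250.1.102", "128.55.173.30" };
    for (int i = 0; i < 4; i++)
        printf("%-14s in 10.250.0.0/24 ? %s\n", addrs[i],
               in_cidr(addrs[i], "10.250.0.0", 24) ? "yes" : "no");
    return 0;
}
```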
If this does not work, you might also want to try to put this instead in your environment (and double check this is passed to the MPI tasks by srun)
OMPI_MCA_btl_tcp_if_exclude=hsn0
Huh! This is funny:
- @rhc54 setting export OMPI_MCA_btl_tcp_if_include=10.249.0.0/16 (the private IP range on PM -- what I posted earlier is the range on Muller, which I was using while PM was undergoing maintenance), I get the same error:
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: nid001257
Local PID: 610010
Peer hostname: nid001256 ([[59828,1],0])
Source IP of socket: 10.249.5.171
Known IPs of peer:
128.55.65.113
10.249.5.172
10.249.25.201
10.249.23.201
But the job runs (producing an output file that has reasonable-looking contents). I cannot tell if it runs to completion, or if it deadlocks before then.
- @ggouaillardet setting OMPI_MCA_btl_tcp_if_exclude=hsn0, I get a different error:
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: nid001608
PID: 874754
and the program does not produce any output.
In both cases the program runs until the wallclock limit, so something is deadlocked. It's possible that the OMPI_MCA_btl_tcp_if_include approach deadlocks during finalization only (i.e. the program technically runs).
I saw a few suspicious things. As pointed out by Ralph, Open MPI is supposed to use only one IP per physical interface, so that kind of error can occur if one node picks hsn0 (in 10.0.0.0/8) and the other one picks hsn0:chn (in 128.0.0.0/8).
First, just set OMPI_MCA_btl_base_verbose=100 in your environment and srun.
If you built Open MPI with --enable-debug, we should get some logs.
Then I suggest you apply the inline patch below and rebuild with --enable-debug.
Then you can set OMPI_MCA_btl_tcp_if_exclude=hsn0:chn in your environment and srun ...
If it does not work, also set OMPI_MCA_btl_base_verbose=100 and share the logs.
diff --git a/opal/mca/btl/tcp/btl_tcp.h b/opal/mca/btl/tcp/btl_tcp.h
index 846ee3b..acb4af6 100644
--- a/opal/mca/btl/tcp/btl_tcp.h
+++ b/opal/mca/btl/tcp/btl_tcp.h
@@ -172,6 +172,7 @@ struct mca_btl_tcp_module_t {
struct sockaddr_storage tcp_ifaddr_6; /**< First IPv6 address discovered for this interface, bound as sending address for this BTL */
#endif
uint32_t tcp_ifmask; /**< BTL interface netmask */
+ char tcp_ifname[32];
opal_mutex_t tcp_endpoints_mutex;
opal_list_t tcp_endpoints;
diff --git a/opal/mca/btl/tcp/btl_tcp_component.c b/opal/mca/btl/tcp/btl_tcp_component.c
index 78dee89..012c025 100644
--- a/opal/mca/btl/tcp/btl_tcp_component.c
+++ b/opal/mca/btl/tcp/btl_tcp_component.c
@@ -505,6 +505,7 @@ static int mca_btl_tcp_create(int if_kindex, const char* if_name)
/* initialize the btl */
btl->tcp_ifkindex = (uint16_t) if_kindex;
+ strcpy(btl->tcp_ifname, if_name);
#if MCA_BTL_TCP_STATISTICS
btl->tcp_bytes_recv = 0;
btl->tcp_bytes_sent = 0;
@@ -512,7 +513,7 @@ static int mca_btl_tcp_create(int if_kindex, const char* if_name)
#endif
struct sockaddr_storage addr;
- opal_ifkindextoaddr(if_kindex, (struct sockaddr*) &addr,
+ opal_ifnametoaddr(if_name, (struct sockaddr*) &addr,
sizeof (struct sockaddr_storage));
#if OPAL_ENABLE_IPV6
if (addr.ss_family == AF_INET6) {
@@ -816,6 +817,10 @@ static int mca_btl_tcp_component_create_instances(void)
}
/* if this interface was not found in the excluded list, create a BTL */
if(argv == 0 || *argv == 0) {
+
+ opal_output_verbose(30, opal_btl_base_framework.framework_output,
+ "btl:tcp: Creating instance with interface %d %s",
+ if_index, if_name);
mca_btl_tcp_create(if_index, if_name);
}
}
@@ -1175,6 +1180,9 @@ static int mca_btl_tcp_component_exchange(void)
}
opal_ifindextoname(index, ifn, sizeof(ifn));
+ if (0 != strcmp(ifn, mca_btl_tcp_component.tcp_btls[i]->tcp_ifname)) {
+ continue;
+ }
opal_output_verbose(30, opal_btl_base_framework.framework_output,
"btl:tcp: examining interface %s", ifn);
if (OPAL_SUCCESS !=
@@ -1218,7 +1226,7 @@ static int mca_btl_tcp_component_exchange(void)
opal_ifindextokindex (index);
current_addr++;
opal_output_verbose(30, opal_btl_base_framework.framework_output,
- "btl:tcp: using ipv6 interface %s", ifn);
+ "btl:tcp: using ipv6 interface %s with address %s and ifkindex %d", ifn, opal_net_get_hostname((struct sockaddr*)&my_ss), addrs[current_addr].addr_ifkindex);
}
} /* end of for opal_ifbegin() */
} /* end of for tcp_num_btls */
diff --git a/opal/mca/btl/tcp/btl_tcp_endpoint.c b/opal/mca/btl/tcp/btl_tcp_endpoint.c
index e69cd86..b1d52dd 100644
--- a/opal/mca/btl/tcp/btl_tcp_endpoint.c
+++ b/opal/mca/btl/tcp/btl_tcp_endpoint.c
@@ -752,8 +752,9 @@ static int mca_btl_tcp_endpoint_start_connect(mca_btl_base_endpoint_t* btl_endpo
}
#endif
opal_output_verbose(10, opal_btl_base_framework.framework_output,
- "btl: tcp: attempting to connect() to %s address %s on port %d",
+ "btl: tcp: attempting to connect() to %s from %s address %s on port %d",
OPAL_NAME_PRINT(btl_endpoint->endpoint_proc->proc_opal->proc_name),
+ opal_net_get_hostname((struct sockaddr*) &btl_endpoint->endpoint_btl->tcp_ifaddr),
opal_net_get_hostname((struct sockaddr*) &endpoint_addr),
ntohs(btl_endpoint->endpoint_addr->addr_port));
diff --git a/opal/mca/btl/tcp/btl_tcp_proc.c b/opal/mca/btl/tcp/btl_tcp_proc.c
index c7ee66b..952a327 100644
--- a/opal/mca/btl/tcp/btl_tcp_proc.c
+++ b/opal/mca/btl/tcp/btl_tcp_proc.c
@@ -335,6 +335,9 @@ static mca_btl_tcp_interface_t** mca_btl_tcp_retrieve_local_interfaces(mca_btl_t
}
if (true == skip) {
/* This interface is not part of the requested set, so skip it */
+ opal_output_verbose(20, opal_btl_base_framework.framework_output,
+ "btl:tcp: skipping local interface %s",
+ local_if_name);
continue;
}
@@ -344,6 +347,9 @@ static mca_btl_tcp_interface_t** mca_btl_tcp_retrieve_local_interfaces(mca_btl_t
/* create entry for this kernel index previously not seen */
if (OPAL_SUCCESS != rc) {
index = proc_data->num_local_interfaces++;
+ opal_output_verbose(20, opal_btl_base_framework.framework_output,
+ "btl:tcp: adding local interface %d/%d %s with kindex %d",
+ index, proc_data->num_local_interfaces, local_if_name, kindex);
opal_hash_table_set_value_uint32(&proc_data->local_kindex_to_index, kindex, (void*)(uintptr_t) index);
if( proc_data->num_local_interfaces == proc_data->max_local_interfaces ) {
@@ -356,6 +362,10 @@ static mca_btl_tcp_interface_t** mca_btl_tcp_retrieve_local_interfaces(mca_btl_t
proc_data->local_interfaces[index] = (mca_btl_tcp_interface_t *) malloc(sizeof(mca_btl_tcp_interface_t));
assert(NULL != proc_data->local_interfaces[index]);
mca_btl_tcp_initialise_interface(proc_data->local_interfaces[index], kindex, index);
+ } else {
+ opal_output_verbose(20, opal_btl_base_framework.framework_output,
+ "btl:tcp: already added local interface %s with kindex %d",
+ local_if_name, kindex);
}
local_interface = proc_data->local_interfaces[index];
@@ -551,6 +561,8 @@ int mca_btl_tcp_proc_insert( mca_btl_tcp_proc_t* btl_proc,
for( i = 0; i < proc_data->num_local_interfaces; ++i ) {
mca_btl_tcp_interface_t* local_interface = proc_data->local_interfaces[i];
for( j = 0; j < proc_data->num_peer_interfaces; ++j ) {
+ opal_output_verbose(20, opal_btl_base_framework.framework_output,
+ "btl:tcp: evaluating path from %d/%d to %d/%d", i, proc_data->num_local_interfaces, j, proc_data->num_peer_interfaces);
/* initially, assume no connection is possible */
proc_data->weights[i][j] = CQ_NO_CONNECTION;
I think the issue is totally different, and potentially not in OMPI. We need to split the discussion in two: modex exchange and first handshake.
- modex exchange: once OMPI excludes an interface via mca_btl_tcp_if_exclude (hsn0, as in the exchange above), no OMPI process will publish in its modex an IP that matches that interface. However, they might choose to use hsn0:chn instead.
- handshake: an OMPI process will contact another process using all the available IPs (as needed). We create a local socket, bind it to one of the IPs published in the modex, and then send a connect request to a peer. In your example, an OMPI process will bind the outgoing end of its socket to hsn0:chn and connect to an IP on hsn0:chn of the peer. However, based on your logs, the peer receives the connection on an IP corresponding to hsn0 (and not hsn0:chn as expected).
So, either we screwed up the connection code and bind the socket to the wrong IP, or the kernel does some tricks and uses the first IP on the interface when sending connection requests. Let's check that out.
diff --git a/opal/mca/btl/tcp/btl_tcp_endpoint.c b/opal/mca/btl/tcp/btl_tcp_endpoint.c
index 28138a6b43..298517307f 100644
--- a/opal/mca/btl/tcp/btl_tcp_endpoint.c
+++ b/opal/mca/btl/tcp/btl_tcp_endpoint.c
@@ -791,6 +791,14 @@ static int mca_btl_tcp_endpoint_start_connect(mca_btl_base_endpoint_t *btl_endpo
CLOSE_THE_SOCKET(btl_endpoint->endpoint_sd);
return OPAL_ERROR;
}
+ char tmp[2][16];
+ inet_ntop(AF_INET, &((struct sockaddr_in *)&btl_endpoint->endpoint_btl->tcp_ifaddr)->sin_addr, tmp[0], 16);
+ inet_ntop(AF_INET, &((struct sockaddr_in *)&endpoint_addr)->sin_addr, tmp[1], 16);
+ opal_output(0, "proc %s bind socket to %s:%d before connecting to peer %s at %s:%d\n",
+ OPAL_NAME_PRINT(OPAL_PROC_MY_NAME),
+ tmp[0], htons(((struct sockaddr_in *) &btl_endpoint->endpoint_btl->tcp_ifaddr)->sin_port),
+ OPAL_NAME_PRINT(btl_endpoint->endpoint_proc->proc_opal->proc_name),
+ tmp[1], ntohs(((struct sockaddr_in *) &endpoint_addr)->sin_port));
}
#if OPAL_ENABLE_IPV6
if (endpoint_addr.ss_family == AF_INET6) {
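For context, outside of OMPI the bind-before-connect pattern that the patch above instruments looks roughly like this (a standalone sketch with placeholder addresses and port, not the BTL code): bind the outgoing socket to the address we advertised, then connect, so the accepting peer should observe that address as the source IP.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sd = socket(AF_INET, SOCK_STREAM, 0);

    /* Bind the outgoing socket to the address we published (placeholder
     * value); port 0 lets the kernel pick an ephemeral port. */
    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_port = htons(0);
    inet_pton(AF_INET, "128.55.173.29", &local.sin_addr);
    if (bind(sd, (struct sockaddr *) &local, sizeof(local)) != 0) {
        perror("bind");
        return 1;
    }

    /* Connect to the peer's advertised address (placeholder value).
     * Because the socket is already bound, the peer's accept() side
     * should see 128.55.173.29 as the source IP, not whatever the
     * kernel's default source selection would have chosen. */
    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(12345);
    inet_pton(AF_INET, "128.55.173.30", &peer.sin_addr);
    if (connect(sd, (struct sockaddr *) &peer, sizeof(peer)) != 0) {
        perror("connect");
        return 1;
    }

    close(sd);
    return 0;
}
```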
Sorry for replying late, it's been a busy week.
@JBlaschke ACK on all your points. Thanks for all the detail!
On the original report, I'm still a little confused -- and I think @bosilca is asking the right questions here:
Local host: nid002292
Local PID: 1273838
Peer hostname: nid002293 ([[9279,0],1])
Source IP of socket: 10.249.13.210
Known IPs of peer:
10.100.20.22
128.55.69.127
10.249.13.209
10.249.36.5
10.249.34.5
Is 10.249.13.210 a known IP address on the peer (nid002293)? If so, I'm curious as to why Open MPI doesn't report that as one of the valid IPs for that peer.
@ggouaillardet has a good suggestion: build with --enable-debug and run with OMPI_MCA_btl_base_verbose=100 (and his patch). That will give a bunch of good interface matching information and shed light on Open MPI TCP BTL's decisions.
Is 10.249.13.210 a known IP address on the peer (nid002293)? If so, I'm curious as to why Open MPI doesn't report that as one of the valid IPs for that peer.
Because we only keep/report one IP per iface. Multiple IPs on the same iface will only confuse the communication balancer, and lead to non-optimal communication scheduling.
@bosilca Ah, ok. So should @JBlaschke run with --mca btl_tcp_if_exclude 128.55.85.128/19?
He should run as in the last reported test, with btl_tcp_if_exclude=hsn0 and with nothing. I think the root cause of all the issues reported here is the same: an IP mismatch between sender and receiver. Depending on the outcome of these tests, we can:
- if we are doing the incorrect binding on the sender side: fix it to make sure we bind to the reported IP
- if we bind correctly but the peer receives a different subnet address (bound to the same interface), then we will need to exchange all IPs for each interface but restrict the module creation to only use one per physical interface (a rough sketch of this idea follows below)
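To make the second option a bit more concrete, here is a rough standalone sketch (made-up names, not OMPI code) of the idea: keep every address a peer publishes, grouped by physical interface (kernel index), so the acceptor can match an incoming source IP against the full list while module creation still keys off one entry per physical interface.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>

#define MAX_ADDRS_PER_IF 4

/* One entry per physical interface: the kernel index plus every address
 * published for it (e.g. both the 10.250.x and 128.55.x addresses of the
 * peer's hsn0 in the report above). */
struct peer_if {
    int            kindex;
    int            num_addrs;
    struct in_addr addrs[MAX_ADDRS_PER_IF];
};

/* Return the slot whose address list contains 'src', or -1. The acceptor
 * would use this instead of comparing against a single "representative"
 * address per interface. */
static int match_source(const struct peer_if *ifs, int nifs, struct in_addr src)
{
    for (int i = 0; i < nifs; i++)
        for (int j = 0; j < ifs[i].num_addrs; j++)
            if (ifs[i].addrs[j].s_addr == src.s_addr)
                return i;
    return -1;
}

int main(void)
{
    /* The peer's hsn0, with both of its published addresses. */
    struct peer_if peer = { .kindex = 3, .num_addrs = 2 };
    inet_pton(AF_INET, "10.250.0.39", &peer.addrs[0]);
    inet_pton(AF_INET, "128.55.173.30", &peer.addrs[1]);

    struct in_addr src;
    inet_pton(AF_INET, "10.250.0.39", &src);
    printf("source matches peer interface slot %d\n",
           match_source(&peer, 1, src));
    return 0;
}
```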