
TCP connectivity problem in OpenMPI 4.1.4

gregfi opened this issue 2 years ago • 39 comments

Background information

What version of Open MPI are you using?

4.1.4

Describe how Open MPI was installed

Compiled from source

/openmpi-4.1.4/configure --with-tm=/local/xxxxxxxx/REQ0135770/torque-6.1.1/src --prefix=/tools/openmpi/4.1.4 --without-ucx --without-verbs --with-lsf=/tools/lsf/10.1 --with-lsf-libdir=/tools/lsf/10.1/linux3.10-glibc2.17-x86_64/lib

Please describe the system on which you are running

  • Operating system/version: SLES12-SP3
  • Computer hardware: Intel Xeon class
  • Network type: TCP over Infiniband

Details of the problem

When I try the ring test (ring_c.c) across multiple hosts, I get the following error:

--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: bl2609
  PID:        26165
  Message:    connect() to 9.9.11.33:1048 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------

When I try the same test using OpenMPI 3.1.0, it works without issue. How can I identify and work around the problem?

gregfi avatar Aug 30 '22 20:08 gregfi

Hello. Thanks for submitting an issue.

I'd be curious to see your mpirun command line. I usually use something like mpirun -host <host1>:4,<host2>:4 a.out to run 4 ranks on each node. Of course if you're inside of an LSF or Torque allocation, it may auto-detect your job's allocation and launch that way.

NOTE: your configure option --with-lsf-libdir=/tools/lsf/10.1/lin, I would guess should end in lib instead of lin?

gpaulsen avatar Aug 31 '22 00:08 gpaulsen

Sorry - the configure line got chopped off; I edited the post above to correct it.

Yes, I'm submitting via LSF, so my mpirun line looks something like:

bsub -n 32 -I mpirun /path/to/ring_c

gregfi avatar Aug 31 '22 00:08 gregfi

Is the IP address that it tried to connect to correct (9.9.11.33)?

Also, is there a reason you're using TCP over IB? That is known to be pretty slow compared to native IB protocols. I think early versions of TCP over IB had some reliability issues, too. You might just want to switch to building Open MPI with UCX and let Open MPI use the native IB protocols.

jsquyres avatar Aug 31 '22 17:08 jsquyres

I think the IP address is correct, but there are some connectivity problems. What's puzzling is that OpenMPI 3.1.0 works. Is there any way to see what interface is being used by mpirun?

Yes, UCX would be preferable, but SLES12 is fairly old at this point, and the version of librdmacm we have on the platform makes UCX fail at configure time, so my understanding is that it would fall back on TCP anyway. (That's why I disabled UCX in the build of OpenMPI.)

gregfi avatar Aug 31 '22 17:08 gregfi

UCX/old SLES: ah, got it. I assume the cost (e.g., in time/resources) to upgrade to a newer OS is prohibitive.

That being said, it might not be that hard to get new librdmacm + new UCX + Open MPI v4.x to work with UCX/native IB. E.g., install all of them into the same installation tree, and ensure that that installation tree appears first in your LD_LIBRARY_PATH (i.e., so that the new librdmacm will be found before the OS-installed librdmacm). Or better yet, fully uninstall all the OS packages needed for IB support and install a whole new IB stack in an alternate location (e.g., /opt/hpc or wherever Nvidia installs all of its stuff these days -- my point is to not install the libraries and whatnot under /usr/lib, or wherever your OS installs libraries by default). This would mean that there is zero confusion between the OS IB stack and a new/modern IB stack. Open MPI and UCX can definitely work in this kind of scenario, if you're interested in investigating it.
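To sketch what I mean (the /opt/hpc prefix is just a placeholder; use whatever alternate location you pick):

# install the new librdmacm, UCX, and Open MPI all under one alternate prefix,
# and make sure that prefix is searched before the OS locations
export PATH=/opt/hpc/bin:$PATH
export LD_LIBRARY_PATH=/opt/hpc/lib:$LD_LIBRARY_PATH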

One big disclaimer: I don't follow the SLES distro and the IB software stacks these days; I don't know if there's anything in the SLES 12 kernel, for example, that would explicitly prohibit using new librdmacm / new UCX. E.g., I don't know if you'll need new IB kernel drivers or not.

All that being said, let's talk TCP.

Yes, you can make the TCP BTL be very chatty about what it is doing. Set the MCA parameter btl_base_verbose to 100. For example, mpirun --mca btl_base_verbose 100 .... I don't know the exact syntax for this using bsub. This should make the TCP BTL tell you which IP interface(s) it is using, etc.
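For example, something like this might work with the bsub invocation you showed earlier (path is a placeholder):

bsub -n 32 -I mpirun --mca btl_base_verbose 100 /path/to/ring_c

Or, if passing extra mpirun arguments through bsub is awkward, any MCA parameter can also be set via an OMPI_MCA_-prefixed environment variable:

export OMPI_MCA_btl_base_verbose=100
bsub -n 32 -I mpirun /path/to/ring_c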

jsquyres avatar Aug 31 '22 19:08 jsquyres

Yes, unfortunately, upgrading the OS is a major undertaking and is not an option at this time.

I ran some additional tests with one of our parallel applications on a portion of our cluster that has been partitioned off for investigation of this issue. This portion does not seem to have the TCP connect() error, but it does exhibit another issue that I've seen with OpenMPI 4.1 versus 3.1: considerably more erratic performance.

These jobs all use 16 processes on systems that have 28 slots each, so there is relatively limited communication between hosts - many of the jobs should just be using vader. Here's the performance with OpenMPI 3.1.0:

Execution  time  on  16  processor(s):  9   min,  29.7  sec
Execution  time  on  16  processor(s):  6   min,  49.9  sec
Execution  time  on  16  processor(s):  7   min,  11.8  sec
Execution  time  on  16  processor(s):  7   min,  24.7  sec
Execution  time  on  16  processor(s):  7   min,  12.0  sec
Execution  time  on  16  processor(s):  10  min,  50.0  sec
Execution  time  on  16  processor(s):  7   min,  4.5   sec
Execution  time  on  16  processor(s):  7   min,  20.0  sec
Execution  time  on  16  processor(s):  6   min,  25.8  sec

Here's the same application compiled with OpenMPI 4.1.4:

Execution  time  on  16  processor(s):  15  min,  46.8  sec
Execution  time  on  16  processor(s):  25  min,  14.7  sec
Execution  time  on  16  processor(s):  25  min,  46.3  sec
Execution  time  on  16  processor(s):  13  min,  17.6  sec
Execution  time  on  16  processor(s):  18  min,  41.2  sec
Execution  time  on  16  processor(s):  45  min,  53.3  sec
Execution  time  on  16  processor(s):  20  min,  23.6  sec
Execution  time  on  16  processor(s):  21  min,  26.3  sec
Execution  time  on  16  processor(s):  20  min,  21.1  sec

I've attached outputs generated with --mca pml_base_verbose 100 --mca btl_base_verbose 100. Any idea where I should look to identify the problem here?

3.1.0_pml_btl_verbose.txt

4.1.4_pml_btl_verbose.txt

gregfi avatar Sep 02 '22 00:09 gregfi

Bump. Any thoughts on how to narrow down the problem?

gregfi avatar Sep 13 '22 00:09 gregfi

From the logs, Open MPI 3.1.0 uses both eth0 and ib0, but Open MPI 4.1.4 only uses eth0.

I suggest you try forcing ib0 and see how it goes

mpirun --mca btl_tcp_if_include ib0 ...

ggouaillardet avatar Sep 13 '22 04:09 ggouaillardet

@ggouaillardet is right. But I see that the v4.1.x log is also using the sppp interface -- I don't know what that is offhand.

In both versions of Open MPI, I'd suggest what @ggouaillardet suggested: force the use of ib0. Splitting network traffic over a much-slower eth0 and a much-faster ib0 can have weird performance effects.
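If adding mpirun options through bsub is a hassle, the equivalent environment variable should also work with both versions (app path is a placeholder):

export OMPI_MCA_btl_tcp_if_include=ib0
bsub -n 16 -I mpirun /path/to/app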

Have you tried uninstalling the OS IB stack and installing your own, per my prior comment?

jsquyres avatar Sep 13 '22 14:09 jsquyres

Forcing the use of ib0 with OpenMPI 4.1.4 does not seem to improve the performance.

Part of the dysfunction here may be differing versions of OFED installed on the build machine as compared to the rest of the cluster. (I'm asking the admins to look into it.) I thought 3.1.0 was using TCP over IB, but that seems not to be correct - I see openib being cited in the 3.1.0 verbose output. OpenMPI 3.1.0 may have been compiled at a time prior to this mismatch.

If I force 3.1.0 to use tcp, the performance deteriorates a little bit, but not nearly to the extent seen in 4.1.4. So there still seems to be something causing tcp performance to drag in 4.1.4.

gregfi avatar Sep 13 '22 21:09 gregfi

I just re-read your comments and see this:

These jobs all use 16 processes on systems that have 28 slots each, so there is relatively limited communication between hosts - many of the jobs should just be using vader. Here's the performance with OpenMPI 3.1.0:

Does this mean each run is on a single node, launching MPI processes on 16 out of 28 total cores?

jsquyres avatar Sep 13 '22 21:09 jsquyres

There are three nodes in this special testing queue - each with 28 slots, so 84 slots in total. I submitted ten 16-process jobs to the queue, so about half of them would run entirely within a single node and the other half would be split between nodes.

gregfi avatar Sep 14 '22 00:09 gregfi

Oh, that makes a huge difference.

If an MPI job is running entirely on a single node, it won't use TCP at all: it will use shared memory to communicate on-node. More generally, Open MPI processes will use shared memory (which is significantly faster than both TCP and native IB) to communicate with peers that are on the same node, and will use some kind of network to communicate with peers off-node.

So if your jobs end up having different numbers of on-node / off-node peers, that can certainly explain why there are variations in total execution times.

That being said, it doesn't explain why there are large differences between v3.x and v4.x. It would be good to get some apples-to-apples comparisons between v3.x and v4.x, though. Let's get the network out of the equation and only test shared memory as an MPI transport; that avoids any questions about IPoIB.

Can you get some timings of all-on-one-node runs with Open MPI v3.x and v4.x?
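Something along these lines should do it, assuming LSF's span[] resource requirement is available to force a single host (the binary path is a placeholder):

bsub -n 16 -R "span[hosts=1]" -I mpirun --mca btl vader,self /path/to/app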

jsquyres avatar Sep 14 '22 00:09 jsquyres

OK, the machines I was running on got wiped and re-inserted into the general population, and some other machines were swapped into my partition of the network. These new machines are running SLES12-SP5, and the OFED version mismatch issue was sorted out. I re-compiled OpenMPI 4.1.4, and openib seems to be working better... mostly. I still see some messages to the effect of:

[bl3402:01258] rdmacm CPC only supported when the first QP is a PP QP; skipped
[bl3402:01258] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped

I'm not sure what these mean or how catastrophic they are, but the jobs seem to run with --mca btl ^tcp when spanning multiple hosts, so the openib btl seems to be working in some capacity.

With --mca btl vader,self on 4.1.4, I get:

Execution  time  on  12  processor(s):  15  min,  44.9  sec
Execution  time  on  12  processor(s):  14  min,  35.3  sec
Execution  time  on  12  processor(s):  15  min,  31.1  sec
Execution  time  on  12  processor(s):  15  min,  1.0   sec
Execution  time  on  12  processor(s):  14  min,  41.0  sec
Execution  time  on  12  processor(s):  15  min,  26.4  sec
Execution  time  on  12  processor(s):  15  min,  29.7  sec
Execution  time  on  12  processor(s):  15  min,  27.5  sec
Execution  time  on  12  processor(s):  14  min,  42.8  sec

On Version 3.1.0, I get:

Execution  time  on  12  processor(s):  15  min,  35.2  sec
Execution  time  on  12  processor(s):  14  min,  31.4  sec
Execution  time  on  12  processor(s):  15  min,  27.8  sec
Execution  time  on  12  processor(s):  14  min,  28.2  sec
Execution  time  on  12  processor(s):  14  min,  59.4  sec
Execution  time  on  12  processor(s):  15  min,  39.5  sec
Execution  time  on  12  processor(s):  15  min,  30.8  sec
Execution  time  on  12  processor(s):  15  min,  16.6  sec
Execution  time  on  12  processor(s):  14  min,  45.6  sec

Practically equivalent performance. Interestingly, if I run --mca btl ^tcp, I see the same inconsistent performance, with some jobs running very slowly. However, on the last (slowest) job that runs, performance improves dramatically when the other MPI jobs finish. Here are the times (in seconds) for each computational iteration that I see on the last running job:

100.152
101.964
 99.710
101.042
102.910
102.894
102.817
102.995
102.481
102.479
 82.162
 35.575
 35.576
 35.578
 35.599
 35.600
 35.607

Does that suggest some kind of network configuration issue?

gregfi avatar Sep 20 '22 00:09 gregfi

Some clarifying questions:

  • With your shared memory tests, are you running with 12 dedicated cores on a single host (and no other MPI processes on the node at the same time)? If not, can you explain exactly how the jobs are run?
  • If you're able to run with openib, you should probably also be able to run with UCX. Have you tried that?
    • I keep asking about UCX because it is better supported than openib. Indeed, openib is disappearing in the upcoming Open MPI v5.0 -- the UCX PML will effectively be the only way to run on InfiniBand.
  • With your ^tcp tests, are you mixing multiple jobs on the same host at the same time? Your comment about "performance improves dramatically when the other MPI jobs finish" suggests that there might be some overloading occurring -- i.e., multiple MPI processes are being bound to the same core. You might want to run with mpirun --report-bindings to see exactly which core(s) each process is being bound to.
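For example (placeholder app path; keep whatever other options you're already using):

bsub -n 16 -I mpirun --report-bindings --mca btl ^tcp /path/to/app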

jsquyres avatar Sep 20 '22 13:09 jsquyres

  • Yes, the shared memory tests are running with 12 dedicated cores and no other simultaneous processes.
  • I've gotten the admins to install the UCX devel libraries, and I'm trying the configuration right now. It's an older version (1.4) that's distributed with the OS, but I'm hoping it can be made to work. (I see the warning about 1.8, but hopefully earlier versions are OK.)
  • Yes, with the ^tcp jobs, I'm running 16-process jobs on 12-slot hosts. So the division is host1,host2 = 12,4 or 8,8 depending on the machine. I will rerun with --report-bindings and post the results.

gregfi avatar Sep 20 '22 13:09 gregfi

FYI: You should be able to download and install a later version of UCX yourself (e.g., just install it under your $HOME, such as to $HOME/install/ucx or somesuch). It's a 100% userspace library; there's no special permissions needed. Then you can build Open MPI with ./configure --with-ucx=$HOME/install/ucx ....
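A rough sketch of what that might look like (the UCX version and install paths are placeholders; pick whatever release and locations you want):

# build UCX into your home directory
tar xzf ucx-<version>.tar.gz
cd ucx-<version>
./configure --prefix=$HOME/install/ucx
make -j && make install

# then rebuild Open MPI against it
cd /path/to/openmpi-4.1.4
./configure --with-ucx=$HOME/install/ucx --prefix=<your-prefix> ...
make -j && make install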

jsquyres avatar Sep 20 '22 13:09 jsquyres

Understood, but current UCX does not work with the version of librdmacm from the OS. In principle, I could install a newer version, but it would be far easier if the OS load set could be made to work.

Job #1, which is performing somewhat slowly, has:

[bl3403:19505] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3403:19505] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3403:19505] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3403:19505] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3403:19505] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3403:19505] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3403:19505] MCW rank 6 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3403:19505] MCW rank 7 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]
[bl3403:19505] MCW rank 8 bound to socket 1[core 8[hwt 0]]: [./././././.][././B/././.]
[bl3403:19505] MCW rank 9 bound to socket 1[core 9[hwt 0]]: [./././././.][./././B/./.]
[bl3403:19505] MCW rank 10 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
[bl3403:19505] MCW rank 11 bound to socket 1[core 11[hwt 0]]: [./././././.][./././././B]
[bl3402:18730] MCW rank 12 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3402:18730] MCW rank 13 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3402:18730] MCW rank 14 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3402:18730] MCW rank 15 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]

Job #2, which is performing very slowly, has:

[bl3402:18717] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3402:18717] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3402:18717] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3402:18717] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3402:18717] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3402:18717] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3402:18717] MCW rank 6 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3402:18717] MCW rank 7 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]
[bl3401:02154] MCW rank 8 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3401:02154] MCW rank 9 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3401:02154] MCW rank 10 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3401:02154] MCW rank 11 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3401:02154] MCW rank 12 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3401:02154] MCW rank 13 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3401:02154] MCW rank 14 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3401:02154] MCW rank 15 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]

Seems like there's overlap, no?

gregfi avatar Sep 20 '22 14:09 gregfi

I forgot about your librdmacm issue. Yes, you could install that manually, too -- it's also a 100% userspace library.

Yes, those 2 jobs definitely overlap -- that's why you're seeing dramatic slowdowns: multiple MPI processes are being bound to the same core, and therefore they're fighting for cycles.

At this point, I have to turn you back over to @gpaulsen because I don't know how Open MPI reads the LSF job info and decides which cores to use.

jsquyres avatar Sep 20 '22 14:09 jsquyres

If you are running multiple mpirun calls that are receiving the same allocation information, then they will overlap, as they don't know about each other. It sounds to me like either an error in your bsub command or a bug in the ORTE internal code that reads the resulting allocation info. If you are saying this worked with OMPI v3.x, I very much doubt the ORTE code changed when going to OMPI v4.x - though someone could easily check the relevant orte/mca/ras component to see.

rhc54 avatar Sep 20 '22 14:09 rhc54

@markalle Can you please take a look?

Perhaps some ORTE verbosity will shed some light on things?

gpaulsen avatar Sep 20 '22 14:09 gpaulsen

What parameters should I set?

gregfi avatar Sep 20 '22 15:09 gregfi

If you have built with --enable-debug, add --mca ras_base_verbose 10 to your mpirun cmd line.
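For example (placeholder app path):

bsub -n 16 -I mpirun --mca ras_base_verbose 10 --report-bindings /path/to/app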

rhc54 avatar Sep 20 '22 17:09 rhc54

Are these jobs running at the same time? If they're not running at the same time, then I don't think there's any overlap; they both look like 2-host jobs where

Job 1 is:
host bl3403 : 12 ranks
host bl3402 : 4 ranks

and Job 2 is:
host bl3402 : 8 ranks
host bl3401 : 8 ranks

But if they're both bsub'ed simultaneously and are both trying to use bl3402 at the same time, then I see what you're saying about overlap.

I don't actually remember which version of OMPI prints full-host affinity output vs. which would only show the cgroup it was handed and the binding relative to that cgroup... when it does the latter, the output looks kind of unclear IMO. My expectation is that if those LSF jobs were running at the same time, then LSF should have handed a different cgroup to each job and those cgroups shouldn't overlap each other.

I think those are probably all full-host affinity displays, but when in doubt I just stick my own function somewhere so I know what it's printing. E.g., something like:

#define _GNU_SOURCE        /* needed for sched_getaffinity() and the CPU_* macros */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sched.h>
#include <unistd.h>

/* Print "hostname:pid " followed by one digit per online processor:
 * 1 if this process is allowed to run on that CPU, 0 if not. */
void
print_affinity()
{
    int i, n;
    char hostname[64];
    char *str;
    cpu_set_t mask;
    n = sysconf(_SC_NPROCESSORS_ONLN);           /* number of online CPUs */
    sched_getaffinity(0, sizeof(mask), &mask);   /* affinity mask of this process */
    str = malloc(n + 256);
    if (!str) { return; }
    gethostname(hostname, 64);
    sprintf(str, "%s:%d ", hostname, getpid());
    for (i=0; i<n; ++i) {
        if (CPU_ISSET(i, &mask)) {
            strcat(str, "1");
        } else {
            strcat(str, "0");
        }
    }
    printf("%s\n", str);
    free(str);
}

int
main() {
    print_affinity();
    return(0);
}
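To try it, you could compile it as a standalone binary and launch it with the same mpirun/bsub invocation as your application; each launched process should print one line showing hostname:pid plus the 0/1 CPU map. E.g. (placeholder names):

gcc -o print_affinity print_affinity.c
bsub -n 16 -I mpirun ./print_affinity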

markalle avatar Sep 20 '22 17:09 markalle

Yes, both jobs were running at the same time.

Re-compiling with --enable-debug to try Ralph's suggestion. I'm not sure I understand @markalle's suggestion.

gregfi avatar Sep 20 '22 17:09 gregfi

Job #1: 7*bl3404; 9*bl3402

[bl3404:22417] mca: base: components_register: registering framework ras components
[bl3404:22417] mca: base: components_register: found loaded component lsf
[bl3404:22417] mca: base: components_register: component lsf has no register or open function
[bl3404:22417] mca: base: components_register: found loaded component simulator
[bl3404:22417] mca: base: components_register: component simulator register function successful
[bl3404:22417] mca: base: components_register: found loaded component tm
[bl3404:22417] mca: base: components_register: component tm register function successful
[bl3404:22417] mca: base: components_register: found loaded component slurm
[bl3404:22417] mca: base: components_register: component slurm register function successful
[bl3404:22417] mca: base: components_open: opening ras components
[bl3404:22417] mca: base: components_open: found loaded component lsf
[bl3404:22417] mca: base: components_open: component lsf open function successful
[bl3404:22417] mca: base: components_open: found loaded component simulator
[bl3404:22417] mca: base: components_open: found loaded component tm
[bl3404:22417] mca: base: components_open: component tm open function successful
[bl3404:22417] mca: base: components_open: found loaded component slurm
[bl3404:22417] mca: base: components_open: component slurm open function successful
[bl3404:22417] mca:base:select: Auto-selecting ras components
[bl3404:22417] mca:base:select:(  ras) Querying component [lsf]
[bl3404:22417] mca:base:select:(  ras) Query of component [lsf] set priority to 75
[bl3404:22417] mca:base:select:(  ras) Querying component [simulator]
[bl3404:22417] mca:base:select:(  ras) Querying component [tm]
[bl3404:22417] mca:base:select:(  ras) Querying component [slurm]
[bl3404:22417] mca:base:select:(  ras) Selected component [lsf]
[bl3404:22417] mca: base: close: unloading component simulator
[bl3404:22417] mca: base: close: unloading component tm
[bl3404:22417] mca: base: close: component slurm closed
[bl3404:22417] mca: base: close: unloading component slurm
[bl3404:22417] [[39824,0],0] ras:base:allocate
[bl3404:22417] ras/lsf: New Node (bl3404) [slots=1]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=2]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=3]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=4]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=5]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=6]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=7]
[bl3404:22417] ras/lsf: New Node (bl3402) [slots=1]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=2]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=3]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=4]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=5]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=6]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=7]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=8]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=9]
[bl3404:22417] [[39824,0],0] ras:base:node_insert inserting 2 nodes
[bl3404:22417] [[39824,0],0] ras:base:node_insert updating HNP [bl3404] info to 7 slots
[bl3404:22417] [[39824,0],0] ras:base:node_insert node bl3402 slots 9
[bl3404:22417] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3404:22417] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3404:22417] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3404:22417] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3404:22417] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3404:22417] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3404:22417] MCW rank 6 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3402:15133] MCW rank 7 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3402:15133] MCW rank 8 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3402:15133] MCW rank 9 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3402:15133] MCW rank 10 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3402:15133] MCW rank 11 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3402:15133] MCW rank 12 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3402:15133] MCW rank 13 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3402:15133] MCW rank 14 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]
[bl3402:15133] MCW rank 15 bound to socket 1[core 8[hwt 0]]: [./././././.][././B/././.]

Job #2: 12*bl3403; 4*bl3404

[bl3403:16514] mca: base: components_register: registering framework ras components
[bl3403:16514] mca: base: components_register: found loaded component lsf
[bl3403:16514] mca: base: components_register: component lsf has no register or open function
[bl3403:16514] mca: base: components_register: found loaded component simulator
[bl3403:16514] mca: base: components_register: component simulator register function successful
[bl3403:16514] mca: base: components_register: found loaded component tm
[bl3403:16514] mca: base: components_register: component tm register function successful
[bl3403:16514] mca: base: components_register: found loaded component slurm
[bl3403:16514] mca: base: components_register: component slurm register function successful
[bl3403:16514] mca: base: components_open: opening ras components
[bl3403:16514] mca: base: components_open: found loaded component lsf
[bl3403:16514] mca: base: components_open: component lsf open function successful
[bl3403:16514] mca: base: components_open: found loaded component simulator
[bl3403:16514] mca: base: components_open: found loaded component tm
[bl3403:16514] mca: base: components_open: component tm open function successful
[bl3403:16514] mca: base: components_open: found loaded component slurm
[bl3403:16514] mca: base: components_open: component slurm open function successful
[bl3403:16514] mca:base:select: Auto-selecting ras components
[bl3403:16514] mca:base:select:(  ras) Querying component [lsf]
[bl3403:16514] mca:base:select:(  ras) Query of component [lsf] set priority to 75
[bl3403:16514] mca:base:select:(  ras) Querying component [simulator]
[bl3403:16514] mca:base:select:(  ras) Querying component [tm]
[bl3403:16514] mca:base:select:(  ras) Querying component [slurm]
[bl3403:16514] mca:base:select:(  ras) Selected component [lsf]
[bl3403:16514] mca: base: close: unloading component simulator
[bl3403:16514] mca: base: close: unloading component tm
[bl3403:16514] mca: base: close: component slurm closed
[bl3403:16514] mca: base: close: unloading component slurm
[bl3403:16514] [[7874,0],0] ras:base:allocate
[bl3403:16514] ras/lsf: New Node (bl3403) [slots=1]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=2]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=3]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=4]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=5]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=6]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=7]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=8]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=9]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=10]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=11]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=12]
[bl3403:16514] ras/lsf: New Node (bl3404) [slots=1]
[bl3403:16514] ras/lsf: +++ Node (bl3404) [slots=2]
[bl3403:16514] ras/lsf: +++ Node (bl3404) [slots=3]
[bl3403:16514] ras/lsf: +++ Node (bl3404) [slots=4]
[bl3403:16514] [[7874,0],0] ras:base:node_insert inserting 2 nodes
[bl3403:16514] [[7874,0],0] ras:base:node_insert updating HNP [bl3403] info to 12 slots
[bl3403:16514] [[7874,0],0] ras:base:node_insert node bl3404 slots 4
[bl3403:16514] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3403:16514] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3403:16514] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3403:16514] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3403:16514] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3403:16514] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3403:16514] MCW rank 6 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3403:16514] MCW rank 7 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]
[bl3403:16514] MCW rank 8 bound to socket 1[core 8[hwt 0]]: [./././././.][././B/././.]
[bl3403:16514] MCW rank 9 bound to socket 1[core 9[hwt 0]]: [./././././.][./././B/./.]
[bl3403:16514] MCW rank 10 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
[bl3403:16514] MCW rank 11 bound to socket 1[core 11[hwt 0]]: [./././././.][./././././B]
[bl3404:22425] MCW rank 12 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3404:22425] MCW rank 13 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3404:22425] MCW rank 14 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3404:22425] MCW rank 15 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]

gregfi avatar Sep 20 '22 18:09 gregfi

Yeah, there is no isolation being provided here by LSF, and so they are going to overlap 100%. Nothing OMPI can do about it. You need to modify your bsub command to ensure isolated allocations.

rhc54 avatar Sep 20 '22 18:09 rhc54

@gregfi What @rhc54 is saying is that LSF is not telling Open MPI specifically which cores to use for each MPI process in the 2 jobs. Hence, Open MPI just starts binding from core 0 for the MPI processes in each job. Is there a way you can tell bsub to pass along such info?
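For example, if your LSF version supports affinity resource requirements, something along these lines might get LSF to carve out a distinct set of cores for each job (I'm guessing at the syntax -- check your LSF docs):

bsub -n 16 -R "affinity[core(1)]" -I mpirun /path/to/app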

All this being said, can you repeat the same test (run 2 jobs simultaneously on the same node) with Open MPI v3.x, and see if --report-bindings shows the same thing? It's curious that Open MPI 3.x jobs don't exhibit the same performance issue; it could be because there's some kind of regression somewhere where 3.x is handling affinity of multiple simultaneous jobs on the same node properly and 4.x is not. Re-running the --report-bindings test with 3.x would be helpful to determine if this is the case.

What @markalle was suggesting is a simple C program that shows more-or-less the same output as --report-bindings, but a bit more explicitly. Specifically, his program shows the affinity of a given process relative to the entire node on which the process is running. He suggested this because he wasn't sure whether --report-bindings showed that, or whether it only showed the binding relative to the cgroup (i.e., set of processors) which LSF created for that specific job.

jsquyres avatar Sep 20 '22 18:09 jsquyres

The output looks a little different on 3.1.0. I'm not sure what to make of it. Performance on both jobs is good. Here's Job 1 (12*bl3404; 4*bl3401):

[bl3401:00591] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00591] MCW rank 13 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3401:00591] MCW rank 14 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00591] MCW rank 15 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 5 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 7 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 10 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 11 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]

Here's Job 2 (8*bl3401; 8*bl3402):

[bl3401:00604] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00604] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3401:00604] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00604] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3401:00604] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00604] MCW rank 5 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3401:00604] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00604] MCW rank 7 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3402:17004] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3402:17004] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3402:17004] MCW rank 10 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3402:17004] MCW rank 11 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3402:17004] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3402:17004] MCW rank 13 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3402:17004] MCW rank 14 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3402:17004] MCW rank 15 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]

gregfi avatar Sep 20 '22 18:09 gregfi

Those two jobs are binding to socket, not core, and thus you won't see much difference as the procs are all sharing the processors in each socket anyway. Your prior example output showed mpirun binding each proc to core - hence the performance difference.

rhc54 avatar Sep 20 '22 18:09 rhc54