TCP connectivity problem in OpenMPI 4.1.4
Background information
What version of Open MPI are you using?
4.1.4
Describe how Open MPI was installed
Compiled from source
/openmpi-4.1.4/configure --with-tm=/local/xxxxxxxx/REQ0135770/torque-6.1.1/src --prefix=/tools/openmpi/4.1.4 --without-ucx --without-verbs --with-lsf=/tools/lsf/10.1 --with-lsf-libdir=/tools/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
Please describe the system on which you are running
- Operating system/version: SLES12-SP3
- Computer hardware: Intel Xeon class
- Network type: TCP over InfiniBand
Details of the problem
When I try the ring test (ring_c.c) across multiple hosts, I get the following error:
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: bl2609
PID: 26165
Message: connect() to 9.9.11.33:1048 failed
Error: Operation now in progress (115)
--------------------------------------------------------------------------
When I try the same test using OpenMPI 3.1.0, it works without issue. How can I identify and work around the problem?
Hello. Thanks for submitting an issue.
I'd be curious to see your mpirun command line. I usually use something like mpirun -host <host1>:4,<host2>:4 a.out to run 4 ranks on each node. Of course, if you're inside of an LSF or Torque allocation, it may auto-detect your job's allocation and launch that way.
NOTE: your configure option --with-lsf-libdir=/tools/lsf/10.1/lin, I would guess, should end in lib instead of lin?
Sorry - the configure line got chopped off; I edited the post above to correct it.
Yes, I'm submitting via LSF, so my mpirun line looks something like:
bsub -n 32 -I mpirun /path/to/ring_c
Is the IP address that it tried to connect to correct (9.9.11.33)?
Also, is there a reason you're using TCP over IB? That is known to be pretty slow compared to native IB protocols. I think early versions of TCP over IB had some reliability issues, too. You might just want to switch to building Open MPI with UCX and let Open MPI use the native IB protocols.
I think the IP address is correct, but there are some connectivity problems. What's puzzling is that OpenMPI 3.1.0 works. Is there any way to see what interface is being used by mpirun?
Yes, UCX would be preferable, but SLES12 is fairly old at this point, and the version of librdmacm that we have on the platform fails at configure time for UCX, so my understanding is that it falls back on TCP anyway. (That's why I disabled UCX in the build of OpenMPI.)
UCX/old SLES: ah, got it. I assume the cost (e.g., in time/resources) of upgrading to a newer OS is prohibitive.
That being said, it might not be that hard to get a new librdmacm + new UCX + Open MPI v4.x to work with UCX/native IB. E.g., install all of them into the same installation tree, and ensure that that installation tree appears first in your LD_LIBRARY_PATH (i.e., so that the new librdmacm will be found before the OS-installed librdmacm). Or, even better, if you can fully uninstall all the OS packages needed for IB support, install a whole new IB stack in an alternate location (e.g., /opt/hpc, or wherever Nvidia installs all of its stuff these days -- my point is to not install the libraries and whatnot under /usr/lib, or wherever your OS installs libraries by default). This would mean that there is zero confusion between the OS IB stack and a new/modern IB stack. Open MPI and UCX can definitely work in this kind of scenario, if you're interested in investigating it.
One big disclaimer: I don't follow the SLES distro and the IB software stacks these days; I don't know if there's anything in the SLES 12 kernel, for example, that would explicitly prohibit using a new librdmacm / new UCX. E.g., I don't know whether you'll need new IB kernel drivers or not.
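To illustrate the library-search-order point, a minimal sketch (the /opt/hpc-stack prefix is purely hypothetical):
# Install the new librdmacm, UCX, and Open MPI into one alternate prefix...
export STACK=/opt/hpc-stack
# ...then make sure that prefix is searched before the OS library directories
# at runtime, so the new librdmacm shadows the OS-installed one.
export LD_LIBRARY_PATH=$STACK/lib:$LD_LIBRARY_PATH
export PATH=$STACK/bin:$PATH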
All that being said, let's talk TCP.
Yes, you can make the TCP BTL be very chatty about what it is doing. Set the MCA parameter btl_base_verbose to 100; for example, mpirun --mca btl_base_verbose 100 .... I don't know the exact syntax for this using bsub. This should make the TCP BTL tell you which IP interface(s) it is using, etc.
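For example, combining it with the bsub invocation shown earlier (the application path is just a placeholder):
bsub -n 32 -I mpirun --mca btl_base_verbose 100 /path/to/ring_c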
Yes, unfortunately, upgrading the OS is a major undertaking and is not an option at this time.
I ran some additional tests with one of our parallel applications on a portion of our cluster that has been partitioned off for investigation of this issue. This portion does not seem to have the TCP connect() error, but it does exhibit another issue that I've seen with OpenMPI 4.1 versus 3.1: considerably more erratic performance.
These jobs all use 16 processes on systems that have 28 slots each, so there is relatively limited communication between hosts - many of the jobs should just be using vader. Here's the performance with OpenMPI 3.1.0:
Execution time on 16 processor(s): 9 min, 29.7 sec
Execution time on 16 processor(s): 6 min, 49.9 sec
Execution time on 16 processor(s): 7 min, 11.8 sec
Execution time on 16 processor(s): 7 min, 24.7 sec
Execution time on 16 processor(s): 7 min, 12.0 sec
Execution time on 16 processor(s): 10 min, 50.0 sec
Execution time on 16 processor(s): 7 min, 4.5 sec
Execution time on 16 processor(s): 7 min, 20.0 sec
Execution time on 16 processor(s): 6 min, 25.8 sec
Here's the same application compiled with OpenMPI 4.1.4:
Execution time on 16 processor(s): 15 min, 46.8 sec
Execution time on 16 processor(s): 25 min, 14.7 sec
Execution time on 16 processor(s): 25 min, 46.3 sec
Execution time on 16 processor(s): 13 min, 17.6 sec
Execution time on 16 processor(s): 18 min, 41.2 sec
Execution time on 16 processor(s): 45 min, 53.3 sec
Execution time on 16 processor(s): 20 min, 23.6 sec
Execution time on 16 processor(s): 21 min, 26.3 sec
Execution time on 16 processor(s): 20 min, 21.1 sec
I've attached outputs generated with --mca pml_base_verbose 100 --mca btl_base_verbose 100. Any idea where I should look to identify the problem here?
Bump. Any thoughts on how to narrow down the problem?
From the logs, Open MPI 3.1.0 uses both eth0 and ib0, but Open MPI 4.1.4 only uses eth0.
I suggest you try forcing ib0 and see how it goes:
mpirun --mca btl_tcp_if_include ib0 ...
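If editing the mpirun line inside bsub is awkward, the same thing can be expressed through Open MPI's environment-variable convention for MCA parameters, assuming bsub forwards your environment to the job:
export OMPI_MCA_btl_tcp_if_include=ib0
bsub -n 32 -I mpirun /path/to/ring_c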
@ggouaillardet is right. But I see that the v4.1.x log is also using the sppp interface -- I don't know what that is offhand.
In both versions of Open MPI, I'd suggest what @ggouaillardet suggested: force the use of ib0. Splitting network traffic over a much-slower eth0 and a much-faster ib0 can have weird performance effects.
Have you tried uninstalling the OS IB stack and installing your own, per my prior comment?
Forcing the use of ib0 with OpenMPI 4.1.4 does not seem to improve the performance.
Part of the dysfunction here may be differing versions of OFED being installed on the build machine as compared to the rest of the cluster. (I'm asking the admins to look into it.) I thought 3.1.0 was using TCP over IB, but that seems not to be correct - I see openib being cited in the 3.1.0 verbose output. OpenMPI 3.1.0 may have been compiled at a time prior to this mismatch.
If I force 3.1.0 to use tcp, the performance deteriorates a little bit, but not nearly to the extent seen in 4.1.4. So there still seems to be something causing tcp performance to drag in 4.1.4.
I just re-read your comments and see this:
These jobs all use 16 processes on systems that have 28 slots each, so there is relatively limited communication between hosts - many of the jobs should just be using vader. Here's the performance with OpenMPI 3.1.0:
Does this mean each run is on a single node, launching MPI processes on 16 out of 28 total cores?
There are three nodes in this special testing queue - each with 28 slots, so 84 slots in total. I submitted ten 16-process jobs to the queue, so about half of them would run entirely within a single node and the other half would be split between nodes.
Oh, that makes a huge difference.
If an MPI job is running entirely on a single node, it won't use TCP at all: it will use shared memory to communicate on-node. More generally, Open MPI processes will use shared memory (which is significantly faster than both TCP and native IB) to communicate with peers that are on the same node, and will use some kind of network to communicate with peers off-node.
So if your jobs end up having different numbers of on-node / off-node peers, that can certainly explain why there's variations in total execution times.
That being said, it doesn't explain why there's large differences between v3.x and v4.x. It would be good to get some apples-to-apples comparisons between v3.x and v4.x, though. Let's get the network out of the equation, and only test shared memory as an MPI transport. That avoids any questions about IPoIB.
Can you get some timings of all-on-one-node runs with Open MPI v3.x and v4.x?
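One way to force that comparison, reusing the host:slots syntax from earlier (host name and binary are placeholders), is to pin the whole job to one node and restrict the BTLs to shared memory:
mpirun --host bl3402:16 --mca btl vader,self ./your_app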
OK, the machines I was running on got wiped and re-inserted into the general population, and some other machines were swapped in to my partition of the network. These new machines are running SLES12-SP5, and the OFED version mismatch issue was sorted out. I re-compiled OpenMPI 4.1.4, and openib seems to be working better... mostly. I still see some messages to the effect of:
[bl3402:01258] rdmacm CPC only supported when the first QP is a PP QP; skipped
[bl3402:01258] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped
I'm not sure what these mean or how catastrophic they are, but the jobs seem to run with --mca btl ^tcp when spanning multiple hosts, so the openib BTL seems to be working in some capacity.
With --mca btl vader,self on 4.1.4, I get:
Execution time on 12 processor(s): 15 min, 44.9 sec
Execution time on 12 processor(s): 14 min, 35.3 sec
Execution time on 12 processor(s): 15 min, 31.1 sec
Execution time on 12 processor(s): 15 min, 1.0 sec
Execution time on 12 processor(s): 14 min, 41.0 sec
Execution time on 12 processor(s): 15 min, 26.4 sec
Execution time on 12 processor(s): 15 min, 29.7 sec
Execution time on 12 processor(s): 15 min, 27.5 sec
Execution time on 12 processor(s): 14 min, 42.8 sec
On Version 3.1.0, I get:
Execution time on 12 processor(s): 15 min, 35.2 sec
Execution time on 12 processor(s): 14 min, 31.4 sec
Execution time on 12 processor(s): 15 min, 27.8 sec
Execution time on 12 processor(s): 14 min, 28.2 sec
Execution time on 12 processor(s): 14 min, 59.4 sec
Execution time on 12 processor(s): 15 min, 39.5 sec
Execution time on 12 processor(s): 15 min, 30.8 sec
Execution time on 12 processor(s): 15 min, 16.6 sec
Execution time on 12 processor(s): 14 min, 45.6 sec
Practically equivalent performance. Interestingly, if I run with --mca btl ^tcp, I see the same inconsistent performance, with some jobs running very slowly. However, on the last (slowest) job that runs, performance improves dramatically when the other MPI jobs finish. Here are the times (in seconds) for each computational iteration that I see on the last running job:
100.152
101.964
99.710
101.042
102.910
102.894
102.817
102.995
102.481
102.479
82.162
35.575
35.576
35.578
35.599
35.600
35.607
Does that suggest some kind of network configuration issue?
Some clarifying questions:
- With your shared memory tests, are you running with 12 dedicated cores on a single host (and no other MPI processes on the node at the same time)? If not, can you explain exactly how the jobs are run?
- If you're able to run with openib, you should probably also be able to run with UCX. Have you tried that?
  - I keep asking about UCX because it is better supported than openib. Indeed, openib is disappearing in the upcoming Open MPI v5.0 -- the UCX PML will effectively be the only way to run on InfiniBand.
- With your ^tcp tests, are you mixing multiple jobs on the same host at the same time? Your comment about "performance improves dramatically when the other MPI jobs finish" suggests that there might be some overloading occurring -- i.e., multiple MPI processes are being bound to the same core. You might want to run with mpirun --report-bindings to see exactly which core(s) each process is being bound to.
- Yes, the shared memory tests are running with 12 dedicated cores and no other simultaneous processes.
- I've gotten the admins to install the UCX devel libraries, and I'm trying the configuration right now. It's an older version (1.4) that's distributed with the OS, but I'm hoping it can be made to work. (I see the warning about 1.8, but hopefully earlier versions are OK.)
- Yes, with the ^tcp jobs, I'm running 16-process jobs on 12-slot hosts, so the division is host1,host2 = 12,4 or 8,8, depending on the machine. I will rerun with --report-bindings and post the results.
FYI: You should be able to download and install a later version of UCX yourself (e.g., just install it under your $HOME, such as to $HOME/install/ucx or somesuch). It's a 100% userspace library; no special permissions are needed. Then you can build Open MPI with ./configure --with-ucx=$HOME/install/ucx ....
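A minimal sketch of that recipe, with placeholder version numbers and paths:
# Build UCX into a private prefix -- no root access needed
tar xf ucx-1.x.y.tar.gz && cd ucx-1.x.y
./configure --prefix=$HOME/install/ucx
make -j8 && make install
# Then point the Open MPI build at it
cd ../openmpi-4.1.4
./configure --prefix=$HOME/install/openmpi-4.1.4 --with-ucx=$HOME/install/ucx ...
make -j8 && make install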
Understood, but current UCX does not work with the version of librdmacm from the OS. In principle, I could install a newer version, but it would be far easier if the OS load set could be made to work.
Job #1, which is performing somewhat slowly, has:
[bl3403:19505] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3403:19505] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3403:19505] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3403:19505] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3403:19505] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3403:19505] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3403:19505] MCW rank 6 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3403:19505] MCW rank 7 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]
[bl3403:19505] MCW rank 8 bound to socket 1[core 8[hwt 0]]: [./././././.][././B/././.]
[bl3403:19505] MCW rank 9 bound to socket 1[core 9[hwt 0]]: [./././././.][./././B/./.]
[bl3403:19505] MCW rank 10 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
[bl3403:19505] MCW rank 11 bound to socket 1[core 11[hwt 0]]: [./././././.][./././././B]
[bl3402:18730] MCW rank 12 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3402:18730] MCW rank 13 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3402:18730] MCW rank 14 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3402:18730] MCW rank 15 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
Job #2, which is performing very slowly, has:
[bl3402:18717] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3402:18717] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3402:18717] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3402:18717] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3402:18717] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3402:18717] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3402:18717] MCW rank 6 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3402:18717] MCW rank 7 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]
[bl3401:02154] MCW rank 8 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3401:02154] MCW rank 9 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3401:02154] MCW rank 10 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3401:02154] MCW rank 11 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3401:02154] MCW rank 12 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3401:02154] MCW rank 13 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3401:02154] MCW rank 14 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3401:02154] MCW rank 15 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]
Seems like there's overlap, no?
I forgot about your librdmacm issue. Yes, you could install that manually, too -- it's also a 100% userspace library.
Yes, those 2 jobs definitely overlap -- that's why you're seeing dramatic slowdowns: multiple MPI processes are being bound to the same core, and therefore they're fighting for cycles.
At this point, I have to turn you back over to @gpaulsen because I don't know how Open MPI reads the LSF job info and decides which cores to use.
If you are running multiple mpirun calls that are receiving the same allocation information, then they will overlap, as they don't know about each other. It sounds to me like either an error in your bsub command or a bug in the ORTE internal code that reads the resulting allocation info. If you are saying this worked with OMPI v3.x, I very much doubt the ORTE code changed when going to OMPI v4.x - though someone could easily check the relevant orte/mca/ras component to see.
@markalle Can you please take a look?
Perhaps some ORTE verbosity will shed some light on things?
What parameters should I set?
If you have built with --enable-debug, add --mca ras_base_verbose 10 to your mpirun cmd line.
Are these jobs running at the same time? If they're not running at the same time, then I don't think there's any overlap; they both look like 2-host jobs where:
Job 1 is: host bl3403 : 12 ranks, host bl3402 : 4 ranks
and Job 2 is: host bl3402 : 8 ranks, host bl3401 : 8 ranks
But if they're both bsub'ed simultaneously and are both trying to use bl3402 at the same time, then I see what you're saying about overlap.
I don't actually remember which version of OMPI prints full-host affinity output vs. which would only show the cgroup it was handed and the binding relative to that cgroup... When it does the latter, the output looks kind of unclear, IMO. My expectation is that if those LSF jobs were running at the same time, then LSF should have handed a different cgroup to each job, and those cgroups shouldn't overlap each other.
I think those are probably all full-host affinity displays, but when in doubt I just stick my own function somewhere so I know what it's printing. E.g., something like:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sched.h>
#include <unistd.h>

/* Print "hostname:pid" followed by a 0/1 string with one digit per online
 * processor: 1 = this process may run on that CPU, 0 = it may not. */
void
print_affinity()
{
    int i, n;
    char hostname[64];
    char *str;
    cpu_set_t mask;

    n = sysconf(_SC_NPROCESSORS_ONLN);         /* number of online processors */
    sched_getaffinity(0, sizeof(mask), &mask); /* affinity mask of this process */
    str = malloc(n + 256);
    if (!str) { return; }
    gethostname(hostname, 64);
    sprintf(str, "%s:%d ", hostname, getpid());
    for (i = 0; i < n; ++i) {
        if (CPU_ISSET(i, &mask)) {
            strcat(str, "1");
        } else {
            strcat(str, "0");
        }
    }
    printf("%s\n", str);
    free(str);
}

int
main() {
    print_affinity();
    return (0);
}
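One way to use it (the file name is just a placeholder; the program makes no MPI calls, so mpirun simply launches plain copies that each print the affinity mask they inherited):
gcc -o print_affinity print_affinity.c
mpirun -np 16 ./print_affinity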
Yes, both jobs were running at the same time.
Re-compiling with --enable-debug to try Ralph's suggestion. I'm not sure I understand @markalle's suggestion.
Job #1: 7*bl3404; 9*bl3402
[bl3404:22417] mca: base: components_register: registering framework ras components
[bl3404:22417] mca: base: components_register: found loaded component lsf
[bl3404:22417] mca: base: components_register: component lsf has no register or open function
[bl3404:22417] mca: base: components_register: found loaded component simulator
[bl3404:22417] mca: base: components_register: component simulator register function successful
[bl3404:22417] mca: base: components_register: found loaded component tm
[bl3404:22417] mca: base: components_register: component tm register function successful
[bl3404:22417] mca: base: components_register: found loaded component slurm
[bl3404:22417] mca: base: components_register: component slurm register function successful
[bl3404:22417] mca: base: components_open: opening ras components
[bl3404:22417] mca: base: components_open: found loaded component lsf
[bl3404:22417] mca: base: components_open: component lsf open function successful
[bl3404:22417] mca: base: components_open: found loaded component simulator
[bl3404:22417] mca: base: components_open: found loaded component tm
[bl3404:22417] mca: base: components_open: component tm open function successful
[bl3404:22417] mca: base: components_open: found loaded component slurm
[bl3404:22417] mca: base: components_open: component slurm open function successful
[bl3404:22417] mca:base:select: Auto-selecting ras components
[bl3404:22417] mca:base:select:( ras) Querying component [lsf]
[bl3404:22417] mca:base:select:( ras) Query of component [lsf] set priority to 75
[bl3404:22417] mca:base:select:( ras) Querying component [simulator]
[bl3404:22417] mca:base:select:( ras) Querying component [tm]
[bl3404:22417] mca:base:select:( ras) Querying component [slurm]
[bl3404:22417] mca:base:select:( ras) Selected component [lsf]
[bl3404:22417] mca: base: close: unloading component simulator
[bl3404:22417] mca: base: close: unloading component tm
[bl3404:22417] mca: base: close: component slurm closed
[bl3404:22417] mca: base: close: unloading component slurm
[bl3404:22417] [[39824,0],0] ras:base:allocate
[bl3404:22417] ras/lsf: New Node (bl3404) [slots=1]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=2]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=3]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=4]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=5]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=6]
[bl3404:22417] ras/lsf: +++ Node (bl3404) [slots=7]
[bl3404:22417] ras/lsf: New Node (bl3402) [slots=1]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=2]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=3]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=4]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=5]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=6]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=7]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=8]
[bl3404:22417] ras/lsf: +++ Node (bl3402) [slots=9]
[bl3404:22417] [[39824,0],0] ras:base:node_insert inserting 2 nodes
[bl3404:22417] [[39824,0],0] ras:base:node_insert updating HNP [bl3404] info to 7 slots
[bl3404:22417] [[39824,0],0] ras:base:node_insert node bl3402 slots 9
[bl3404:22417] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3404:22417] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3404:22417] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3404:22417] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3404:22417] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3404:22417] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3404:22417] MCW rank 6 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3402:15133] MCW rank 7 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3402:15133] MCW rank 8 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3402:15133] MCW rank 9 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3402:15133] MCW rank 10 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3402:15133] MCW rank 11 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3402:15133] MCW rank 12 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3402:15133] MCW rank 13 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3402:15133] MCW rank 14 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]
[bl3402:15133] MCW rank 15 bound to socket 1[core 8[hwt 0]]: [./././././.][././B/././.]
Job #2: 12*bl3403; 4*bl3404
[bl3403:16514] mca: base: components_register: registering framework ras components
[bl3403:16514] mca: base: components_register: found loaded component lsf
[bl3403:16514] mca: base: components_register: component lsf has no register or open function
[bl3403:16514] mca: base: components_register: found loaded component simulator
[bl3403:16514] mca: base: components_register: component simulator register function successful
[bl3403:16514] mca: base: components_register: found loaded component tm
[bl3403:16514] mca: base: components_register: component tm register function successful
[bl3403:16514] mca: base: components_register: found loaded component slurm
[bl3403:16514] mca: base: components_register: component slurm register function successful
[bl3403:16514] mca: base: components_open: opening ras components
[bl3403:16514] mca: base: components_open: found loaded component lsf
[bl3403:16514] mca: base: components_open: component lsf open function successful
[bl3403:16514] mca: base: components_open: found loaded component simulator
[bl3403:16514] mca: base: components_open: found loaded component tm
[bl3403:16514] mca: base: components_open: component tm open function successful
[bl3403:16514] mca: base: components_open: found loaded component slurm
[bl3403:16514] mca: base: components_open: component slurm open function successful
[bl3403:16514] mca:base:select: Auto-selecting ras components
[bl3403:16514] mca:base:select:( ras) Querying component [lsf]
[bl3403:16514] mca:base:select:( ras) Query of component [lsf] set priority to 75
[bl3403:16514] mca:base:select:( ras) Querying component [simulator]
[bl3403:16514] mca:base:select:( ras) Querying component [tm]
[bl3403:16514] mca:base:select:( ras) Querying component [slurm]
[bl3403:16514] mca:base:select:( ras) Selected component [lsf]
[bl3403:16514] mca: base: close: unloading component simulator
[bl3403:16514] mca: base: close: unloading component tm
[bl3403:16514] mca: base: close: component slurm closed
[bl3403:16514] mca: base: close: unloading component slurm
[bl3403:16514] [[7874,0],0] ras:base:allocate
[bl3403:16514] ras/lsf: New Node (bl3403) [slots=1]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=2]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=3]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=4]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=5]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=6]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=7]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=8]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=9]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=10]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=11]
[bl3403:16514] ras/lsf: +++ Node (bl3403) [slots=12]
[bl3403:16514] ras/lsf: New Node (bl3404) [slots=1]
[bl3403:16514] ras/lsf: +++ Node (bl3404) [slots=2]
[bl3403:16514] ras/lsf: +++ Node (bl3404) [slots=3]
[bl3403:16514] ras/lsf: +++ Node (bl3404) [slots=4]
[bl3403:16514] [[7874,0],0] ras:base:node_insert inserting 2 nodes
[bl3403:16514] [[7874,0],0] ras:base:node_insert updating HNP [bl3403] info to 12 slots
[bl3403:16514] [[7874,0],0] ras:base:node_insert node bl3404 slots 4
[bl3403:16514] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3403:16514] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3403:16514] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3403:16514] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3403:16514] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3403:16514] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3403:16514] MCW rank 6 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3403:16514] MCW rank 7 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]
[bl3403:16514] MCW rank 8 bound to socket 1[core 8[hwt 0]]: [./././././.][././B/././.]
[bl3403:16514] MCW rank 9 bound to socket 1[core 9[hwt 0]]: [./././././.][./././B/./.]
[bl3403:16514] MCW rank 10 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
[bl3403:16514] MCW rank 11 bound to socket 1[core 11[hwt 0]]: [./././././.][./././././B]
[bl3404:22425] MCW rank 12 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3404:22425] MCW rank 13 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3404:22425] MCW rank 14 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3404:22425] MCW rank 15 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
Yeah, there is no isolation being provided here by LSF, and so they are going to overlap 100%. Nothing OMPI can do about it. You need to modify your bsub command to ensure isolated allocations.
@gregfi What @rhc54 is saying is that LSF is not telling Open MPI specifically which cores to use for each MPI process in the 2 jobs. Hence, Open MPI is just starting at core 0 for the MPI processes in each job. Is there a way you can tell bsub to pass along such info?
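For what it's worth, LSF does have an affinity resource-requirement syntax that is supposed to hand each task its own core; whether your site's LSF version and configuration honor it (and pass it through to Open MPI) is an assumption worth checking with your admins:
bsub -n 16 -R "affinity[core(1)]" -I mpirun ./your_app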
All this being said, can you repeat the same test (run 2 jobs simultaneously on the same node) with Open MPI v3.x and see if --report-bindings shows the same thing? It's curious that Open MPI 3.x jobs don't exhibit the same performance issue; it could be because there's some kind of regression somewhere, where 3.x is handling affinity of multiple simultaneous jobs on the same node properly and 4.x is not. Re-running the --report-bindings test with 3.x would be helpful to determine if this is the case.
What @markalle was suggesting was a simple C program that shows more-or-less the same output as --report-bindings, but a bit more explicitly. Specifically, his program shows the affinity of a given process relative to the entire node on which the process is running. He suggested this because he wasn't sure whether --report-bindings showed that, or whether it only showed the binding relative to the cgroup (i.e., set of processors) which LSF created for that specific job.
The output looks a little different on 3.1.0. I'm not sure what to make of it. Performance on both jobs is good. Here's Job 1 (12*bl3404; 4*bl3401):
[bl3401:00591] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00591] MCW rank 13 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3401:00591] MCW rank 14 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00591] MCW rank 15 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 5 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 7 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3404:24348] MCW rank 10 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3404:24348] MCW rank 11 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
Here's Job 2 (8*bl3401; 8*bl3402):
[bl3401:00604] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00604] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3401:00604] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00604] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3401:00604] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00604] MCW rank 5 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3401:00604] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3401:00604] MCW rank 7 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3402:17004] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3402:17004] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3402:17004] MCW rank 10 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3402:17004] MCW rank 11 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3402:17004] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3402:17004] MCW rank 13 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[bl3402:17004] MCW rank 14 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[bl3402:17004] MCW rank 15 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
Those two jobs are binding to socket, not core, and thus you won't see much difference, as the procs are all sharing the processors in each socket anyway. Your prior example output showed mpirun binding each proc to core - hence the performance difference.
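If you want to confirm that, or work around the contention until the allocation overlap is sorted out, one option is to ask the v4.x mpirun for socket-level binding explicitly (this sidesteps the symptom, not the underlying lack of isolation from LSF):
mpirun --bind-to socket --report-bindings ./your_app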