
IMB-EXT stalls using openmpi 2.1.3

nmorey opened this issue 7 years ago • 26 comments

Running IMB-EXT from Intel (R) MPI Benchmarks 2018 Update 1 on a SLE12-SP3 system:

rdma03:~/hpc-testing/:[0]# ompi_info --version
Open MPI v2.1.3.0.cfd8f3f34e27
rdma03:~/hpc-testing/:[0]# lspci | grep Mell
02:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
rdma04:~/:[0]# lspci | grep Mell
0a:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]

Running with this command stalls:

rdma03:~/hpc-testing/:[0]# mpirun --host 192.168.0.1,192.168.0.2 -np 2  --allow-run-as-root --mca btl openib /usr/lib64/mpi/gcc/openmpi2/tests/IMB/IMB-EXT
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018 Update 1, MPI-2 part    
#------------------------------------------------------------
# Date                  : Tue Mar 27 10:42:19 2018
# Machine               : x86_64
# System                : Linux
# Release               : 4.4.120-94.17-default
# Version               : #1 SMP Wed Mar 14 17:23:00 UTC 2018 (cf3a7bb)
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# /usr/lib64/mpi/gcc/openmpi2/tests/IMB/IMB-EXT

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Window
# Unidir_Get
# Unidir_Put
# Bidir_Get
# Bidir_Put
# Accumulate

And then it stalls like this. Both nodes have an IMB-EXT process running at 100% CPU.

On first host:

(gdb) bt
#0  opal_atomic_unlock (lock=0x7f0fd17454e4 <mca_coll_libnbc_component+708>) at ../../../../opal/include/opal/sys/atomic_impl.h:435
#1  ompi_coll_libnbc_progress () at coll_libnbc_component.c:295
#2  0x00007f0fe2aadd94 in opal_progress () at runtime/opal_progress.c:226
#3  0x00007f0fe360b515 in sync_wait_st (sync=<optimized out>) at ../opal/threads/wait_sync.h:80
#4  ompi_request_default_wait_all (count=2, requests=0x7ffde91a2f20, statuses=0x0) at request/req_wait.c:221
#5  0x00007f0fe3652c7c in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=<optimized out>, rbuf=0x7ffde91a3010, count=4, dtype=0x7f0fe3897a40 <ompi_mpi_long>, op=<optimized out>, comm=<optimized out>, module=0x103c3a0)
    at base/coll_base_allreduce.c:225
#6  0x00007f0fd06d6f2c in ompi_osc_rdma_check_parameters (size=0, disp_unit=1, module=0x103a140) at osc_rdma_component.c:1054
#7  ompi_osc_rdma_component_select (win=0x103a060, base=0x7ffde91a3088, size=0, disp_unit=1, comm=0x1037cd0, info=0x6136a0 <ompi_mpi_info_null>, flavor=1, model=0x7ffde91a3094) at osc_rdma_component.c:1182
#8  0x00007f0fe360ec2c in ompi_win_create (base=base@entry=0x1003fd0, size=size@entry=0, disp_unit=disp_unit@entry=1, comm=comm@entry=0x1037cd0, info=0x6136a0 <ompi_mpi_info_null>, newwin=newwin@entry=0x7ffde91a3418) at win/win.c:236
#9  0x00007f0fe363e9dc in PMPI_Win_create (base=0x1003fd0, size=0, disp_unit=1, info=<optimized out>, comm=0x1037cd0, win=0x7ffde91a3418) at pwin_create.c:79
#10 0x000000000040a2f8 in IMB_window ()
#11 0x0000000000406c34 in IMB_init_buffers_iter ()
#12 0x0000000000402448 in main ()

On the second host

(gdb) bt
#0  0x00007f08f56f333e in poll_cq (cqe_ver=0, wc=<optimized out>, ne=<optimized out>, ibcq=0x1f80e00) at ../providers/mlx5/cq.c:931
#1  mlx5_poll_cq (ibcq=0x1f80e00, ne=256, wc=<optimized out>) at ../providers/mlx5/cq.c:1221
#2  0x00007f08ef3c4cc7 in ibv_poll_cq (wc=0x7fff9ccbe810, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/verbs.h:2055
#3  poll_device (device=device@entry=0x1ecac00, count=count@entry=0) at btl_openib_component.c:3581
#4  0x00007f08ef3c5aad in progress_one_device (device=0x1ecac00) at btl_openib_component.c:3714
#5  btl_openib_component_progress () at btl_openib_component.c:3738
#6  0x00007f08fea16d94 in opal_progress () at runtime/opal_progress.c:226
#7  0x00007f08ff573e55 in ompi_request_wait_completion (req=0x256e300) at ../ompi/request/request.h:392
#8  ompi_request_default_wait (req_ptr=0x7fff9ccc19a8, status=0x7fff9ccc19b0) at request/req_wait.c:41
#9  0x00007f08ff5c21ca in ompi_coll_base_sendrecv_zero (stag=-16, rtag=-16, comm=0x23e7570, source=0, dest=0) at base/coll_base_barrier.c:63
#10 ompi_coll_base_barrier_intra_two_procs (comm=0x23e7570, module=<optimized out>) at base/coll_base_barrier.c:296
#11 0x00007f08ec8f86a7 in component_select (win=0x2383ed0, base=0x7fff9ccc1aa8, size=0, disp_unit=1, comm=0x22bb4e0, info=0x6136a0 <ompi_mpi_info_null>, flavor=1, model=0x7fff9ccc1ab4) at osc_pt2pt_component.c:416
#12 0x00007f08ff577c2c in ompi_win_create (base=base@entry=0x21b7fc0, size=size@entry=0, disp_unit=disp_unit@entry=1, comm=comm@entry=0x22bb4e0, info=0x6136a0 <ompi_mpi_info_null>, newwin=newwin@entry=0x7fff9ccc1e38) at win/win.c:236
#13 0x00007f08ff5a79dc in PMPI_Win_create (base=0x21b7fc0, size=0, disp_unit=1, info=<optimized out>, comm=0x22bb4e0, win=0x7fff9ccc1e38) at pwin_create.c:79
#14 0x000000000040a2f8 in IMB_window ()
#15 0x0000000000406c34 in IMB_init_buffers_iter ()
#16 0x0000000000402448 in main ()
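
For reference, a minimal sketch of the call both processes are stuck in (a hypothetical standalone reproducer, not taken from IMB): a zero-byte window with disp_unit 1, matching the MPI_Win_create arguments in the backtraces above.

/* Hypothetical standalone reproducer sketch: creates the same kind of
 * zero-byte window (disp_unit 1) that IMB_window() sets up first.
 * When the two ranks select different osc components, both spin inside
 * MPI_Win_create, as in the backtraces above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Win win;
    char buf[1];

    MPI_Init(&argc, &argv);

    MPI_Win_create(buf, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_free(&win);
    MPI_Finalize();
    printf("window created and freed\n");
    return 0;
}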

nmorey avatar Mar 27 '18 08:03 nmorey

IIRC that is a known issue: the two hosts have different hardware and hence end up selecting different osc components.

As a workaround, you can run

mpirun --mca osc ^rdma ...

ggouaillardet avatar Mar 27 '18 11:03 ggouaillardet

I'm redeploying the servers right now. I'll test this ASAP.

$ /usr/lib64/mpi/gcc/openmpi2/bin/ompi_info | grep osc
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v2.1.2)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v2.1.2)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v2.1.2)

As both adapters are InfiniBand, shouldn't both use rdma automatically?

nmorey avatar Mar 27 '18 11:03 nmorey

Also:

  • the IMB-MPI1 benchmark works fine in this setup
  • both IMB-MPI1 and IMB-EXT work fine with openmpi 1.10.7

nmorey avatar Mar 27 '18 11:03 nmorey

osc is for one-sided communications (not exercised by IMB-MPI1), and I do not think there is an osc/rdma component in 1.10.

You can

mpirun --mca osc_base_verbose 10 ...

to see which component is selected.

ggouaillardet avatar Mar 27 '18 12:03 ggouaillardet

Here's what I get

[rdma03:14356] mca: base: components_register: registering framework osc components
[rdma03:14356] mca: base: components_register: found loaded component pt2pt
[rdma03:14356] mca: base: components_register: component pt2pt register function successful
[rdma03:14356] mca: base: components_register: found loaded component rdma
[rdma03:14356] mca: base: components_register: component rdma register function successful
[rdma03:14356] mca: base: components_register: found loaded component sm
[rdma03:14356] mca: base: components_register: component sm has no register or open function
[rdma03:14356] mca: base: components_open: opening osc components
[rdma03:14356] mca: base: components_open: found loaded component pt2pt
[rdma03:14356] mca: base: components_open: found loaded component rdma
[rdma03:14356] mca: base: components_open: found loaded component sm
[rdma03:14356] mca: base: components_open: component sm open function successful
[rdma04:12703] mca: base: components_register: registering framework osc components
[rdma04:12703] mca: base: components_register: found loaded component pt2pt
[rdma04:12703] mca: base: components_register: component pt2pt register function successful
[rdma04:12703] mca: base: components_register: found loaded component rdma
[rdma04:12703] mca: base: components_register: component rdma register function successful
[rdma04:12703] mca: base: components_register: found loaded component sm
[rdma04:12703] mca: base: components_register: component sm has no register or open function
[rdma04:12703] mca: base: components_open: opening osc components
[rdma04:12703] mca: base: components_open: found loaded component pt2pt
[rdma04:12703] mca: base: components_open: found loaded component rdma
[rdma04:12703] mca: base: components_open: found loaded component sm
[rdma04:12703] mca: base: components_open: component sm open function successful

I do not think there is an osc/rdma component in 1.10

There seems to be one:

[(master) nmorey@portia:openmpi ((v1.10.7^0) %)]$ ll ompi/mca/osc/
total 120
drwxr-xr-x 2 nmorey users   146 Mar 27 13:47 base
-rw-r--r-- 1 nmorey users  1139 Mar 27 13:47 Makefile.am
-rw-r--r-- 1 nmorey users 92603 Nov 20 16:58 Makefile.in
-rw-r--r-- 1 nmorey users 19791 Mar 27 13:47 osc.h
drwxr-xr-x 2 nmorey users   278 Mar 27 13:47 portals4
drwxr-xr-x 2 nmorey users  4096 Mar 27 13:47 pt2pt
drwxr-xr-x 2 nmorey users    25 Mar 27 13:47 rdma
drwxr-xr-x 2 nmorey users   168 Mar 27 13:47 sm

nmorey avatar Mar 27 '18 12:03 nmorey

I will double-check that.

What happens when you blacklist the osc/rdma component?

ggouaillardet avatar Mar 27 '18 12:03 ggouaillardet

Is this the only log you get when the benchmark hangs?

ggouaillardet avatar Mar 27 '18 12:03 ggouaillardet

Doing this gets it working:

mpirun  --host 192.168.0.1,192.168.0.2 -np 2  --allow-run-as-root --mca btl openib --mca osc ^rdma  /usr/lib64/mpi/gcc/openmpi2/tests/IMB/IMB-EXT

nmorey avatar Mar 27 '18 12:03 nmorey

Is this the only log you get when the benchmark hangs?

No warning/error, just the last printf hanging there.

nmorey avatar Mar 27 '18 12:03 nmorey

Can you collect the same traces with IMB-EXT and 1.10?

ggouaillardet avatar Mar 27 '18 12:03 ggouaillardet

You're right, the osc/rdma component is not available in 1.10.7 (at least in our build).

Using openmpi 1.10.7

rdma03:~/:[0]# mpirun --mca osc_base_verbose 10 --host 192.168.0.1,192.168.0.2 -np 2  --allow-run-as-root /usr/lib64/mpi/gcc/openmpi/tests/IMB/IMB-EXT
[rdma04:14655] mca: base: components_register: registering osc components
[rdma04:14655] mca: base: components_register: found loaded component pt2pt
[rdma04:14655] mca: base: components_register: component pt2pt register function successful
[rdma04:14655] mca: base: components_register: found loaded component sm
[rdma04:14655] mca: base: components_register: component sm has no register or open function
[rdma04:14655] mca: base: components_open: opening osc components
[rdma04:14655] mca: base: components_open: found loaded component pt2pt
[rdma04:14655] mca: base: components_open: component pt2pt open function successful
[rdma04:14655] mca: base: components_open: found loaded component sm
[rdma04:14655] mca: base: components_open: component sm open function successful
[rdma03:17554] mca: base: components_register: registering osc components
[rdma03:17554] mca: base: components_register: found loaded component pt2pt
[rdma03:17554] mca: base: components_register: component pt2pt register function successful
[rdma03:17554] mca: base: components_register: found loaded component sm
[rdma03:17554] mca: base: components_register: component sm has no register or open function
[rdma03:17554] mca: base: components_open: opening osc components
[rdma03:17554] mca: base: components_open: found loaded component pt2pt
[rdma03:17554] mca: base: components_open: component pt2pt open function successful
[rdma03:17554] mca: base: components_open: found loaded component sm
[rdma03:17554] mca: base: components_open: component sm open function successful
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'rdma03', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018 Update 1, MPI-2 part    
#------------------------------------------------------------
# Date                  : Tue Mar 27 14:40:24 2018
# Machine               : x86_64
# System                : Linux
# Release               : 4.4.73-5-default
# Version               : #1 SMP Tue Jul 4 15:33:39 UTC 2017 (b7ce4e4)
# MPI Version           : 3.0
# MPI Thread Environment: 


# Calling sequence was: 

# /usr/lib64/mpi/gcc/openmpi/tests/IMB/IMB-EXT

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Window
# Unidir_Get
# Unidir_Put
# Bidir_Get
# Bidir_Put
# Accumulate
[rdma03:17554] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma04:14655] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma04:14655] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma04:14655] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma04:14655] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma04:14655] pt2pt component destroying window with id 4

[Cut here as it goes on and on]

nmorey avatar Mar 27 '18 12:03 nmorey

Any chance you could test the latest master? I cannot remember whether we already fixed that (in which case only a backport would be needed).

@hjelmn any recollection of this issue ?

ggouaillardet avatar Mar 27 '18 13:03 ggouaillardet

I have an openmpi 3.0.0 package available that I can test quickly if that's of any interest. Anything else will need some more time.

nmorey avatar Mar 27 '18 13:03 nmorey

That will be enough for now, thanks

ggouaillardet avatar Mar 27 '18 13:03 ggouaillardet

@ggouaillardet openmpi 3.0.0 behaves exactly like 2.1.3 and stalls

nmorey avatar Mar 27 '18 14:03 nmorey

It seems this has never been fixed, even on master.

Can you please give the inline patch below a try? This is really a proof of concept at this stage.

diff --git a/ompi/mca/osc/rdma/osc_rdma_component.c b/ompi/mca/osc/rdma/osc_rdma_component.c
index b5c544a..db450ca 100644
--- a/ompi/mca/osc/rdma/osc_rdma_component.c
+++ b/ompi/mca/osc/rdma/osc_rdma_component.c
@@ -767,6 +767,7 @@ static int ompi_osc_rdma_query_btls (ompi_communicator_t *comm, struct mca_btl_b
     int *btl_counts = NULL;
     char **btls_to_use;
     void *tmp;
+    int tmps[3];
 
     btls_to_use = opal_argv_split (ompi_osc_rdma_btl_names, ',');
     if (btls_to_use) {
@@ -793,6 +794,20 @@ static int ompi_osc_rdma_query_btls (ompi_communicator_t *comm, struct mca_btl_b
         *btl = selected_btl;
     }
 
+    tmps[0] = (NULL==selected_btl)?0:1;
+    rc = comm->c_coll->coll_allreduce(tmps, tmps+1, 1, MPI_INT, MPI_MAX, comm, comm->c_coll->coll_allreduce_module);
+    if (OMPI_SUCCESS != rc) {
+        return rc;
+    }
+    tmps[2] = (tmps[0] == tmps[1]) ? 1 : 0;
+    rc = comm->c_coll->coll_allreduce(tmps+2, tmps, 1, MPI_INT, MPI_MIN, comm, comm->c_coll->coll_allreduce_module);
+    if (OMPI_SUCCESS != rc) {
+        return rc;
+    }
+    if (!tmps[0]) {
+        return OMPI_ERR_NOT_AVAILABLE;
+    }
+
     if (NULL != selected_btl) {
         OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_INFO, "selected btl: %s",
                          selected_btl->btl_component->btl_version.mca_component_name);

ggouaillardet avatar Mar 28 '18 01:03 ggouaillardet

@ggouaillardet Not a configuration I have or care about. If your patch fixes it, let me know. BTW, you can get the same result using a single allreduce:

tmps[0] = (NULL==selected_btl)?0:1; tmps[1] = -tmps[0];
rc = comm->c_coll->coll_allreduce(MPI_IN_PLACE, tmps, 2, MPI_INT, MPI_MAX, comm, comm->c_coll->coll_allreduce_module);
if (tmps[0] != -tmps[1]) {
    /* results differ */
    return OMPI_ERR_NOT_AVAILABLE;
}
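
A standalone sketch (hypothetical test program, not part of any patch) of the same agreement check with plain MPI calls, to show why one MAX allreduce over (x, -x) is enough:

/* Every rank contributes x in {0,1} (1 = "I found a usable btl").
 * After MPI_MAX over (x, -x), tmps[0] is the global max of x and
 * -tmps[1] is the global min of x; they match only if all ranks agreed. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, tmps[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int have_btl = (rank == 0) ? 1 : 0;   /* pretend only rank 0 found a btl */
    tmps[0] = have_btl;
    tmps[1] = -have_btl;

    MPI_Allreduce(MPI_IN_PLACE, tmps, 2, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    if (tmps[0] != -tmps[1]) {
        /* ranks disagree: the osc/rdma patch would return OMPI_ERR_NOT_AVAILABLE here */
        if (rank == 0) printf("btl selection differs across ranks\n");
    } else {
        if (rank == 0) printf("all ranks agree (%d)\n", tmps[0]);
    }

    MPI_Finalize();
    return 0;
}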

hjelmn avatar Mar 28 '18 01:03 hjelmn

Though I do find it odd that Connect-IB doesn't select the verbs btl.

Will not be an issue when the uct btl is in place. For reference see #4919. Will probably go in later this week once I have verified it works with IB.

hjelmn avatar Mar 28 '18 01:03 hjelmn

@hjelmn thanks for the comment, I will definitely use a single allreduce.

ggouaillardet avatar Mar 28 '18 02:03 ggouaillardet

@ggouaillardet Had to fix a compile error in your patch (s/comm->c_coll->/comm->c_coll./g), but it fixes the issue.

nmorey avatar Mar 28 '18 07:03 nmorey

Here is a more correct patch:

[EDIT] use MPI_MIN instead of MPI_MAX

diff --git a/ompi/mca/osc/rdma/osc_rdma_component.c b/ompi/mca/osc/rdma/osc_rdma_component.c
index b145395..069c9dc 100644
--- a/ompi/mca/osc/rdma/osc_rdma_component.c
+++ b/ompi/mca/osc/rdma/osc_rdma_component.c
@@ -372,6 +372,8 @@ static int ompi_osc_rdma_component_query (struct ompi_win_t *win, void **base, s
                                           int flavor)
 {
 
+    int rc;
+
     if (MPI_WIN_FLAVOR_SHARED == flavor) {
         return -1;
     }
@@ -385,15 +387,18 @@ static int ompi_osc_rdma_component_query (struct ompi_win_t *win, void **base, s
     }
 #endif /* OPAL_CUDA_SUPPORT */
 
-    if (OMPI_SUCCESS == ompi_osc_rdma_query_mtls ()) {
+    rc = ompi_osc_rdma_query_mtls ();
+    rc = comm->c_coll->coll_allreduce(MPI_IN_PLACE, &rc, 1, MPI_INT, MPI_MIN, comm, comm->c_coll->coll_allreduce_module);
+    if (OMPI_SUCCESS == rc) {
         return 5; /* this has to be lower that osc pt2pt default priority */
     }
 
-    if (OMPI_SUCCESS != ompi_osc_rdma_query_btls (comm, NULL)) {
+    rc = ompi_osc_rdma_query_btls (comm, NULL);
+    rc = comm->c_coll->coll_allreduce(MPI_IN_PLACE, &rc, 1, MPI_INT,  MPI_MIN, comm, comm->c_coll->coll_allreduce_module);
+    if (OMPI_SUCCESS != rc) {
         return -1;
     }
 
-
     return mca_osc_rdma_component.priority;
 }

Similar porting has to be done for the v2.x series.
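
A hedged sketch of what that v2.x porting might look like for the query_btls branch (untested; it only applies the c_coll struct-vs-pointer difference noted above):

/* v2.x sketch only: on that branch c_coll is a struct member, not a pointer,
 * so the allreduce added above would be spelled with '.' instead of '->'. */
rc = ompi_osc_rdma_query_btls (comm, NULL);
rc = comm->c_coll.coll_allreduce(MPI_IN_PLACE, &rc, 1, MPI_INT, MPI_MIN,
                                 comm, comm->c_coll.coll_allreduce_module);
if (OMPI_SUCCESS != rc) {
    return -1;
}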

I will resume my work next week

ggouaillardet avatar Mar 28 '18 08:03 ggouaillardet

Keep in mind that the patch will hurt performance for RMA. If the two systems can talk over InfiniBand and you want performance, you need to figure out why one of the systems is not getting a valid openib btl module.

hjelmn avatar Mar 28 '18 16:03 hjelmn

I will look into that. But does the patch have an impact on systems that work as expected?

nmorey avatar Mar 28 '18 16:03 nmorey

It shouldn't. In the common case the same BTL will be selected by all processes and we should get OMPI_SUCCESS in rc. I can double-check once we finish service time on our systems.

hjelmn avatar Mar 28 '18 16:03 hjelmn

@nmorey @hjelmn @ggouaillardet There's been no new updates on here for months. Is this issue still happening at the HEAD of master / release branches?

jsquyres avatar Sep 14 '18 19:09 jsquyres

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions[bot] avatar Feb 16 '24 21:02 github-actions[bot]

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions[bot] avatar Mar 02 '24 01:03 github-actions[bot]