
RMA Functionality

Open bjpalmer opened this issue 4 years ago • 89 comments

I'm using Open MPI 4.0.1 installed from a system module. Our cluster is running RHEL6 with an InfiniBand network. The cores are Intel Xeon processors.

I’m curious about the status of RMA functionality in OpenMPI.

We've had an MPI-RMA-based runtime in Global Arrays for some time now, but it does not work well with OpenMPI. It does, however, operate reasonably well with other MPI implementations. I've run the GA test suite using the MPI-RMA runtime against different implementations of MPI-3, and the OpenMPI implementation performs significantly worse than the implementations from MPICH, Intel and MVAPICH. The current test suite contains 80 programs. Using Intel MPI/5.1.3.181 we get 3 failures, with MPICH 3.3.2 we get 8 failures (MPICH 3.4a2 reduces this to 3 failures), and MVAPICH 2.3.2 reports 9 failures. In comparison, OpenMPI 4.0.1 reports 47 failures. We do have a few tests that are problematic across all implementations, and we are looking at them from the GA side.

Our standard test configuration is to run on 4 processors split between two nodes. This forces at least some of the communication onto the switch. I can provide additional details about configuring and running the GA test suite, as well as which tests are failing, if you are interested.

Given that we get much better results with other implementations, we believe the large number of failures we are seeing with OpenMPI reflects problems in the OpenMPI implementation of the MPI-2 and MPI-3 RMA functions. Are there any plans to improve these implementations? We have alternative implementations based on traditional two-sided MPI that are robust and reasonably high performing, but we are still interested in using runtimes that sit directly on top of MPI RMA constructs. Are there plans going forward to aggressively improve the RMA implementations?
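For context, the kind of MPI-3 one-sided pattern such a runtime builds on looks roughly like the sketch below: a window exposed with MPI_Win_allocate, a passive-target epoch opened with MPI_Win_lock_all, and puts completed with MPI_Win_flush. This is illustrative standard MPI code, not the actual GA/ComEx runtime.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Each rank exposes 100 ints through an RMA window. */
    int *base;
    MPI_Win win;
    MPI_Win_allocate(100 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);

    /* Passive-target access to every rank for the lifetime of the window. */
    MPI_Win_lock_all(0, win);

    /* One-sided put of a single int into the next rank, then flush to
       guarantee remote completion -- the put/flush style a one-sided
       runtime typically relies on. */
    int value  = rank;
    int target = (rank + 1) % nproc;
    MPI_Put(&value, 1, MPI_INT, target, 0 /* displacement */, 1, MPI_INT, win);
    MPI_Win_flush(target, win);

    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("done\n");
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}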

bjpalmer avatar Jun 12 '20 19:06 bjpalmer

Can you share your test suite? The RMA implementation used by Open MPI depends on your setup. There are two implementations that support InfiniBand and one that provides basic functionality. If the latter (osc/pt2pt) is in use, failures wouldn't be surprising, as it is not as well tested. osc/rdma is well tested and provides high performance. osc/ucx still has some bugs but provides good performance.

hjelmn avatar Jun 12 '20 19:06 hjelmn

The test suite is part of GA, so if you download GA from our GitHub repository you will pick it up. The repository is at

git clone https://github.com/GlobalArrays/ga.git
git checkout develop

I've been configuring with the MPI-RMA runtime using the line

./configure --enable-i4 --enable-cxx --with-mpi3 \
  --prefix=/my/install/directory CC=mpicc CXX=mpicxx FC=mpif90 \
  CFLAGS="-g" CXXFLAGS="-g" FFLAGS="-g"

I usually don't build in the top level directory, so you'll need to modify the location of configure accordingly if you are building somewhere else. You don't actually need to do 'make install' to run the test suite, just make.

After running make, run the following command in the build directory to run the test suite

make check-ga MPIEXEC="mpirun -n 4 "

Our standard configuration is 4 processors on 2 nodes.

When I run the test suite, the tests mir_perf2.x, perf2.x and thread_perf_contig.x hang. To get the full test suite to run, I've been editing the mir_perf2.F, perf2.c and thread_perf_contig.c files and adding either a "stop" (Fortran) or a "return 0;" (C) at the start of the main routine. On our platform, the remaining tests run and either pass or crash.
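For the C tests, the edit is just a short-circuit at the top of main so the test exits immediately (illustrative only; the Fortran tests get a "stop" instead):

/* e.g., at the top of main() in global/testing/perf2.c */
int main(int argc, char **argv)
{
    return 0;   /* bail out before any GA/MPI calls so the rest of the suite can run */
    /* ... original test body follows ... */
}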

Let me know if you need more information.

bjpalmer avatar Jun 12 '20 20:06 bjpalmer

Forgot to mention: the source code for the test programs is in $GA/global/testing.

bjpalmer avatar Jun 12 '20 21:06 bjpalmer

Is there a way to separate out building the tests from running them? The system I'm using doesn't support building on the compute nodes.

hppritcha avatar Jun 15 '20 16:06 hppritcha

@janjust - FYI - let's test this example on the latest greatest UCX/OSC goodies targeting OMPI v5.0 release.

jladd-mlnx avatar Jun 15 '20 16:06 jladd-mlnx

Might be a Fortran binding issue. I don't know how well those are tested.

hjelmn avatar Jun 15 '20 16:06 hjelmn

If you type

make checkprogs

in the build directory, it should build all the executables in the test directory.

bjpalmer avatar Jun 15 '20 18:06 bjpalmer

Some of the Fortran tests pass and quite a few of the C tests fail. The underlying GA library is all written in C and the Fortran interface is a thin compatibility layer.

bjpalmer avatar Jun 15 '20 19:06 bjpalmer

Here's what I'm seeing on an aarch64 platform using gcc 9.3.0, ompi 4.0.4, and ucx 1.7:

PASS: ma/test-coalesce.x
PASS: ma/test-inquire.x
XFAIL: ma/testf.x
FAIL: global/testing/elempatch.x
PASS: global/testing/getmem.x
PASS: global/testing/mtest.x
PASS: global/testing/mulmatpatchc.x
PASS: global/testing/normc.x
PASS: global/testing/matrixc.x
PASS: global/testing/ntestc.x
PASS: global/testing/ntestfc.x
PASS: global/testing/packc.x
PASS: global/testing/patch_enumc.x
PASS: global/testing/print.x
PASS: global/testing/scan_addc.x
PASS: global/testing/scan_copyc.x
PASS: global/testing/testc.x
PASS: global/testing/testmatmultc.x
PASS: global/testing/testmult.x
PASS: global/testing/testmultrect.x
PASS: global/testing/gemmtest.x
PASS: global/testing/thread_perf_contig.x
PASS: global/testing/thread_perf_strided.x
PASS: global/testing/threadsafec.x
PASS: global/testing/read_only.x
PASS: global/testing/unpackc.x
PASS: global/testing/bin.x
PASS: global/testing/blktest.x
PASS: global/testing/g2test.x
PASS: global/testing/g3test.x
PASS: global/testing/ga_lu.x
PASS: global/testing/ga_shift.x
FAIL: global/testing/ghosts.x
PASS: global/testing/jacobi.x
PASS: global/testing/mir_perf2.x
PASS: global/testing/mmatrix.x
PASS: global/testing/mulmatpatch.x
PASS: global/testing/nbtest.x
PASS: global/testing/nb2test.x
PASS: global/testing/ndim.x
PASS: global/testing/patch.x
PASS: global/testing/patch2.x
PASS: global/testing/patch_enumf.x
PASS: global/testing/perfmod.x
PASS: global/testing/perform.x
PASS: global/testing/perf.x
PASS: global/testing/perf2.x
FAIL: global/testing/pg2test.x
FAIL: global/testing/pgtest.x
FAIL: global/testing/scan.x
PASS: global/testing/simple_groups.x
PASS: global/testing/sparse.x
PASS: global/testing/sprsmatmult.x
PASS: global/testing/stride.x
PASS: global/testing/testeig.x
PASS: global/testing/testmatmult.x
PASS: global/testing/testsolve.x
FAIL: global/testing/test.x
PASS: global/testing/simple_groups_comm.x
PASS: global/testing/ga-mpi.x
PASS: global/testing/lock.x
PASS: global/testing/simple_groups_commc.x
PASS: global/testing/nga-onesided.x
PASS: global/testing/nga-patch.x
PASS: global/testing/nga-periodic.x
PASS: global/testing/nga-scatter.x
PASS: global/testing/nga-util.x
PASS: global/testing/ngatest.x
PASS: global/examples/lennard-jones/lennard.x
PASS: global/examples/boltzmann/boltz.x
PASS: ga++/testing/elempatch.x
PASS: ga++/testing/mtest.x
PASS: ga++/testing/ntestc.x
PASS: ga++/testing/testc.x
PASS: ga++/testing/testmult.x
FAIL: ga++/testing/threadsafecpp.x
=================================
7 of 77 tests failed
See ./test-suite.log
Please report to [email protected]
=================================

I used the configure options given in an above comment. I had

export OMPI_MCA_osc=ucx
export OMPI_MCA_pml=ucx

set in my shell.

hppritcha avatar Jun 15 '20 19:06 hppritcha

@hppritcha Can you run with osc/rdma as a sanity check?

hjelmn avatar Jun 15 '20 20:06 hjelmn

Out of curiosity, I ran the tests on my local shared-memory machine using current master and osc/rdma. I see 42 tests failing, most (all?) of which seem to end up in a similar segfault:

Thread 1 "mtest.x" received signal SIGSEGV, Segmentation fault.
0x00007fffe76e0c3a in opal_thread_add_fetch_32 (addr=0x1, delta=-1) at ../../../../../opal/mca/threads/thread_usage.h:156
156     OPAL_THREAD_DEFINE_ATOMIC_OP(int32_t, add, +, 32)
(gdb) bt
#0  0x00007fffe76e0c3a in opal_thread_add_fetch_32 (addr=0x1, delta=-1) at ../../../../../opal/mca/threads/thread_usage.h:156
#1  0x00007fffe76e1729 in wait_sync_update (sync=0x1, updates=1, status=0) at ../../../../../opal/mca/threads/wait_sync.h:43
#2  0x00007fffe76e1888 in ompi_request_complete (request=0x555555aacd00, with_signal=true) at ../../../../../ompi/request/request.h:454
#3  0x00007fffe76e23b5 in ompi_osc_rdma_request_complete (request=0x555555aacd00, mpi_error=0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_request.h:111
#4  0x00007fffe76e3e1d in ompi_osc_rdma_put_complete (btl=0x7fffebdf8220 <mca_btl_sm>, endpoint=0x555555a25e90, local_address=0x555555a59800, local_handle=0x0, context=0x555555aacd01, data=0x0, status=0)
    at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:395
#5  0x00007fffebbf2f16 in mca_btl_sm_put_cma (btl=0x7fffebdf8220 <mca_btl_sm>, endpoint=0x555555a25e90, local_address=0x555555a59800, remote_address=93824997456812, local_handle=0x0, remote_handle=0x0, size=4, flags=0, order=255, 
    cbfunc=0x7fffe76e3d9a <ompi_osc_rdma_put_complete>, cbcontext=0x555555aacd01, cbdata=0x0) at ../../../../../opal/mca/btl/sm/btl_sm_put.c:87
#6  0x00007fffe76e4095 in ompi_osc_rdma_put_real (sync=0x555555a4e9b0, peer=0x555555a316f0, target_address=93824997456812, target_handle=0x0, ptr=0x555555a59800, local_handle=0x0, size=4, cb=0x7fffe76e3d9a <ompi_osc_rdma_put_complete>, 
    context=0x555555aacd01, cbdata=0x0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:451
#7  0x00007fffe76e4342 in ompi_osc_rdma_put_contig (sync=0x555555a4e9b0, peer=0x555555a316f0, target_address=93824997456812, target_handle=0x0, source_buffer=0x555555a59800, size=4, request=0x555555aacd00)
    at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:522
#8  0x00007fffe76e3651 in ompi_osc_rdma_master_noncontig (sync=0x555555a4e9b0, local_address=0x555555a59804, local_count=1, local_datatype=0x555555995e20, peer=0x555555a316f0, remote_address=93824997447616, remote_handle=0x0, remote_count=1, 
    remote_datatype=0x555555a85f10, request=0x555555aacd00, max_rdma_len=18446744073709551615, rdma_fn=0x7fffe76e414e <ompi_osc_rdma_put_contig>, alloc_reqs=false) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:264
#9  0x00007fffe76e3cdc in ompi_osc_rdma_master (sync=0x555555a4e9b0, local_address=0x555555a59804, local_count=1, local_datatype=0x555555995e20, peer=0x555555a316f0, remote_address=93824997447616, remote_handle=0x0, remote_count=1, 
    remote_datatype=0x555555a85f10, request=0x555555aacd00, max_rdma_len=18446744073709551615, rdma_fn=0x7fffe76e414e <ompi_osc_rdma_put_contig>, alloc_reqs=false) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:358
#10 0x00007fffe76e4ee9 in ompi_osc_rdma_put_w_req (sync=0x555555a4e9b0, origin_addr=0x555555a59804, origin_count=1, origin_datatype=0x555555995e20, peer=0x555555a316f0, target_disp=0, target_count=1, target_datatype=0x555555a85f10, 
    request=0x555555aacd00) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:763
#11 0x00007fffe76e53b9 in ompi_osc_rdma_rput (origin_addr=0x555555a59804, origin_count=1, origin_datatype=0x555555995e20, target_rank=0, target_disp=0, target_count=1, target_datatype=0x555555a85f10, win=0x555555a4e240, 
    request=0x7fffffffcf30) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:851
#12 0x00007ffff6794eb4 in PMPI_Rput (origin_addr=0x555555a59804, origin_count=1, origin_datatype=0x555555995e20, target_rank=0, target_disp=0, target_count=1, target_datatype=0x555555a85f10, win=0x555555a4e240, request=0x7fffffffcf30)
    at prput.c:87
#13 0x000055555566b51c in comex_putv (iov=0x555555a70500, iov_len=1, proc=0, group=0) at ../../comex/src-mpi3/comex.c:1455
#14 0x00005555556652f2 in PARMCI_PutV (darr=0x7fffffffd170, len=1, proc=0) at ../../comex/src-armci/armci.c:701
#15 0x0000555555668900 in ARMCI_PutV (darr=0x7fffffffd170, len=1, proc=0) at ../../comex/src-armci/capi.c:431
#16 0x0000555555605e28 in gai_gatscat_new (op=-99, g_a=-1000, v=0x555555a584b0, subscript=0x555555a53680, c_flag=1, nv=2500, locbytes=0x55555589dc50 <GAbytes+80>, totbytes=0x55555589dc58 <GAbytes+88>, alpha=0x0)
    at ../global/src/onesided.c:3559
#17 0x0000555555606328 in pnga_scatter (g_a=-1000, v=0x555555a584b0, subscript=0x555555a53680, c_flag=1, nv=2500) at ../global/src/onesided.c:3633
#18 0x0000555555566ad1 in NGA_Scatter (g_a=-1000, v=0x555555a584b0, subsArray=0x555555a53680, n=2500) at ../global/src/capi.c:3005
#19 0x0000555555557976 in main (argc=1, argv=0x7fffffffd6b8) at ../global/testing/mtest.c:91

Attaching a debugger, it seems that ompi_request_complete is called twice on the same request object, which leads to the Segfault in the second call:

First invocation:

Thread 1 "mtest.x" hit Breakpoint 2, ompi_request_complete (request=0x555555aacd00, with_signal=true) at ../../../../../ompi/request/request.h:437
437     {
(gdb) bt
#0  ompi_request_complete (request=0x555555aacd00, with_signal=true) at ../../../../../ompi/request/request.h:437
#1  0x00007fffe76e23b5 in ompi_osc_rdma_request_complete (request=0x555555aacd00, mpi_error=0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_request.h:111
#2  0x00007fffe76e3e1d in ompi_osc_rdma_put_complete (btl=0x7fffebdf8220 <mca_btl_sm>, endpoint=0x555555a25e90, local_address=0x555555a59804, local_handle=0x0, 
    context=0x555555aacd01, data=0x0, status=0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:395
#3  0x00007fffebbf2f16 in mca_btl_sm_put_cma (btl=0x7fffebdf8220 <mca_btl_sm>, endpoint=0x555555a25e90, local_address=0x555555a59804, remote_address=93824997457612, 
    local_handle=0x0, remote_handle=0x0, size=4, flags=0, order=255, cbfunc=0x7fffe76e3d9a <ompi_osc_rdma_put_complete>, cbcontext=0x555555aacd01, cbdata=0x0)
    at ../../../../../opal/mca/btl/sm/btl_sm_put.c:87
#4  0x00007fffe76e4095 in ompi_osc_rdma_put_real (sync=0x555555a4e9b0, peer=0x555555a316f0, target_address=93824997457612, target_handle=0x0, ptr=0x555555a59804, local_handle=0x0, 
    size=4, cb=0x7fffe76e3d9a <ompi_osc_rdma_put_complete>, context=0x555555aacd01, cbdata=0x0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:451
#5  0x00007fffe76e4342 in ompi_osc_rdma_put_contig (sync=0x555555a4e9b0, peer=0x555555a316f0, target_address=93824997457612, target_handle=0x0, source_buffer=0x555555a59804, size=4, 
    request=0x555555aacd00) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:522
#6  0x00007fffe76e3651 in ompi_osc_rdma_master_noncontig (sync=0x555555a4e9b0, local_address=0x555555a59804, local_count=1, local_datatype=0x555555995e20, peer=0x555555a316f0, 
    remote_address=93824997447616, remote_handle=0x0, remote_count=1, remote_datatype=0x555555a85f10, request=0x555555aacd00, max_rdma_len=18446744073709551615, 
    rdma_fn=0x7fffe76e414e <ompi_osc_rdma_put_contig>, alloc_reqs=false) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:264
#7  0x00007fffe76e3cdc in ompi_osc_rdma_master (sync=0x555555a4e9b0, local_address=0x555555a59804, local_count=1, local_datatype=0x555555995e20, peer=0x555555a316f0, 
    remote_address=93824997447616, remote_handle=0x0, remote_count=1, remote_datatype=0x555555a85f10, request=0x555555aacd00, max_rdma_len=18446744073709551615, 
    rdma_fn=0x7fffe76e414e <ompi_osc_rdma_put_contig>, alloc_reqs=false) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:358
#8  0x00007fffe76e4ee9 in ompi_osc_rdma_put_w_req (sync=0x555555a4e9b0, origin_addr=0x555555a59804, origin_count=1, origin_datatype=0x555555995e20, peer=0x555555a316f0, 
    target_disp=0, target_count=1, target_datatype=0x555555a85f10, request=0x555555aacd00) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:763
#9  0x00007fffe76e53b9 in ompi_osc_rdma_rput (origin_addr=0x555555a59804, origin_count=1, origin_datatype=0x555555995e20, target_rank=0, target_disp=0, target_count=1, 
    target_datatype=0x555555a85f10, win=0x555555a4e240, request=0x7fffffffcf30) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:851
#10 0x00007ffff6794eb4 in PMPI_Rput (origin_addr=0x555555a59804, origin_count=1, origin_datatype=0x555555995e20, target_rank=0, target_disp=0, target_count=1, 
    target_datatype=0x555555a85f10, win=0x555555a4e240, request=0x7fffffffcf30) at prput.c:87
#11 0x000055555566b51c in comex_putv (iov=0x555555a70500, iov_len=1, proc=0, group=0) at ../../comex/src-mpi3/comex.c:1455
#12 0x00005555556652f2 in PARMCI_PutV (darr=0x7fffffffd170, len=1, proc=0) at ../../comex/src-armci/armci.c:701
#13 0x0000555555668900 in ARMCI_PutV (darr=0x7fffffffd170, len=1, proc=0) at ../../comex/src-armci/capi.c:431
#14 0x0000555555605e28 in gai_gatscat_new (op=-99, g_a=-1000, v=0x555555a584b0, subscript=0x555555a53680, c_flag=1, nv=2500, locbytes=0x55555589dc50 <GAbytes+80>, 
    totbytes=0x55555589dc58 <GAbytes+88>, alpha=0x0) at ../global/src/onesided.c:3559
#15 0x0000555555606328 in pnga_scatter (g_a=-1000, v=0x555555a584b0, subscript=0x555555a53680, c_flag=1, nv=2500) at ../global/src/onesided.c:3633
#16 0x0000555555566ad1 in NGA_Scatter (g_a=-1000, v=0x555555a584b0, subsArray=0x555555a53680, n=2500) at ../global/src/capi.c:3005
#17 0x0000555555557976 in main (argc=1, argv=0x7fffffffd6b8) at ../global/testing/mtest.c:91

Second invocation:

Thread 1 "mtest.x" hit Breakpoint 2, ompi_request_complete (request=0x555555aacd00, with_signal=true) at ../../../../../ompi/request/request.h:437
437     {
(gdb) bt
#0  ompi_request_complete (request=0x555555aacd00, with_signal=true) at ../../../../../ompi/request/request.h:437
#1  0x00007fffe76e23b5 in ompi_osc_rdma_request_complete (request=0x555555aacd00, mpi_error=0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_request.h:111
#2  0x00007fffe76e3e1d in ompi_osc_rdma_put_complete (btl=0x7fffebdf8220 <mca_btl_sm>, endpoint=0x555555a25e90, local_address=0x555555a59800, local_handle=0x0, 
    context=0x555555aacd01, data=0x0, status=0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:395
#3  0x00007fffebbf2f16 in mca_btl_sm_put_cma (btl=0x7fffebdf8220 <mca_btl_sm>, endpoint=0x555555a25e90, local_address=0x555555a59800, remote_address=93824997456812, 
    local_handle=0x0, remote_handle=0x0, size=4, flags=0, order=255, cbfunc=0x7fffe76e3d9a <ompi_osc_rdma_put_complete>, cbcontext=0x555555aacd01, cbdata=0x0)
    at ../../../../../opal/mca/btl/sm/btl_sm_put.c:87
#4  0x00007fffe76e4095 in ompi_osc_rdma_put_real (sync=0x555555a4e9b0, peer=0x555555a316f0, target_address=93824997456812, target_handle=0x0, ptr=0x555555a59800, local_handle=0x0, 
    size=4, cb=0x7fffe76e3d9a <ompi_osc_rdma_put_complete>, context=0x555555aacd01, cbdata=0x0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:451
#5  0x00007fffe76e4342 in ompi_osc_rdma_put_contig (sync=0x555555a4e9b0, peer=0x555555a316f0, target_address=93824997456812, target_handle=0x0, source_buffer=0x555555a59800, size=4, 
    request=0x555555aacd00) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:522
#6  0x00007fffe76e3651 in ompi_osc_rdma_master_noncontig (sync=0x555555a4e9b0, local_address=0x555555a59804, local_count=1, local_datatype=0x555555995e20, peer=0x555555a316f0, 
    remote_address=93824997447616, remote_handle=0x0, remote_count=1, remote_datatype=0x555555a85f10, request=0x555555aacd00, max_rdma_len=18446744073709551615, 
    rdma_fn=0x7fffe76e414e <ompi_osc_rdma_put_contig>, alloc_reqs=false) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:264
#7  0x00007fffe76e3cdc in ompi_osc_rdma_master (sync=0x555555a4e9b0, local_address=0x555555a59804, local_count=1, local_datatype=0x555555995e20, peer=0x555555a316f0, 
    remote_address=93824997447616, remote_handle=0x0, remote_count=1, remote_datatype=0x555555a85f10, request=0x555555aacd00, max_rdma_len=18446744073709551615, 
    rdma_fn=0x7fffe76e414e <ompi_osc_rdma_put_contig>, alloc_reqs=false) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:358
#8  0x00007fffe76e4ee9 in ompi_osc_rdma_put_w_req (sync=0x555555a4e9b0, origin_addr=0x555555a59804, origin_count=1, origin_datatype=0x555555995e20, peer=0x555555a316f0, 
    target_disp=0, target_count=1, target_datatype=0x555555a85f10, request=0x555555aacd00) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:763
#9  0x00007fffe76e53b9 in ompi_osc_rdma_rput (origin_addr=0x555555a59804, origin_count=1, origin_datatype=0x555555995e20, target_rank=0, target_disp=0, target_count=1, 
    target_datatype=0x555555a85f10, win=0x555555a4e240, request=0x7fffffffcf30) at ../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c:851
#10 0x00007ffff6794eb4 in PMPI_Rput (origin_addr=0x555555a59804, origin_count=1, origin_datatype=0x555555995e20, target_rank=0, target_disp=0, target_count=1, 
    target_datatype=0x555555a85f10, win=0x555555a4e240, request=0x7fffffffcf30) at prput.c:87
#11 0x000055555566b51c in comex_putv (iov=0x555555a70500, iov_len=1, proc=0, group=0) at ../../comex/src-mpi3/comex.c:1455
#12 0x00005555556652f2 in PARMCI_PutV (darr=0x7fffffffd170, len=1, proc=0) at ../../comex/src-armci/armci.c:701
#13 0x0000555555668900 in ARMCI_PutV (darr=0x7fffffffd170, len=1, proc=0) at ../../comex/src-armci/capi.c:431
#14 0x0000555555605e28 in gai_gatscat_new (op=-99, g_a=-1000, v=0x555555a584b0, subscript=0x555555a53680, c_flag=1, nv=2500, locbytes=0x55555589dc50 <GAbytes+80>, 
    totbytes=0x55555589dc58 <GAbytes+88>, alpha=0x0) at ../global/src/onesided.c:3559
#15 0x0000555555606328 in pnga_scatter (g_a=-1000, v=0x555555a584b0, subscript=0x555555a53680, c_flag=1, nv=2500) at ../global/src/onesided.c:3633
#16 0x0000555555566ad1 in NGA_Scatter (g_a=-1000, v=0x555555a584b0, subsArray=0x555555a53680, n=2500) at ../global/src/capi.c:3005
#17 0x0000555555557976 in main (argc=1, argv=0x7fffffffd6b8) at ../global/testing/mtest.c:91
(gdb) print *request
$1 = {super = {super = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x7fffe7911b80 <ompi_osc_rdma_request_t_class>, obj_reference_count = 1, 
        cls_init_file_name = 0x7fffe7705fb8 "../../../../../ompi/mca/osc/rdma/osc_rdma_comm.c", cls_init_lineno = 847}, opal_list_next = 0x0, opal_list_prev = 0x0, item_free = 1, 
      opal_list_item_refcount = 0, opal_list_item_belong_to = 0x0}, registration = 0x0, ptr = 0x0}, req_type = OMPI_REQUEST_WIN, req_status = {MPI_SOURCE = 0, MPI_TAG = 0, 
    MPI_ERROR = 0, _cancelled = 0, _ucount = 0}, req_complete = 0x1, req_state = OMPI_REQUEST_ACTIVE, req_persistent = false, req_f_to_c_index = -32766, req_start = 0x0, 
  req_free = 0x7fffe76f2f2e <request_free>, req_cancel = 0x7fffe76f2f1c <request_cancel>, req_complete_cb = 0x0, req_complete_cb_data = 0x0, req_mpi_object = {comm = 0x555555a4e240, 
    file = 0x555555a4e240, win = 0x555555a4e240}}

Note that in the second case req_complete is 0x1, which is eventually treated as a sync object and dereferenced.
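The failure mode is easier to see in isolation. In a simplified model of the request-completion protocol (a standalone sketch, not Open MPI's actual code), req_complete holds either a sentinel or a pointer to a sync object a waiter is blocked on; completing the same request twice hands the already-stored sentinel to the wake-up path, which then dereferences it:

#include <stdatomic.h>

#define REQ_PENDING   ((void *)0)   /* no waiter attached yet */
#define REQ_COMPLETED ((void *)1)   /* sentinel: already completed */

typedef struct { _Atomic int count; } sync_t;
typedef struct { _Atomic(void *) complete; } request_t;

/* Simplified wake-up path: decrement the waiter's counter. */
static void sync_update(sync_t *sync) { atomic_fetch_sub(&sync->count, 1); }

/* Simplified completion: swap in the COMPLETED sentinel, wake any waiter. */
static void request_complete(request_t *req)
{
    void *prev = atomic_exchange(&req->complete, REQ_COMPLETED);
    if (prev != REQ_PENDING) {
        /* On a *second* call, prev is REQ_COMPLETED (0x1); treating it as a
           sync_t pointer dereferences address 0x1 -> SIGSEGV, matching
           wait_sync_update(sync=0x1, ...) in the backtrace above. */
        sync_update((sync_t *)prev);
    }
}

int main(void)
{
    request_t req = { .complete = REQ_PENDING };
    request_complete(&req);   /* first completion: fine */
    request_complete(&req);   /* double completion: crashes in sync_update */
    return 0;
}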

I haven't tried to understand why this happens, might dig into that later today.

devreal avatar Jun 16 '20 10:06 devreal

We have vanilla builds of UCX 1.4.0 and 1.8.0 on our machine. I can try running with that to see if I can reproduce @hppritcha's results. What exactly do you need to do to run with UCX? Put the libraries in your LD_LIBRARY_PATH and set the environment variables?

bjpalmer avatar Jun 16 '20 14:06 bjpalmer

@bjpalmer you need to add --with-ucx=path_to_ucx_1.8.0_install to the OMPI configure line and rebuild. Then, just to be certain, set these env. variables in your shell before running make check:

export OMPI_MCA_osc=ucx
export OMPI_MCA_pml=ucx

hppritcha avatar Jun 16 '20 15:06 hppritcha

@bjpalmer Can these tests potentially hang? Or is there a mechanism to kill jobs that are taking too long - mine seem to be stuck on ga_shift running ompi-master with UCX-v1.8 (both ompi and ucx in debug), but there were other tests that took a while but eventually passed.

janjust avatar Jun 16 '20 15:06 janjust

Some of them can definitely hang. On our system running 4.0.1, the mir_perf2.x, perf2.x and thread_perf_contig.x tests were hanging. I haven't seen any hangs in ga_shift.x. Some of the tests take a significant amount of time before finishing. I don't know of a way to kill tests that hang while the test suite is running. I usually just edit the source files (they are in GA_DIR/global/testing) and put in a stop or return 0 so the test exits before starting. You might be able to get a running test to fail by logging on to one of the nodes it is executing on and killing one of the processes manually.

bjpalmer avatar Jun 16 '20 16:06 bjpalmer

@bjpalmer, I made a mistake, it's not ga_shift.x that hangs, it's ghosts.x which seems to be a multi-threaded code. The same test failed for @hppritcha.

janjust avatar Jun 16 '20 16:06 janjust

I ran using Open MPI 4.0.4 built with UCX and with the OMPI_MCA_osc and OMPI_MCA_pml environment variables set to ucx. I had two hung processes and a total of 27 failures. I'm not doing as well as @hppritcha, and while this is still a lot of failures, it is a considerable improvement over the previous run using Open MPI 4.0.1. I'm compiling with gcc/7.3.0 and using UCX 1.8.0.

bjpalmer avatar Jun 16 '20 16:06 bjpalmer

With the patch in #7829 all tests run successfully on my local shared memory system using osc/rdma. Can someone test osc/rdma on multiple nodes? (I don't have access to a non-IB machine anymore)

devreal avatar Jun 16 '20 17:06 devreal

I ran the test suite using osc/ucx on my local system and see a number of tests failing. It seems that UCX 1.8.0 is complaining about misaligned variables in all failed tests, e.g.:

> Checking disjoint put ... 
[1592329233.441781] [beryl:9227 :0]       amo_send.c:128  UCX  ERROR atomic variable must be naturally aligned (remote address 0x55d5ab64370c, size 8)
[1592329233.441839] [beryl:9227 :0]       amo_send.c:128  UCX  ERROR atomic variable must be naturally aligned (remote address 0x55d5ab643744, size 8)
[1592329233.441868] [beryl:9227 :0]       amo_send.c:128  UCX  ERROR atomic variable must be naturally aligned (remote address 0x55d5ab64377c, size 8)
[1592329233.441895] [beryl:9227 :0]       amo_send.c:128  UCX  ERROR atomic variable must be naturally aligned (remote address 0x55d5ab6437b4, size 8)
[1592329233.441923] [beryl:9227 :0]       amo_send.c:128  UCX  ERROR atomic variable must be naturally aligned (remote address 0x55d5ab6437ec, size 8)
[1592329233.441964] [beryl:9227 :0]       amo_send.c:128  UCX  ERROR atomic variable must be naturally aligned (remote address 0x559bb8e6db04, size 8)

AFAICS, the MPI standard does not mandate any alignment for user-provided input buffers and only mentions alignment in the context of performance (aligned memory may lead to better performance). This looks like a bug in osc/ucx to me.
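For reference, the triggering call on the GA side is an ordinary char-typed Rget at a byte displacement that is not a multiple of 8, something like the fragment below (win, the target rank and the displacement are illustrative; the window is assumed to be created over char data with disp_unit = 1, inside an MPI_Win_lock_all epoch):

/* Legal MPI: a get of 4 chars at a byte displacement that is not a
 * multiple of 8.  The UCX error above comes from osc/ucx internally
 * issuing an 8-byte atomic at that same (misaligned) target address. */
char local_buf[4];
MPI_Request req;
MPI_Rget(local_buf, 4, MPI_CHAR,
         1 /* target rank */, 199996 /* target_disp, not 8-byte aligned */,
         4, MPI_CHAR, win, &req);
MPI_Wait(&req, MPI_STATUS_IGNORE);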

devreal avatar Jun 16 '20 17:06 devreal

This gets more and more interesting: the UCX error is not caused by the application itself using misaligned target offsets, but by osc/ucx issuing a 64-bit atomic fetch-add inside an Rget to acquire a request:

(gdb) bt
#0  ucs_log_dispatch (file=file@entry=0x7fffeb19da70 "../../../src/ucp/rma/amo_send.c", line=line@entry=128, 
    function=function@entry=0x7fffeb19de20 <__func__.16080> "ucp_atomic_fetch_nb", level=level@entry=UCS_LOG_LEVEL_ERROR, 
    format=format@entry=0x7fffeb19db18 "atomic variable must be naturally aligned (remote address 0x%lx, size %zu)")
    at ../../../src/ucs/debug/log.c:181
#1  0x00007fffeb15800f in ucp_atomic_fetch_nb (ep=<optimized out>, opcode=<optimized out>, value=<optimized out>, 
    result=0x55555659a258, op_size=8, remote_addr=93825006787100, rkey=0x555555d2d380, 
    cb=0x7fffeacd9715 <opal_common_ucx_req_completion>) at ../../../src/ucp/rma/amo_send.c:126
#2  0x00007fffe7d2a48f in opal_common_ucx_atomic_fetch_nb (ep=0x7fffe3d38ba8, opcode=UCP_ATOMIC_FETCH_OP_FADD, value=0, 
    result=0x55555659a258, op_size=8, remote_addr=93825006787100, rkey=0x555555d2d380, 
    req_handler=0x7fffeacd9715 <opal_common_ucx_req_completion>, worker=0x555555cc1350)
    at ../../../../../opal/mca/common/ucx/common_ucx.h:199
#3  0x00007fffe7d2b419 in opal_common_ucx_wpmem_fetch_nb (mem=0x5555565b0600, opcode=UCP_ATOMIC_FETCH_OP_FADD, value=0, 
    target=1, buffer=0x55555659a258, len=8, rem_addr=93825006787100, user_req_cb=0x7fffe7d30e54 <req_completion>, 
    user_req_ptr=0x5555559b5400) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.h:519
#4  0x00007fffe7d2dfd1 in ompi_osc_ucx_rget (origin_addr=0x55555642a090, origin_count=4, 
    origin_dt=0x555555825ba0 <ompi_mpi_char>, target=1, target_disp=199996, target_count=4, 
    target_dt=0x555555825ba0 <ompi_mpi_char>, win=0x5555565f5470, request=0x7fffffffce30)
    at ../../../../../ompi/mca/osc/ucx/osc_ucx_comm.c:868
#5  0x00007ffff6793f1b in PMPI_Rget (origin_addr=0x55555642a090, origin_count=4, 
    origin_datatype=0x555555825ba0 <ompi_mpi_char>, target_rank=1, target_disp=199996, target_count=4, 
    target_datatype=0x555555825ba0 <ompi_mpi_char>, win=0x5555565f5470, request=0x7fffffffce30) at prget.c:84
#6  0x00005555555fce2c in comex_nbget (src=src@entry=0x555556335a1c, dst=0x55555642a090, dst@entry=0x205d349e1eb45f00, 
    bytes=4, proc=proc@entry=1, group=group@entry=0, hdl=hdl@entry=0x7fffffffcf04) at ../../comex/src-mpi3/comex.c:2212
#7  0x00005555555f6a22 in PARMCI_NbGetS (src_ptr=0x555556335a1c, src_stride_arr=src_stride_arr@entry=0x7fffffffcf44, 
    dst_ptr=0x205d349e1eb45f00, dst_ptr@entry=0x55555642a090, dst_stride_arr=dst_stride_arr@entry=0x7fffffffcf64, 
    count=count@entry=0x7fffffffcf84, stride_levels=stride_levels@entry=0, proc=1, nb_handle=0x7fffffffcf04)
    at ../../comex/src-armci/armci.c:579
#8  0x00005555555f96e5 in ARMCI_NbGetS (src_ptr=<optimized out>, src_stride_arr=src_stride_arr@entry=0x7fffffffcf44, 
    dst_ptr=dst_ptr@entry=0x55555642a090, dst_stride_arr=dst_stride_arr@entry=0x7fffffffcf64, 
    count=count@entry=0x7fffffffcf84, stride_levels=stride_levels@entry=0, proc=1, nb_handle=0x7fffffffcf04)
    at ../../comex/src-armci/capi.c:314
#9  0x00005555555b8646 in ngai_nbgets (nbhandle=0x7fffffffcf04, type_size=4, field_size=-1, field_off=0, 
    proc=<optimized out>, nstrides=0, count=0x7fffffffcf84, stride_loc=0x7fffffffcf64, pbuf=0x55555642a090 "", 
    stride_rem=0x7fffffffcf44, prem=<optimized out>, loc_base_ptr=0x55555642a090 "") at ../global/src/onesided.c:443
---Type <return> to continue, or q <return> to quit---
#10 ngai_gets (type_size=4, field_size=-1, field_off=0, proc=<optimized out>, nstrides=0, count=0x7fffffffcf84, 
    stride_loc=0x7fffffffcf64, pbuf=0x55555642a090 "", stride_rem=0x7fffffffcf44, prem=<optimized out>, 
    loc_base_ptr=0x55555642a090 "") at ../global/src/onesided.c:496
#11 ngai_get_common (g_a=g_a@entry=-1000, lo=lo@entry=0x7fffffffd46c, hi=hi@entry=0x7fffffffd46c, 
    buf=buf@entry=0x55555642a090, ld=ld@entry=0x0, field_off=field_off@entry=0, field_size=-1, nbhandle=0x0)
    at ../global/src/onesided.c:1022
#12 0x00005555555b8dd8 in pnga_get (g_a=g_a@entry=-1000, lo=lo@entry=0x7fffffffd46c, hi=hi@entry=0x7fffffffd46c, 
    buf=buf@entry=0x55555642a090, ld=ld@entry=0x0) at ../global/src/onesided.c:1166
#13 0x00005555555d9de9 in pnga_scan_add (g_src=-1000, g_dst=-999, g_msk=-998, lo=<optimized out>, hi=<optimized out>, 
    excl=<optimized out>) at ../global/src/types2.xh:1
#14 0x000055555555786f in test_scan_add_C_INT_C_INT (q=2, excl=<optimized out>, lhi=<optimized out>, llo=<optimized out>)
    at ../global/testing/scan_addc.c:156
#15 main (argc=<optimized out>, argv=<optimized out>) at ../global/testing/scan_addc.c:246

The problematic code line is this: https://github.com/open-mpi/ompi/blob/master/ompi/mca/osc/ucx/osc_ucx_comm.c#L868

The target address is derived from the offset the user provided, which may not be 64-bit aligned, e.g., for a get with MPI_INT.
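Concretely, with illustrative values matching the backtrace above (char window, user displacement 199996; the base address below is hypothetical):

/* Window created over char data, base assumed 8-byte aligned. */
uint64_t win_base    = 0x555555a00000;  /* hypothetical base address */
uint64_t disp_unit   = 1;
uint64_t target_disp = 199996;          /* user-provided displacement from the Rget */

/* Address osc/ucx then targets with its 64-bit atomic fetch-add: */
uint64_t remote_addr = win_base + disp_unit * target_disp;
/* 199996 % 8 == 4, so remote_addr is 4-byte but not 8-byte aligned --
 * exactly the condition UCX's alignment check rejects. */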

@janjust @artpol84 Is there no other way to get a request object from UCX than through an atomic operation? Maybe that atomic operation should target an address we know has correct alignment?

devreal avatar Jun 16 '20 19:06 devreal

@bjpalmer Out of curiosity, is your UCX compiled in multi-threaded mode? Run ./ucx_info -v | grep enable-mt; you should see the --enable-mt configure flag set. I think that's the reason behind the hangs and subsequent failures.

I ran the tests with ompi-v4.0.4rc3 ucx-v1.7.x and ucx-v1.8.x and I see: 7/80 and 3/80 failures, respectively.

The common failures were:

FAIL: global/testing/scan.x
FAIL: global/testing/test.x
FAIL: global/testing/overlay.x

ucx-v1.7.x failed in:

FAIL: ga++/testing/threadsafecpp.x
FAIL: global/testing/thread_perf_contig.x
FAIL: global/testing/thread_perf_strided.x
FAIL: global/testing/threadsafec.x

but ucx-v1.8.x passed.

Out of curiosity, what were the 3 failures in intel, and mpich?

janjust avatar Jun 17 '20 04:06 janjust

I tried the benchmark with the modified osc/ucx in PR #6980 and I see the same three tests failing after applying the following patch to fix the rput/rget issue:

@@ -1074,7 +1075,7 @@ int ompi_osc_ucx_rput(const void *origin_addr, int origin_count,
     mca_osc_ucx_component.num_incomplete_req_ops++;
     ret = opal_common_ucx_wpmem_fetch_nb(module->mem, UCP_ATOMIC_FETCH_OP_FADD,
                                          0, target, &(module->req_result),
-                                         sizeof(uint64_t), remote_addr,
+                                         sizeof(uint64_t), remote_addr & ~0x7UL,
                                          req_completion, ucx_req);
     if (ret != OMPI_SUCCESS) {
         OMPI_OSC_UCX_REQUEST_RETURN(ucx_req);
@@ -1126,7 +1127,7 @@ int ompi_osc_ucx_rget(void *origin_addr, int origin_count,
     mca_osc_ucx_component.num_incomplete_req_ops++;
     ret = opal_common_ucx_wpmem_fetch_nb(module->mem, UCP_ATOMIC_FETCH_OP_FADD,
                                          0, target, &(module->req_result),
-                                         sizeof(uint64_t), remote_addr,
+                                         sizeof(uint64_t), remote_addr & ~0x7UL,
                                          req_completion, ucx_req);

I dug into the test.x case a bit and found that osc/ucx does not correctly handle indexed datatypes if the datatype has overlapping entries. In the benchmark, some values may be the target of multiple updates in the same accumulate operation. That does not work well with the get-update-put mechanism used in osc/ucx, as every value in the local copy is updated only once, so the target values are overwritten instead of being updated (i.e., incremented) twice.
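A standalone sketch of that pattern (not the GA test itself; the indexed target datatype references element 0 twice, so an MPI_SUM accumulate should apply both contributions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *base;
    MPI_Win win;
    MPI_Win_allocate(4 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    base[0] = 10;  /* initial value on every rank */

    /* Target datatype with an overlapping entry: element 0 appears twice. */
    int blocklens[2]     = {1, 1};
    int displacements[2] = {0, 0};
    MPI_Datatype overlap_type;
    MPI_Type_indexed(2, blocklens, displacements, MPI_INT, &overlap_type);
    MPI_Type_commit(&overlap_type);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int contrib[2] = {1, 2};
        /* Expected with MPI_SUM: target[0] = 10 + 1 + 2 = 13.
         * A get-update-put scheme applies 10+1 and 10+2 against the fetched
         * copy, and the second put wins, leaving 12. */
        MPI_Accumulate(contrib, 2, MPI_INT, 0, 0, 1, overlap_type, MPI_SUM, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 0) printf("target[0] = %d (expected 13)\n", base[0]);

    MPI_Type_free(&overlap_type);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}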

devreal avatar Jun 17 '20 12:06 devreal

@devreal do you mean that master with #6980 fails on 3 tests?

artpol84 avatar Jun 17 '20 14:06 artpol84

@devreal do you mean that master with #6980 fails on 3 tests?

I would be shocked; I'm consistently getting 27/80 failures with master.

janjust avatar Jun 17 '20 14:06 janjust

Sorry, that wasn't entirely clear: both master and #6980 fail for the same reason that I described above. The problematic part didn't change in #6980.

Also: this was on my local shared memory system, there may be more failures on multiple nodes.

devreal avatar Jun 17 '20 14:06 devreal

Can anyone comment on the scale of these tests? I run these across 2 nodes with 2 processes per node. @devreal how many processes do you run on your single node?

janjust avatar Jun 17 '20 14:06 janjust

I ran them with both one and four processes. The three failed test cases caused by the indexed datatype occur even when running on a single process.

devreal avatar Jun 17 '20 14:06 devreal

A bunch of questions to catch up on:

  1. I looked at the output of ./ucx_info -v for our UCX 1.8.0 build. The only option listed was --prefix.

[d3g293@constance01 bin]$ ucx_info -v

UCT version=1.8.0 revision c30b7da

configured with: --prefix=/share/apps/ucx/1.8.0

  2. The results on Intel MPI/5.1.3.181:
     cache_test.x hangs
     sprsmatmult.x hangs
     ga++/threadsafecpp.x fails

Results for MPICH 3.4a2:
     ga++/threadsafecpp.x fails
     threadsafec.x runs but errors present
     nbtest.x runs but errors present

Results for MVAPICH 2.3.2:
     nbtestc.x hangs
     testmult.x hangs
     sprsmatmult.x fails
     ga++/testmult.x fails
     ga++/threadsafecpp.x fails

The remaining tests run but report errors:
     nbtest.x
     thread_perf_contig.x
     thread_perf_strided.x
     threadsafec.x

It looks like nbtest.x has a bug in it, so I would ignore that test. The thread tests are also a bit sketchy, but they appear to pass on a regular basis when run with the progress ranks runtime.

  3. The test suite is designed to run on 4 processors total and should be run on 2 nodes with 2 processors per node. Some tests will fail if run on anything other than 4 processors. For these tests there should be an error message in the output saying that they must be run on 4 processors.

bjpalmer avatar Jun 17 '20 14:06 bjpalmer

@bjpalmer it appears your UCX is not configured with multithreaded support.

jladd-mlnx avatar Jun 17 '20 15:06 jladd-mlnx

Should it be? I can try building it on my own and rebuild 4.0.4.

bjpalmer avatar Jun 17 '20 15:06 bjpalmer