osc/ucx: MPI_Win_flush sometimes hangs on intra-node
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.0rc7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a source tarball.
configured with:
../configure CFLAGS="-g" --prefix=${PREFIX} --with-ucx=${UCX_PREFIX} --disable-man-pages --with-pmix=internal --with-hwloc=internal --with-libevent=internal --without-hcoll
UCX v1.11.0 was configured with:
./contrib/configure-release --prefix=${UCX_PREFIX}
Please describe the system on which you are running
- Operating system/version: Linux 3.10.0-514.26.2.el7.x86_64
- Computer hardware: Intel Xeon Gold 6154 x 2 (36 cores in total)
- Network type: InfiniBand EDR 4x (100Gbps)
Details of the problem
One-sided communication with osc/ucx sometimes causes my program to hang forever.
The hang is in the MPI_Win_flush() call, and it happens with intra-node execution (flat MPI).
Minimal code to reproduce this behaviour:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>
#define CREATE_WIN2 1
#define WIN_ALLOCATE 0
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    size_t b_size = 1024;
#if WIN_ALLOCATE
    // If the window is allocated with MPI_Win_allocate, it does not hang
    MPI_Win win1;
    void* baseptr1;
    MPI_Win_allocate(b_size,
                     1,
                     MPI_INFO_NULL,
                     MPI_COMM_WORLD,
                     &baseptr1,
                     &win1);
#else
    int* buf1 = (int*)malloc(b_size);
    MPI_Win win1;
    MPI_Win_create(buf1,
                   b_size,
                   1,
                   MPI_INFO_NULL,
                   MPI_COMM_WORLD,
                   &win1);
#endif
    MPI_Win_lock_all(0, win1);
    // If the second window (win2) is not allocated, it does not hang
#if CREATE_WIN2
#if WIN_ALLOCATE
    MPI_Win win2;
    void* baseptr2;
    MPI_Win_allocate(b_size,
                     1,
                     MPI_INFO_NULL,
                     MPI_COMM_WORLD,
                     &baseptr2,
                     &win2);
#else
    int* buf2 = (int*)malloc(b_size);
    MPI_Win win2;
    MPI_Win_create(buf2,
                   b_size,
                   1,
                   MPI_INFO_NULL,
                   MPI_COMM_WORLD,
                   &win2);
#endif
    MPI_Win_lock_all(0, win2);
#endif
    if (rank == 0) {
        printf("start\n");
    }
    // execute MPI_Get and MPI_Win_flush for randomly chosen processes
    for (int i = 0; i < 10000; i++) {
        int t = rank;
        do {
            t = rand() % nproc;
        } while (t == rank);
        int b;
        MPI_Get(&b, 1, MPI_INT, t, 0, 1, MPI_INT, win1);
        MPI_Win_flush(t, win1); // one of the processes hangs here
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        printf("end\n");
    }
    // the rest is for finalization
    MPI_Win_unlock_all(win1);
    MPI_Win_free(&win1);
#if CREATE_WIN2
    MPI_Win_unlock_all(win2);
    MPI_Win_free(&win2);
#endif
    MPI_Finalize();
    if (rank == 0) {
        printf("ok\n");
    }
    return 0;
}
Summarizing what I found:
- The behaviour is non-deterministic. It does not always hang.
- It hangs with intra-node execution (36 processes on a single node in my case).
- When the number of processes is small, it rarely hangs.
- One of the processes is hanging in MPI_Win_flush() when the execution gets stuck.
- If the second window (win2) is not created (CREATE_WIN2=0), it does not hang.
- If MPI_Win_allocate() is used instead of MPI_Win_create() (WIN_ALLOCATE=1), it does not hang.
Save the above code (e.g., test_rma.c), compile and run it repeatedly:
$ mpicc test_rma.c
$ for i in $(seq 1 100); do mpirun -n 36 ./a.out; done
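Optionally, to catch a stuck run without watching the output by hand, the same loop can be wrapped with a timeout (just a sketch; it assumes GNU coreutils timeout is available, and the 300-second limit is arbitrary):
$ for i in $(seq 1 100); do timeout 300 mpirun -n 36 ./a.out || echo "run $i did not finish (possible hang)"; done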
The output will look like:
start
end
ok
...
start
end
ok
start
end
ok
start
<hang>
Checking the behaviour of each process with gdb, I found that one of the processes hangs in MPI_Win_flush(), while the others have already reached MPI_Barrier().
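(For reference, this is roughly how I collected the per-process backtraces; a rough sketch assuming gdb and pgrep are available on the node and that the binary is ./a.out:)
$ for pid in $(pgrep -f ./a.out); do echo "== PID $pid =="; gdb -q -batch -p $pid -ex bt; done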
Backtrace of hanging process (rank 28):
#0 0x00002ac394e01d03 in opal_thread_internal_mutex_lock (p_mutex=0x2ac39441c949 <progress_callback+45>) at ../../../../../opal/mca/threads/pthreads/threads_pthreads_mutex.h:109
#1 0x00002ac394e01d96 in opal_mutex_lock (mutex=0x16e46f8) at ../../../../../opal/mca/threads/mutex.h:122
#2 0x00002ac394e01f75 in opal_common_ucx_wait_request_mt (request=0x171aa10, msg=0x2ac394e4d798 "ucp_ep_flush_nb") at ../../../../../opal/mca/common/ucx/common_ucx_wpool.h:278
#3 0x00002ac394e0421f in opal_common_ucx_winfo_flush (winfo=0x16e46d0, target=27, type=OPAL_COMMON_UCX_FLUSH_B, scope=OPAL_COMMON_UCX_SCOPE_EP, req_ptr=0x0) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:796
#4 0x00002ac394e042db in opal_common_ucx_wpmem_flush (mem=0x16f51e0, scope=OPAL_COMMON_UCX_SCOPE_EP, target=27) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:838
#5 0x00002ac394422fa2 in ompi_osc_ucx_flush (target=27, win=0x15a6410) at ../../../../../ompi/mca/osc/ucx/osc_ucx_passive_target.c:282
#6 0x00002ac3942f62ed in PMPI_Win_flush (rank=27, win=0x15a6410) at ../../../../ompi/mpi/c/win_flush.c:57
#7 0x0000000000400db1 in main ()
Others:
#0 ucs_callbackq_dispatch (cbq=<optimized out>) at /.../ucx/1.11.0/ucx-1.11.0/src/ucs/datastruct/callbackq.h:211
#1 uct_worker_progress (worker=<optimized out>) at /.../ucx/1.11.0/ucx-1.11.0/src/uct/api/uct.h:2592
#2 ucp_worker_progress (worker=0x25e7540) at core/ucp_worker.c:2635
#3 0x00002ad7b4cfacb0 in opal_common_ucx_wpool_progress (wpool=0x224ec20) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:281
#4 0x00002ad7b4314949 in progress_callback () at ../../../../../ompi/mca/osc/ucx/osc_ucx_component.c:205
#5 0x00002ad7b4c9b334 in opal_progress () at ../../opal/runtime/opal_progress.c:224
#6 0x00002ad7b415aaff in ompi_request_wait_completion (req=0x2370490) at ../../ompi/request/request.h:488
#7 0x00002ad7b415ab68 in ompi_request_default_wait (req_ptr=0x7ffde5f1b0f0, status=0x7ffde5f1b0d0) at ../../ompi/request/req_wait.c:40
#8 0x00002ad7b42299b4 in ompi_coll_base_sendrecv_zero (dest=3, stag=-16, source=3, rtag=-16, comm=0x6023c0 <ompi_mpi_comm_world>) at ../../../../ompi/mca/coll/base/coll_base_barrier.c:64
#9 0x00002ad7b4229d4a in ompi_coll_base_barrier_intra_recursivedoubling (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0) at ../../../../ompi/mca/coll/base/coll_base_barrier.c:210
#10 0x00002ad7b4240672 in ompi_coll_tuned_barrier_intra_do_this (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0, algorithm=3, faninout=0, segsize=0) at ../../../../../ompi/mca/coll/tuned/coll_tuned_barrier_decision.c:101
#11 0x00002ad7b42397e3 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0) at ../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:500
#12 0x00002ad7b418100b in PMPI_Barrier (comm=0x6023c0 <ompi_mpi_comm_world>) at ../../../../ompi/mpi/c/barrier.c:76
#13 0x0000000000400dc8 in main ()
@janjust could this be fixed on the latest v5.0.x branch?
Not sure, but I'll take a look today.
@s417-lama can you post the hash of UCX that you're using?
I don't see a hang with UCX v1.11.x or later and Open MPI v5.0.0rc7 from the tarball.
100 iterations with np=40 or np=36 ran to completion.
I appreciate your investigation.
I built UCX v1.11.0 from the tarball downloaded from the following link: https://github.com/openucx/ucx/releases/download/v1.11.0/ucx-1.11.0.tar.gz
The same hang also happens with UCX v1.12.1 in my environment, so I don't think it is a UCX version issue. If the hang cannot be reproduced in your environment, it might be an environment-specific issue close to the hardware.
Not sure if it will be helpful, but here is the MLNX_OFED version in my environment:
$ ofed_info
MLNX_OFED_LINUX-4.4-1.0.0.0 (OFED-4.4-1.0.0):
...
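If the transport or device selection could matter, I can also dump the UCX device/transport list from this node, e.g. (ucx_info ships with UCX; the grep is just to shorten the output):
$ ucx_info -d | grep -e Transport -e Device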
Can you try v4.1.x?
I built Open MPI v4.1.x from git (the latest v4.1.x branch), also configured with UCX v1.11.0, and checked the behaviour of the same program.
Commit hash: 4fdd439e1ef85983570141c6c0b06945c34483b8
However, I did not encounter any hang with v4.1.x.
What I ran:
for i in $(seq 1 100); do mpirun --mca osc ucx --mca btl_openib_allow_ib true -n 36 ./a.out; done
(I added --mca btl_openib_allow_ib true because the following warning was shown.)
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: sca1154
Local device: mlx5_0
--------------------------------------------------------------------------
(Note for v5.0.0rc7) I noticed that the number of MPI_Get/MPI_Win_flush iterations might be too small to reproduce the hang within 100 runs.
The probability of hitting the hang can be increased by raising the number of iterations in the loop below:
for (int i = 0; i < 10000; i++) {
    int t = rank;
    ....
    MPI_Win_flush(t, win1); // one of the processes hangs here
}
from 10,000 to 100,000, for example.
Ok thanks, I'll give it a try