osc/ucx: MPI_Win_flush sometimes hangs on intra-node
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.0rc7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a source tarball.
configured with:
../configure CFLAGS="-g" --prefix=${PREFIX} --with-ucx=${UCX_PREFIX} --disable-man-pages --with-pmix=internal --with-hwloc=internal --with-libevent=internal --without-hcoll
UCX v1.11.0 was configured with:
./contrib/configure-release --prefix=${UCX_PREFIX}
Please describe the system on which you are running
- Operating system/version: Linux 3.10.0-514.26.2.el7.x86_64
- Computer hardware: Intel Xeon Gold 6154 x 2 (36 cores in total)
- Network type: InfiniBand EDR 4x (100Gbps)
Details of the problem
One-sided communication with osc/ucx sometimes causes my program to hang forever.
The hang is in the MPI_Win_flush() call, and it happens with intra-node execution (flat MPI).
Minimal code to reproduce this behaviour:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>
#define CREATE_WIN2 1
#define WIN_ALLOCATE 0
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    size_t b_size = 1024;
#if WIN_ALLOCATE
    // If the window is allocated with MPI_Win_allocate, it does not hang
    MPI_Win win1;
    void* baseptr1;
    MPI_Win_allocate(b_size,
                     1,
                     MPI_INFO_NULL,
                     MPI_COMM_WORLD,
                     &baseptr1,
                     &win1);
#else
    int* buf1 = (int*)malloc(b_size);
    MPI_Win win1;
    MPI_Win_create(buf1,
                   b_size,
                   1,
                   MPI_INFO_NULL,
                   MPI_COMM_WORLD,
                   &win1);
#endif
    MPI_Win_lock_all(0, win1);
    // If the second window (win2) is not allocated, it does not hang
#if CREATE_WIN2
#if WIN_ALLOCATE
    MPI_Win win2;
    void* baseptr2;
    MPI_Win_allocate(b_size,
                     1,
                     MPI_INFO_NULL,
                     MPI_COMM_WORLD,
                     &baseptr2,
                     &win2);
#else
    int* buf2 = (int*)malloc(b_size);
    MPI_Win win2;
    MPI_Win_create(buf2,
                   b_size,
                   1,
                   MPI_INFO_NULL,
                   MPI_COMM_WORLD,
                   &win2);
#endif
    MPI_Win_lock_all(0, win2);
#endif
    if (rank == 0) {
        printf("start\n");
    }
    // execute MPI_Get and MPI_Win_flush for randomly chosen processes
    for (int i = 0; i < 10000; i++) {
        int t = rank;
        do {
            t = rand() % nproc;
        } while (t == rank);
        int b;
        MPI_Get(&b, 1, MPI_INT, t, 0, 1, MPI_INT, win1);
        MPI_Win_flush(t, win1); // one of the processes hangs here
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        printf("end\n");
    }
    // the rest is for finalization
    MPI_Win_unlock_all(win1);
    MPI_Win_free(&win1);
#if CREATE_WIN2
    MPI_Win_unlock_all(win2);
    MPI_Win_free(&win2);
#endif
    MPI_Finalize();
    if (rank == 0) {
        printf("ok\n");
    }
    return 0;
}
Summarizing what I found:
- The behaviour is non-deterministic. It does not always hang.
- It hangs with intra-node execution (36 processes on a single node in my case).
- When the number of processes is small, it rarely hangs.
- One of the processes is hanging in MPI_Win_flush() when the execution gets stuck.
- If the second window (win2) is not created (CREATE_WIN2=0), it does not hang.
- If MPI_Win_allocate() is used instead of MPI_Win_create() (WIN_ALLOCATE=1), it does not hang.
Save the above code (e.g., test_rma.c), compile and run it repeatedly:
$ mpicc test_rma.c
$ for i in $(seq 1 100); do mpirun -n 36 ./a.out; done
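Optionally, to catch a stuck run without watching the output by hand, the same loop can be wrapped with a timeout (just a sketch; it assumes GNU coreutils timeout is available, and the 300-second limit is arbitrary):
$ for i in $(seq 1 100); do timeout 300 mpirun -n 36 ./a.out || echo "run $i did not finish (possible hang)"; done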
The output will look like:
start
end
ok
...
start
end
ok
start
end
ok
start
<hang>
Checking the behaviour of each process with gdb, I found that one of the processes hangs in MPI_Win_flush(), while the others have already reached MPI_Barrier().
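(For reference, this is roughly how I collected the per-process backtraces; a rough sketch assuming gdb and pgrep are available on the node and that the binary is ./a.out:)
$ for pid in $(pgrep -f ./a.out); do echo "== PID $pid =="; gdb -q -batch -p $pid -ex bt; done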
Backtrace of hanging process (rank 28):
#0 0x00002ac394e01d03 in opal_thread_internal_mutex_lock (p_mutex=0x2ac39441c949 <progress_callback+45>) at ../../../../../opal/mca/threads/pthreads/threads_pthreads_mutex.h:109
#1 0x00002ac394e01d96 in opal_mutex_lock (mutex=0x16e46f8) at ../../../../../opal/mca/threads/mutex.h:122
#2 0x00002ac394e01f75 in opal_common_ucx_wait_request_mt (request=0x171aa10, msg=0x2ac394e4d798 "ucp_ep_flush_nb") at ../../../../../opal/mca/common/ucx/common_ucx_wpool.h:278
#3 0x00002ac394e0421f in opal_common_ucx_winfo_flush (winfo=0x16e46d0, target=27, type=OPAL_COMMON_UCX_FLUSH_B, scope=OPAL_COMMON_UCX_SCOPE_EP, req_ptr=0x0) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:796
#4 0x00002ac394e042db in opal_common_ucx_wpmem_flush (mem=0x16f51e0, scope=OPAL_COMMON_UCX_SCOPE_EP, target=27) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:838
#5 0x00002ac394422fa2 in ompi_osc_ucx_flush (target=27, win=0x15a6410) at ../../../../../ompi/mca/osc/ucx/osc_ucx_passive_target.c:282
#6 0x00002ac3942f62ed in PMPI_Win_flush (rank=27, win=0x15a6410) at ../../../../ompi/mpi/c/win_flush.c:57
#7 0x0000000000400db1 in main ()
Others:
#0 ucs_callbackq_dispatch (cbq=<optimized out>) at /.../ucx/1.11.0/ucx-1.11.0/src/ucs/datastruct/callbackq.h:211
#1 uct_worker_progress (worker=<optimized out>) at /.../ucx/1.11.0/ucx-1.11.0/src/uct/api/uct.h:2592
#2 ucp_worker_progress (worker=0x25e7540) at core/ucp_worker.c:2635
#3 0x00002ad7b4cfacb0 in opal_common_ucx_wpool_progress (wpool=0x224ec20) at ../../../../../opal/mca/common/ucx/common_ucx_wpool.c:281
#4 0x00002ad7b4314949 in progress_callback () at ../../../../../ompi/mca/osc/ucx/osc_ucx_component.c:205
#5 0x00002ad7b4c9b334 in opal_progress () at ../../opal/runtime/opal_progress.c:224
#6 0x00002ad7b415aaff in ompi_request_wait_completion (req=0x2370490) at ../../ompi/request/request.h:488
#7 0x00002ad7b415ab68 in ompi_request_default_wait (req_ptr=0x7ffde5f1b0f0, status=0x7ffde5f1b0d0) at ../../ompi/request/req_wait.c:40
#8 0x00002ad7b42299b4 in ompi_coll_base_sendrecv_zero (dest=3, stag=-16, source=3, rtag=-16, comm=0x6023c0 <ompi_mpi_comm_world>) at ../../../../ompi/mca/coll/base/coll_base_barrier.c:64
#9 0x00002ad7b4229d4a in ompi_coll_base_barrier_intra_recursivedoubling (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0) at ../../../../ompi/mca/coll/base/coll_base_barrier.c:210
#10 0x00002ad7b4240672 in ompi_coll_tuned_barrier_intra_do_this (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0, algorithm=3, faninout=0, segsize=0) at ../../../../../ompi/mca/coll/tuned/coll_tuned_barrier_decision.c:101
#11 0x00002ad7b42397e3 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x6023c0 <ompi_mpi_comm_world>, module=0x232b5b0) at ../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:500
#12 0x00002ad7b418100b in PMPI_Barrier (comm=0x6023c0 <ompi_mpi_comm_world>) at ../../../../ompi/mpi/c/barrier.c:76
#13 0x0000000000400dc8 in main ()
@janjust could this be fixed on the latest v5.0.x branch?
Not sure, but I'll take a look today.
@s417-lama can you post the hash of UCX that you're using?
I don't see a hang with UCX v1.11.x or later and Open MPI v5.0.0rc7 from the tarball.
100 iterations with np=40 or np=36 ran to completion.
I appreciate your investigation.
I built UCX v1.11.0 from the tarball downloaded from the following link: https://github.com/openucx/ucx/releases/download/v1.11.0/ucx-1.11.0.tar.gz
The same hang also happens with UCX v1.12.1 in my environment, so I don't think it is a UCX version issue. If the hang cannot be reproduced in your environment, it might be an environment-specific issue close to the hardware.
Not sure if it will be helpful, but here is the MLNX_OFED version in my environment:
$ ofed_info
MLNX_OFED_LINUX-4.4-1.0.0.0 (OFED-4.4-1.0.0):
...
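If the transport or device selection could matter, I can also dump the UCX device/transport list from this node, e.g. (ucx_info ships with UCX; the grep is just to shorten the output):
$ ucx_info -d | grep -e Transport -e Device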
Can you try v4.1.x?
I built Open MPI v4.1.x from git (the latest v4.1.x branch), also configured with UCX v1.11.0, and checked the behaviour of the same program.
Commit hash: 4fdd439e1ef85983570141c6c0b06945c34483b8
However, I did not encounter any hang with v4.1.x.
What I ran:
for i in $(seq 1 100); do mpirun --mca osc ucx --mca btl_openib_allow_ib true -n 36 ./a.out; done
(I added --mca btl_openib_allow_ib true because the following warning was shown.)
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: sca1154
Local device: mlx5_0
--------------------------------------------------------------------------
(Note for v5.0.0rc7) I noticed that the number of MPI_Get/MPI_Win_flush iterations might be too small to reproduce the hang within 100 runs.
The probability of hitting the hang can be increased by raising the number of iterations in the loop below:
for (int i = 0; i < 10000; i++) {
    int t = rank;
    ....
    MPI_Win_flush(t, win1); // one of the processes hangs here
}
from 10,000 to 100,000, for example.
Ok thanks, I'll give it a try