OpenMPI + UCX fails in MPI_Win_post and works with MPI_Win_fence

Open thomasgillis opened this issue 3 years ago • 12 comments

Dear OpenMPI team,

I am having issues with the RMA part of OpenMPI (v4.1.2), especially with the Post-Start-Complete-Wait synchronization. Our code runs with UCX and fails on an assertion at ucp_ep.inl:231, as described below. When a fence-type synchronization is used instead, the same code runs without issues.

Thanks for your time and your help.

EDIT: I also had an issue with the rdma flavor of osc, but I have submitted it as a separate issue because I think it is a configuration issue and not a code issue.

Background information

We have a code that uses MPI one-sided communication with a PSCW (Post-Start-Complete-Wait) approach.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

I am on OpenMPI 4.1.2 + UCX 1.11.2; see the info files below.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Downloaded from https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.gz

info files

ompi_info ucx_info.txt


Details of the problem

The code fails when used with ucx (mpirun or mpirun --mca osc ucx):

[node1019:163756:0:163756]    ucp_ep.inl:231  Assertion `!(ep->flags & UCP_EP_FLAG_FLUSH_STATE_VALID) || ((flush_state->send_sn == 0) && (flush_state->cmpl_sn == 0) && ucs_hlist_is_empty(&flush_state->reqs))' failed
==== backtrace (tid: 163756) ====
 0  /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucs.so.0(ucs_handle_error+0x264) [0x2aaabe7e3254]
 1  /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucs.so.0(ucs_fatal_error_message+0x50) [0x2aaabe7dffb0]
 2  /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucs.so.0(ucs_fatal_error_format+0xde) [0x2aaabe7e012e]
 3  /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucp.so.0(+0x9aefa) [0x2aaabe31defa]
 4  /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucp.so.0(+0x9e6d1) [0x2aaabe3216d1]
 5  /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x21c) [0x2aaabee662cc]
 6  /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/ucx/libuct_ib.so.0(+0x2d8d1) [0x2aaabee698d1]
 7  /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x2aaabe2c0c4a]
 8  /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/openmpi/mca_osc_ucx.so(ompi_osc_ucx_post+0x5cb) [0x2aaac36da3fb]
 9  /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libmpi.so.40(MPI_Win_post+0xab) [0x2aaaabce6e5b]

I tried the following debug parameters:

--mca mpi_abort_print_stack 1 --mca mpi_show_mca_params 1 --mca mpi_param_check 1 --mca osc_base_verbose 100 --mca osc_rdma_verbose 100 --mca osc_ucx_verbose 100

but I couldn't get more info on the error.

thomasgillis, Feb 02 '22 19:02