OpenMPI + UCX fails in MPI_Win_post but works with MPI_Win_fence
Dear OpenMPI team,
I am having issues with the RMA part of OpenMPI (v4.1.2), specifically with the Post-Start-Complete-Wait (PSCW) synchronization. Our code runs with UCX and fails on an assertion at ucp_ep.inl:231, as described below.
When fence-type synchronization is used instead, the same code runs without issues.
Thanks for your time and your help.
EDIT: I also had an issue with the rdma flavor of osc, but I have submitted it as a separate issue because I think it is a configuration issue and not a code issue.
Background information
We have a code that uses MPI one-sided communication with a PSCW approach.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
I am on OpenMPI 4.1.2 + UCX 1.11.2; see the info files below.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Downloaded from https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.gz
info files
Details of the problem
The code fails when used with ucx (mpirun or mpirun --mca osc ucx):
[node1019:163756:0:163756] ucp_ep.inl:231 Assertion `!(ep->flags & UCP_EP_FLAG_FLUSH_STATE_VALID) || ((flush_state->send_sn == 0) && (flush_state->cmpl_sn == 0) && ucs_hlist_is_empty(&flush_state->reqs))' failed
==== backtrace (tid: 163756) ====
0 /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucs.so.0(ucs_handle_error+0x264) [0x2aaabe7e3254]
1 /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucs.so.0(ucs_fatal_error_message+0x50) [0x2aaabe7dffb0]
2 /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucs.so.0(ucs_fatal_error_format+0xde) [0x2aaabe7e012e]
3 /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucp.so.0(+0x9aefa) [0x2aaabe31defa]
4 /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucp.so.0(+0x9e6d1) [0x2aaabe3216d1]
5 /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x21c) [0x2aaabee662cc]
6 /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/ucx/libuct_ib.so.0(+0x2d8d1) [0x2aaabee698d1]
7 /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x2aaabe2c0c4a]
8 /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/openmpi/mca_osc_ucx.so(ompi_osc_ucx_post+0x5cb) [0x2aaac36da3fb]
9 /home/tgillis/lib-OpenMPI-4.1.2-UCX-1.11.2-GCC-11.2/lib/libmpi.so.40(MPI_Win_post+0xab) [0x2aaaabce6e5b]
I tried the following debug parameters:
--mca mpi_abort_print_stack 1 --mca mpi_show_mca_params 1 --mca mpi_param_check 1 --mca osc_base_verbose 100 --mca osc_rdma_verbose 100 --mca osc_ucx_verbose 100
but I couldn't get any more information about the error.