v4.1: UCX onesided crash (`ibm/onesided/1sided`)
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.x branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone of the v4.1.x branch at 0987319a4ac11423fb140ac650f8d835390b54fc
Please describe the system on which you are running
- Operating system/version: RHEL 8.4
- Computer hardware: ppc64le
- Network type: Infiniband with mlx5 cards
Details of the problem
MTT found this problem while testing the v4.1.x branch and running the ibm/onesided/1sided test. I'm using an older UCX (1.11.2) because I have an older MOFED (MLNX_OFED_LINUX-4.9-4.1.1.1), and UCX 1.11.2 is what that MOFED supports. So this might be a UCX issue, but I'm not sure.
Open MPI was configured with:
./configure --enable-mpirun-prefix-by-default --disable-dlopen --enable-io-romio \
--disable-io-ompio --enable-mpi1-compatibility \
--with-ucx=/opt/ucx-1.11.2/ --without-hcoll \
--enable-debug --enable-picky
The test case was run with 3 nodes and 2 processes per node:
mpirun --host f5n18:20,f5n17:20,f5n16:20 --npernode 2 -mca pml ucx -mca osc ucx,sm -mca btl ^openib ./1sided
The test runs for a while, but in phase 8 one or more of the processes will crash:
seed value: 1610634988
[mesgsize 5976]
phase 5 part 1 (loop(i){fence;get;fence}) c-int chk [st:fence]
phase 5 part 1 (loop(i){fence;get;fence}) c-int chk [st:post]
iter: 61525, time: 3.000061 sec
phase 5 part 1 (loop(i){fence;get;fence}) c-int chk [st:test]
iter: 61133, time: 3.000064 sec
...
phase 8 part 3 (fence;loop(i){accum};fence) nc-int chk [st:fence]
[f5n16:2646755:0:2646755] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfa000000fa0)
==== backtrace (tid:2646755) ====
0 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(ucs_handle_error+0x324) [0x7fff825273b4]
1 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37560) [0x7fff82527560]
2 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37990) [0x7fff82527990]
3 linux-vdso64.so.1(__kernel_sigtramp_rt64+0) [0x7fff835e04d8]
4 [0xfa000000fa0]
5 /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(PMPI_Win_fence+0x1a0) [0x7fff8325bdc8]
6 ./1sided() [0x10004538]
7 ./1sided() [0x10004934]
8 ./1sided() [0x100049ac]
9 /lib64/libc.so.6(+0x24c78) [0x7fff82e84c78]
10 /lib64/libc.so.6(__libc_start_main+0xb4) [0x7fff82e84e64]
=================================
[f5n17:1073992:0:1073992] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7d0000008a0)
==== backtrace (tid:1073992) ====
0 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(ucs_handle_error+0x324) [0x7fffb73073b4]
1 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37560) [0x7fffb7307560]
2 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37990) [0x7fffb7307990]
3 linux-vdso64.so.1(__kernel_sigtramp_rt64+0) [0x7fffb83c04d8]
4 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/ucx/libuct_ib.so.0(uct_rc_iface_flush+0xb0) [0x7fffb4401e00]
5 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucp.so.0(+0x6a4e4) [0x7fffb743a4e4]
6 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucp.so.0(ucp_worker_flush_nbx+0x1c4) [0x7fffb743cd94]
7 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucp.so.0(ucp_worker_flush_nb+0x54) [0x7fffb743cf04]
8 /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(+0x2e3dc4) [0x7fffb81a3dc4]
9 /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(ompi_osc_ucx_fence+0xcc) [0x7fffb81a41c4]
10 /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(PMPI_Win_fence+0x1a0) [0x7fffb803bdc8]
11 ./1sided() [0x10004538]
12 ./1sided() [0x10004934]
13 ./1sided() [0x100049ac]
14 /lib64/libc.so.6(+0x24c78) [0x7fffb7c64c78]
15 /lib64/libc.so.6(__libc_start_main+0xb4) [0x7fffb7c64e64]
=================================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 2646755 on node f5n16 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)
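
For reference, the failing "phase 8 part 3" is the fence-synchronized accumulate pattern named in the log (fence;loop(i){accum};fence). A minimal sketch of that pattern is below; it is an illustrative reconstruction with placeholder sizes and iteration counts, not the actual 1sided.c source:

```c
/* Minimal sketch of the fence;loop(i){accum};fence pattern from
 * phase 8 part 3. Illustrative only -- buffer layout, target choice,
 * and iteration count are placeholders, not the real test's values. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nranks, i, val = 1;
    int *winbuf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank exposes one int per peer in the window */
    MPI_Win_allocate(nranks * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &winbuf, &win);
    for (i = 0; i < nranks; i++)
        winbuf[i] = 0;

    MPI_Win_fence(0, win);                  /* open access epoch */
    for (i = 0; i < 1000; i++) {            /* loop(i){accum}    */
        MPI_Accumulate(&val, 1, MPI_INT,
                       (rank + 1) % nranks, /* placeholder target */
                       rank, 1, MPI_INT, MPI_SUM, win);
    }
    MPI_Win_fence(0, win);                  /* close epoch */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Both backtraces above die under MPI_Win_fence; the second one shows the crash inside the UCP worker flush that osc/ucx issues to complete the epoch.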
The stack I got out of gdb is:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fffb1f21e00 in uct_ep_flush (comp=0x0, flags=0, ep=0x10014d454e0)
at /opt/ucx-1.11.2/src/uct/api/uct.h:3050
3050 return ep->iface->ops.ep_flush(ep, flags, comp);
[Current thread is 1 (Thread 0x7fffb5f5e7f0 (LWP 2635823))]
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-151.el8.ppc64le libblkid-2.32.1-27.el8.ppc64le libevent-2.1.8-5.el8.ppc64le libgcc-8.4.1-1.el8.ppc64le libibverbs-41mlnx1-OFED.4.9.3.0.0.49411.ppc64le libmlx4-41mlnx1-OFED.4.7.3.0.3.49411.ppc64le libmlx5-41mlnx1-OFED.4.9.0.1.2.49411.ppc64le libmount-2.32.1-27.el8.ppc64le libnl3-3.5.0-1.el8.ppc64le librdmacm-41mlnx1-OFED.4.7.3.0.6.49411.ppc64le libselinux-2.9-5.el8.ppc64le libuuid-2.32.1-27.el8.ppc64le numactl-libs-2.0.12-11.el8.ppc64le openssl-libs-1.1.1g-15.el8_3.ppc64le pcre2-10.32-2.el8.ppc64le systemd-libs-239-45.el8_4.3.ppc64le zlib-1.2.11-17.el8.ppc64le
(gdb) bt
#0 0x00007fffb1f21e00 in uct_ep_flush (comp=0x0, flags=0, ep=0x10014d454e0)
at /opt/ucx-1.11.2/src/uct/api/uct.h:3050
#1 uct_rc_iface_flush (tl_iface=0x10014c3b4b0, flags=<optimized out>, comp=<optimized out>) at rc/base/rc_iface.c:294
#2 0x00007fffb4f5a4e4 in uct_iface_flush (comp=0x0, flags=0, iface=<optimized out>)
at /opt/ucx-1.11.2/src/uct/api/uct.h:2627
#3 ucp_worker_flush_check (worker=0x10014bcc150) at rma/flush.c:411
#4 0x00007fffb4f5cd94 in ucp_worker_flush_nbx_internal (param=0x7fffedc528f0, worker=0x10014bcc150) at rma/flush.c:554
#5 ucp_worker_flush_nbx (worker=0x10014bcc150, param=0x7fffedc528f0) at rma/flush.c:596
#6 0x00007fffb4f5cf04 in ucp_worker_flush_nb (worker=<optimized out>, flags=<optimized out>, cb=<optimized out>) at rma/flush.c:586
#7 0x00007fffb5cc3dc4 in opal_common_ucx_worker_flush (worker=0x10014bcc150) at ../../../../opal/mca/common/ucx/common_ucx.h:179
#8 0x00007fffb5cc41c4 in ompi_osc_ucx_fence (assert=0, win=0x10014b5cb60) at osc_ucx_active_target.c:77
#9 0x00007fffb5b5bdc8 in PMPI_Win_fence (assert=0, win=0x10014b5cb60) at pwin_fence.c:60
#10 0x0000000010004538 in main_test_fn (comm=0x7fffb5ec39a8 <ompi_mpi_comm_world>, tid=1) at 1sided.c:702
#11 0x0000000010004934 in runtest (comm=0x7fffb5ec39a8 <ompi_mpi_comm_world>, tid=1) at 1sided.c:762
#12 0x00000000100049ac in main () at 1sided.c:772
(gdb) l
3045 * upon completion of these operations.
3046 */
3047 UCT_INLINE_API ucs_status_t uct_ep_flush(uct_ep_h ep, unsigned flags,
3048 uct_completion_t *comp)
3049 {
3050 return ep->iface->ops.ep_flush(ep, flags, comp);
3051 }
3052
3053
3054 /**
(gdb) p ep
$1 = (uct_ep_h) 0x10014d454e0
(gdb) p ep->iface
$2 = (uct_iface_h) 0x138800001388
(gdb) p ep->iface->ops
Cannot access memory at address 0x138800001388
Note that ep->iface (0x138800001388) is clearly a garbage pointer, so the endpoint appears to have been freed or overwritten by the time the fence tried to flush it. MTT hit a slightly different signature than my manual run above (though that build was configured with --disable-debug):
phase 8 part 2 (fence;loop(i){accum};fence) c-int nochk [st:lock]
iter: 1000, time: 0.016633 sec
phase 8 part 3 (fence;loop(i){accum};fence) nc-int chk [st:fence]
[1652847615.199257] [gnu-ompi-mtt-cn-1:57362:0] rma_send.c:277 UCX ERROR cannot use a remote key on a different endpoint than it was unpacked on
[1652847615.199260] [gnu-ompi-mtt-cn-1:57363:0] rma_send.c:277 UCX ERROR cannot use a remote key on a different endpoint than it was unpacked on
[gnu-ompi-mtt-cn-1:57362] *** An error occurred in MPI_Accumulate
[gnu-ompi-mtt-cn-1:57362] *** reported by process [1298268161,2]
[gnu-ompi-mtt-cn-1:57362] *** on win ucx window 3
[gnu-ompi-mtt-cn-1:57362] *** MPI_ERR_OTHER: known error not in list
[gnu-ompi-mtt-cn-1:57362] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[gnu-ompi-mtt-cn-1:57362] *** and potentially your MPI job)
[gnu-ompi-mtt-cn-0:57755:0:57755] ib_mlx5_log.c:174 Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[gnu-ompi-mtt-cn-0:57755:0:57755] ib_mlx5_log.c:174 RC QP 0x19921 wqe[235]: CSWAP s-- [rva 0x10032a51760 rkey 0x2400] [cmp 0 swap 4294967296] [va 0x7fff9c17fd78 len 8 lkey 0x9466e] [rqpn 0x384c7 dlid=13 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 57755) ====
0 /opt/ucx-1.11.2/lib/libucs.so.0(ucs_handle_error+0x324) [0x7fffb89373b4]
1 /opt/ucx-1.11.2/lib/libucs.so.0(ucs_fatal_error_message+0x118) [0x7fffb8932258]
2 /opt/ucx-1.11.2/lib/libucs.so.0(ucs_log_default_handler+0x1388) [0x7fffb8939768]
3 /opt/ucx-1.11.2/lib/libucs.so.0(ucs_log_dispatch+0xc0) [0x7fffb8939a30]
4 /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x624) [0x7fffb5e4a134]
5 /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(+0x4ee0c) [0x7fffb5e6ee0c]
6 /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0xc4) [0x7fffb5e4af64]
7 /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(+0x51e1c) [0x7fffb5e71e1c]
8 /opt/ucx-1.11.2/lib/libucp.so.0(ucp_worker_progress+0x64) [0x7fffb8a4c074]
9 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2d98cc) [0x7fffb97f98cc]
10 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2d9a10) [0x7fffb97f9a10]
11 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2d9aa0) [0x7fffb97f9aa0]
12 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2dac3c) [0x7fffb97fac3c]
13 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(ompi_osc_ucx_accumulate+0xe0) [0x7fffb97fba88]
14 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(PMPI_Accumulate+0x5cc) [0x7fffb9695ba8]
15 onesided/1sided() [0x10002f38]
16 onesided/1sided() [0x1000442c]
17 onesided/1sided() [0x10004914]
18 onesided/1sided() [0x1000498c]
19 /lib64/libc.so.6(+0x24c78) [0x7fffb92c4c78]
20 /lib64/libc.so.6(__libc_start_main+0xb4) [0x7fffb92c4e64]
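
For context on the "cannot use a remote key on a different endpoint than it was unpacked on" error in the MTT run: UCP ties an unpacked rkey to the endpoint passed to ucp_ep_rkey_unpack(). The sketch below illustrates that contract; the function and variable names are hypothetical and error handling is mostly elided:

```c
/* Sketch of the UCP rkey/endpoint contract the MTT error points at.
 * Hypothetical wrapper function; not taken from osc/ucx. */
#include <stdint.h>
#include <ucp/api/ucp.h>

void rma_put_example(ucp_ep_h ep, const void *rkey_buffer,
                     void *local_buf, size_t len, uint64_t remote_addr)
{
    ucp_rkey_h rkey;

    /* Binds the packed rkey to this specific endpoint */
    if (ucp_ep_rkey_unpack(ep, rkey_buffer, &rkey) != UCS_OK)
        return;

    /* Legal: used on the same ep the rkey was unpacked on */
    ucp_put_nbi(ep, local_buf, len, remote_addr, rkey);

    /* Using this rkey with a *different* ucp_ep_h is what triggers
     * the "cannot use a remote key on a different endpoint than it
     * was unpacked on" error seen above. */

    ucp_rkey_destroy(rkey);
}
```

If osc/ucx ends up reusing an rkey across endpoints (for example, after an endpoint is torn down and recreated), UCX flags it exactly like this, and a stale rkey would also be consistent with the remote-access error on the CSWAP WQE in the trace above.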