[MPICH 4.3.0][NVHPC 24.9] Intermittent segfault in test/mpi/f08/datatype/structf
Environment: SLES 15.6 x86_64 c5.xlarge single node
MPICH 4.3.0 compiled with NVHPC 24.9 and configured using --enable-debuginfo --enable-shared --with-device=ch4:ucx
The test/mpi/f08/datatype/structf test intermittently fails with "Caught signal 11 (Segmentation fault: address not mapped to object at address 0x5)" and a backtrace such as:
0 mpich-4.3.0/lib/libucs.so.0(+0x3ddd9) [0x7f04e05ebdd9]
1 /lib64/libc.so.6(+0x57980) [0x7f04dbe57980]
2 /lib64/libc.so.6(+0x18ebd0) [0x7f04dbf8ebd0]
3 mpich-4.3.0/lib/libmpi.so.12(+0x2d7002) [0x7f04dead7002]
4 mpich-4.3.0/lib/libmpi.so.12(+0x2dbde3) [0x7f04deadbde3]
5 mpich-4.3.0/lib/libmpi.so.12(PMPI_Unpack+0x3fc) [0x7f04de8c62bc]
6 mpich-4.3.0/lib/libmpifort.so.12(+0xd2ed6b) [0x7f04e052ed6b]
7 mpich-4.3.0/lib/libmpifort.so.12(mpi_unpack_f08ts_+0x29bb) [0x7f04dfaf5b7b]
8 mpich-4.3.0/test/mpi/f08/datatype/structf() [0x401f92]
9 mpich-4.3.0/test/mpi/f08/datatype/structf() [0x401931]
10 /lib64/libc.so.6(+0x40eec) [0x7f04dbe40eec]
11 /lib64/libc.so.6(__libc_start_main+0x87) [0x7f04dbe40fb5]
12 mpich-4.3.0/test/mpi/f08/datatype/structf() [0x401821]
Output such as the following is observed:
cache_put_flush (mpich-4.3.0/src/pm/hydra/proxy/pmip_pmi.c:183): assert (s) failed
Various other tests in the f08 suite also fail.
The assertion error from hydra is unrelated.
Could you use addr2line to locate the source code line positions?
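For reference, a minimal sketch of how the frames above could be turned into addr2line invocations. The helper name `addr2line_command` and the regular expression are hypothetical, written only for this illustration. It assumes that the `(+0x...)` value in a frame is the offset within the shared object, and that the test executable is not position-independent, so its bracketed absolute address can be passed directly. Frames given as `symbol+offset` are skipped, since resolving them would first need the symbol's own address (for example from nm).

```python
import re
import shlex

# Matches backtrace frames such as
#   "3 mpich-4.3.0/lib/libmpi.so.12(+0x2d7002) [0x7f04dead7002]"
#   "8 mpich-4.3.0/test/mpi/f08/datatype/structf() [0x401f92]"
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+"
    r"(?P<path>\S+?)"
    r"\((?P<symbol>[^()+]*)(?:\+(?P<offset>0x[0-9a-fA-F]+))?\)"
    r"\s+\[(?P<addr>0x[0-9a-fA-F]+)\]\s*$"
)


def addr2line_command(frame_line):
    """Return an addr2line command for one frame, or None if not resolvable.

    Hypothetical helper for illustration only.
    """
    match = FRAME_RE.match(frame_line)
    if match is None:
        return None  # not a recognizable frame line
    if match.group("symbol"):
        # "symbol+offset" frames need the symbol's address first; skip them.
        return None
    # Prefer the module-relative offset; fall back to the absolute address
    # (assumption: only valid for a non-PIE executable).
    address = match.group("offset") or match.group("addr")
    # -f prints the function name, -p puts it on one line, -e names the binary.
    return "addr2line -f -p -e {} {}".format(
        shlex.quote(match.group("path")), address
    )


if __name__ == "__main__":
    frames = [
        "3 mpich-4.3.0/lib/libmpi.so.12(+0x2d7002) [0x7f04dead7002]",
        "5 mpich-4.3.0/lib/libmpi.so.12(PMPI_Unpack+0x3fc) [0x7f04de8c62bc]",
        "8 mpich-4.3.0/test/mpi/f08/datatype/structf() [0x401f92]",
    ]
    for frame in frames:
        command = addr2line_command(frame)
        if command is not None:
            print(command)
```

The printed commands only give file and line information when the binaries carry debug info, which the --enable-debuginfo build above should provide for the MPICH libraries.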
It generated a core file; loading it in gdb gives:
#0 0x00007f5982d8eb80 in __memmove_evex_unaligned_erms () from /lib64/libc.so.6
#1 0x00007f59858d7002 in MPIR_Typerep_unpack () at mpich-4.3.0/src/mpi/datatype/typerep/src/typerep_yaksa_pack.c:340
#2 0x00007f59858dbde3 in MPIR_Unpack_impl () at mpich-4.3.0/src/mpi/datatype/datatype_impl.c:87
#3 0x00007f59856c62bc in PMPI_Unpack () at mpich-4.3.0/src/binding/c/datatype/unpack.c:94
#4 0x00007f598732ed6b in MPIR_Unpack_cdesc () at mpich-4.3.0/src/binding/fortran/use_mpi_f08/wrappers_c/f08_cdesc.c:4326
#5 0x00007f59868f5b7b in mpi_unpack_f08ts () at mpich-4.3.0/src/binding/fortran/use_mpi_f08/wrappers_f/f08ts.f90:7272
#6 0x0000000000401ef1 in bustit () at mpich-4.3.0/test/mpi/f08/datatype/structf.f90:74
That is a memcpy. I suspect a compiler bug in which the compiler tries to optimize with vector instructions, and the intermittency is due to a randomly triggered alignment issue. Does this only show up with the NVHPC compiler?
Yes, the issue is specific to NVHPC.