
[MPICH 4.3.0][NVHPC 24.9] Intermittent segfault in test/mpi/f08/datatype/structf

Open david-edwards-linaro opened this issue 8 months ago • 4 comments

Environment: SLES 15.6 x86_64, c5.xlarge, single node. MPICH 4.3.0 compiled with NVHPC 24.9 and configured using --enable-debuginfo --enable-shared --with-device=ch4:ucx

The test test/mpi/f08/datatype/structf intermittently fails with "Caught signal 11 (Segmentation fault: address not mapped to object at address 0x5)" and a backtrace such as:

 0  mpich-4.3.0/lib/libucs.so.0(+0x3ddd9) [0x7f04e05ebdd9]
 1  /lib64/libc.so.6(+0x57980) [0x7f04dbe57980]
 2  /lib64/libc.so.6(+0x18ebd0) [0x7f04dbf8ebd0]
 3  mpich-4.3.0/lib/libmpi.so.12(+0x2d7002) [0x7f04dead7002]
 4  mpich-4.3.0/lib/libmpi.so.12(+0x2dbde3) [0x7f04deadbde3]
 5  mpich-4.3.0/lib/libmpi.so.12(PMPI_Unpack+0x3fc) [0x7f04de8c62bc]
 6  mpich-4.3.0/lib/libmpifort.so.12(+0xd2ed6b) [0x7f04e052ed6b]
 7  mpich-4.3.0/lib/libmpifort.so.12(mpi_unpack_f08ts_+0x29bb) [0x7f04dfaf5b7b]
 8  mpich-4.3.0/test/mpi/f08/datatype/structf() [0x401f92]
 9  mpich-4.3.0/test/mpi/f08/datatype/structf() [0x401931]
10  /lib64/libc.so.6(+0x40eec) [0x7f04dbe40eec]
11  /lib64/libc.so.6(__libc_start_main+0x87) [0x7f04dbe40fb5]
12  mpich-4.3.0/test/mpi/f08/datatype/structf() [0x401821]

Output such as the following is observed: cache_put_flush (mpich-4.3.0/src/pm/hydra/proxy/pmip_pmi.c:183): assert (s) failed

Various other tests in the f08 suite also fail.

david-edwards-linaro avatar Apr 24 '25 15:04 david-edwards-linaro

The assertion error from hydra is unrelated.

Could you use addr2line to locate the source code line positions?

hzhou avatar Apr 24 '25 17:04 hzhou
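
For readers following along, here is a minimal sketch of how the unsymbolized frames in the backtrace above could be turned into addr2line invocations. It only builds the command lines and does not run them. The helper name `addr2line_commands` is hypothetical, and the module paths are used exactly as they appear in the report, so they would need to be resolved against the real install prefix. The flags `-f` (print function names), `-C` (demangle), and `-e` (select the binary) are standard binutils addr2line options.

```python
import re
import shlex

# Frames copied from the backtrace in this report. Only frames of the form
# "path(+0xOFFSET) [0xADDRESS]" carry a module-relative offset that can be
# passed directly to addr2line.
BACKTRACE = """\
 3  mpich-4.3.0/lib/libmpi.so.12(+0x2d7002) [0x7f04dead7002]
 4  mpich-4.3.0/lib/libmpi.so.12(+0x2dbde3) [0x7f04deadbde3]
 5  mpich-4.3.0/lib/libmpi.so.12(PMPI_Unpack+0x3fc) [0x7f04de8c62bc]
 6  mpich-4.3.0/lib/libmpifort.so.12(+0xd2ed6b) [0x7f04e052ed6b]
"""

# Matches "<index>  <module>(+0x<offset>) [0x<address>]".
FRAME_RE = re.compile(
    r"^\s*\d+\s+(?P<module>\S+?)\(\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\s+\[0x[0-9a-fA-F]+\]\s*$"
)


def addr2line_commands(backtrace):
    """Return one addr2line command string per module-relative frame.

    Hypothetical helper for illustration; frames that already name a
    symbol (e.g. PMPI_Unpack+0x3fc) or have no offset are skipped.
    """
    commands = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue
        commands.append(
            "addr2line -f -C -e {} {}".format(
                shlex.quote(match["module"]), match["offset"]
            )
        )
    return commands


if __name__ == "__main__":
    for command in addr2line_commands(BACKTRACE):
        print(command)
```

The printed commands can be pasted into a shell on the machine where the libraries were built. Because the build used --enable-debuginfo, addr2line should be able to report file and line information for these offsets, although that depends on the debug sections being present in the installed libraries.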

It generated a core file; loading it in gdb gives:

#0  0x00007f5982d8eb80 in __memmove_evex_unaligned_erms () from /lib64/libc.so.6
#1  0x00007f59858d7002 in MPIR_Typerep_unpack () at mpich-4.3.0/src/mpi/datatype/typerep/src/typerep_yaksa_pack.c:340
#2  0x00007f59858dbde3 in MPIR_Unpack_impl () at mpich-4.3.0/src/mpi/datatype/datatype_impl.c:87
#3  0x00007f59856c62bc in PMPI_Unpack () at mpich-4.3.0/src/binding/c/datatype/unpack.c:94
#4  0x00007f598732ed6b in MPIR_Unpack_cdesc () at mpich-4.3.0/src/binding/fortran/use_mpi_f08/wrappers_c/f08_cdesc.c:4326
#5  0x00007f59868f5b7b in mpi_unpack_f08ts () at mpich-4.3.0/src/binding/fortran/use_mpi_f08/wrappers_f/f08ts.f90:7272
#6  0x0000000000401ef1 in bustit () at mpich-4.3.0/test/mpi/f08/datatype/structf.f90:74

david-edwards-linaro avatar Apr 24 '25 18:04 david-edwards-linaro

That is a memcpy. I suspect it is a compiler bug in which the compiler is trying to optimize with vector instructions, and the intermittency is due to a randomly triggered alignment issue. Does this only show up with the NVHPC compiler?

hzhou avatar Apr 24 '25 19:04 hzhou

Yes, the issue is specific to NVHPC.

david-edwards-linaro avatar Apr 24 '25 19:04 david-edwards-linaro