mpich-4.1a1 build failure with ch4:ucx, gcc-12

Open jedbrown opened this issue 3 years ago • 12 comments

I just tested an update to the archlinux package and got this build failure. I can't reduce it at the moment, but let me know if you can't reproduce it and I'll try reducing.

In file included from ../../../../../modules/ucx/src/ucs/debug/debug.c:27:
../../../../../modules/ucx/src/ucs/debug/debug.c: In function ‘load_file’:
../../../../../modules/ucx/src/ucs/debug/debug.c:277:53: error: ‘PTR’ undeclared (first use in this function)
  277 |     symcount = bfd_read_minisymbols(file->abfd, 0, (PTR)&file->syms, &size);
      |                                                     ^~~
../../../../../modules/ucx/src/ucs/debug/debug.c:277:53: note: each undeclared identifier is reported only once for each function it appears i
  ../configure --prefix=/opt/mpich \
               --with-device=ch4:ucx \
               --with-hwloc \
               --without-java \
               --enable-error-checking=runtime \
               --enable-error-messages=all \
               --enable-g=meminit \
               CC=gcc CXX=g++ FC=gfortran \
               FFLAGS=-fallow-argument-mismatch \
               FCFLAGS=-fallow-argument-mismatch

jedbrown avatar Aug 09 '22 21:08 jedbrown

Is there an archlinux package for ucx and could you try using that as a dependency? I believe mpich will use external ucx if detected.

hzhou avatar Aug 09 '22 22:08 hzhou

FFLAGS=-fallow-argument-mismatch
FCFLAGS=-fallow-argument-mismatch

Also could you try without those flags? We have updated our fortran binding in 4.1 so that those flags are no longer necessary.
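A minimal sketch of how the two suggestions above could be combined into one configure line. It assumes UCX ships a `ucx.pc` file that pkg-config can find and that MPICH's configure accepts `--with-ucx=<prefix>`; check both against `./configure --help` before relying on it. The command is printed rather than run.

```shell
#!/bin/sh
# Sketch only: rebuild the configure line from the report without the two
# -fallow-argument-mismatch flags, and point at a system UCX when one is found.
# Assumption: UCX installs ucx.pc and MPICH accepts --with-ucx=<prefix>.
ucx_arg=""
if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists ucx; then
    ucx_arg="--with-ucx=$(pkg-config --variable=prefix ucx)"
fi

configure_cmd="../configure --prefix=/opt/mpich --with-device=ch4:ucx $ucx_arg --with-hwloc --without-java --enable-error-checking=runtime --enable-error-messages=all --enable-g=meminit CC=gcc CXX=g++ FC=gfortran"

# Print instead of executing, so the line can be reviewed first.
echo "$configure_cmd"
```

If no system UCX is detected, `ucx_arg` stays empty and the line matches the original report minus the Fortran flags.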

hzhou avatar Aug 09 '22 22:08 hzhou

There is. It didn't work with mpich when I first tried it (a year or two ago), but I'll try again.

jedbrown avatar Aug 09 '22 22:08 jedbrown

I believe PTR is just void *. It is probably an internal macro defined in one of the binutils headers that has gone away in recent releases. PTR may have been defined as a more specific pointer type, but if we need a cast here, void * will do.

hzhou avatar Aug 15 '22 02:08 hzhou

Solution 1: add the configure option --without-bfd. UCX loses the ability to print a detailed backtrace on segfaults (which I never find helpful anyway).

Solution 2: sed -i -e 's/\<PTR\>/void */' modules/ucx/src/ucs/debug/debug.c
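To see what Solution 2 does before touching the UCX sources, the same substitution can be tried on a scratch file. This is only a sketch: the first line is the failing statement from the build log, and the second line is a made-up identifier added to show that the `\<` and `\>` word-boundary anchors (a GNU sed extension) leave longer names alone.

```shell
#!/bin/sh
# Apply the substitution to a scratch copy of the failing line, not to
# modules/ucx/src/ucs/debug/debug.c itself.
scratch=$(mktemp)
cat > "$scratch" <<'EOF'
    symcount = bfd_read_minisymbols(file->abfd, 0, (PTR)&file->syms, &size);
    PTRDIFF_LIKE_NAME = 0; /* hypothetical line: must stay untouched */
EOF

# Same expression as in Solution 2; without a trailing g flag it replaces
# only the first match on each line.
sed -i -e 's/\<PTR\>/void */' "$scratch"
cat "$scratch"
```

The cast in the first line becomes `(void *)&file->syms`, while `PTRDIFF_LIKE_NAME` is unchanged because `PTR` is not a whole word there.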

hzhou avatar Aug 15 '22 03:08 hzhou

https://github.com/openucx/ucx/pull/8450

hzhou avatar Aug 15 '22 04:08 hzhou

Thanks. In a fresh build of mpich-4.1a1 with this ucx patch, I get SEGV. This is on an AMD laptop with ROCm tools installed, and I guess it's either ucx or rocm misconfiguration, though I haven't seen other observable issues. Do you have suggestions for how to debug efficiently?

$ make CC=/opt/mpich/bin/mpicc CFLAGS='-Wall -Wextra' -B initnull
/opt/mpich/bin/mpicc -Wall -Wextra    initnull.c   -o initnull
$ ./initnull
[kichatna:1828195:0:1828195] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:1828195) ====
 0 0x0000000000038a40 __sigaction()  ???:0
 1 0x0000000000013e3b uct_base_iface_t_init()  ???:0
 2 0x0000000000006a16 uct_rocm_copy_ep_get_short()  ???:0
 3 0x0000000000006b5c uct_rocm_copy_ep_get_short()  ???:0
 4 0x0000000000011ce9 uct_iface_open()  ???:0
 5 0x000000000003e57f ucp_worker_iface_open()  ???:0
 6 0x000000000003f0ba ucp_worker_iface_cleanup()  ???:0
 7 0x0000000000040f16 ucp_worker_create()  ???:0
 8 0x000000000037b6ae MPII_Grequest_set_lang_f77()  ???:0
 9 0x000000000037bf5f MPII_Grequest_set_lang_f77()  ???:0
10 0x000000000035a2ca MPII_Grequest_set_lang_f77()  ???:0
11 0x00000000002d4187 MPII_Op_set_cxx()  ???:0
12 0x00000000002d5ca1 MPII_Op_set_cxx()  ???:0
13 0x00000000002d2add MPII_Op_set_cxx()  ???:0
14 0x00000000002fba36 MPIR_Err_create_code()  ???:0
15 0x00000000002fbf7b MPIR_Err_create_code()  ???:0
16 0x000000000014b015 MPI_Init()  ???:0
17 0x000000000000115c main()  ???:0
18 0x00000000000232d0 __libc_init_first()  ???:0
19 0x000000000002338a __libc_start_main()  ???:0
20 0x0000000000001075 _start()  /build/glibc/src/glibc/csu/../sysdeps/x86_64/start.S:115
=================================
Segmentation fault (core dumped)

jedbrown avatar Aug 15 '22 16:08 jedbrown

The backtrace doesn't make sense -- how would MPIR_Err_create_code -> MPII_Op_set_cxx -> MPII_Grequest_set_lang_f77 -> ucp_worker_create? That is why I never find UCX's backtrace useful. Simply try gdb ./initnull and see if you can get a better backtrace.
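For a non-interactive version of that gdb session, something like the following could work. It is a sketch that assumes `./initnull` is the test binary built earlier in the thread; it only runs gdb when both gdb and the binary are present.

```shell
#!/bin/sh
# Assumption: ./initnull is the test binary from the thread; pass another
# path as the first argument to debug a different program.
prog=${1:-./initnull}

# gdb commands: run until the crash, then print the backtrace.
cmds=$(mktemp)
cat > "$cmds" <<'EOF'
run
bt
EOF

if command -v gdb >/dev/null 2>&1 && [ -x "$prog" ]; then
    # -batch exits after the command file has been processed.
    gdb -batch -x "$cmds" "$prog"
else
    echo "skipping: need gdb and an executable $prog" >&2
fi
```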

hzhou avatar Aug 15 '22 18:08 hzhou

Here's a gdb trace. I'd have to rebuild with better debugging symbols to make it nicer.

#0  0x00007ffff52dee3b in uct_base_iface_t_init () from /opt/mpich/lib/libuct.so.0
#1  0x00007ffff4f44a16 in ?? () from /opt/mpich/lib/ucx/libuct_rocm.so.0
#2  0x00007ffff4f44b5c in ?? () from /opt/mpich/lib/ucx/libuct_rocm.so.0
#3  0x00007ffff52dcce9 in uct_iface_open () from /opt/mpich/lib/libuct.so.0
#4  0x00007ffff535057f in ucp_worker_iface_open () from /opt/mpich/lib/libucp.so.0
#5  0x00007ffff53510ba in ?? () from /opt/mpich/lib/libucp.so.0
#6  0x00007ffff5352f16 in ucp_worker_create () from /opt/mpich/lib/libucp.so.0
#7  0x00007ffff5a276ae in ?? () from /opt/mpich/lib/libmpi.so.0
#8  0x00007ffff5a27f5f in ?? () from /opt/mpich/lib/libmpi.so.0
#9  0x00007ffff5a062ca in ?? () from /opt/mpich/lib/libmpi.so.0
#10 0x00007ffff5980187 in ?? () from /opt/mpich/lib/libmpi.so.0
#11 0x00007ffff5981ca1 in ?? () from /opt/mpich/lib/libmpi.so.0
#12 0x00007ffff597eadd in ?? () from /opt/mpich/lib/libmpi.so.0
#13 0x00007ffff59a7a36 in ?? () from /opt/mpich/lib/libmpi.so.0
#14 0x00007ffff59a7f7b in ?? () from /opt/mpich/lib/libmpi.so.0
#15 0x00007ffff57f7015 in PMPI_Init () from /opt/mpich/lib/libmpi.so.0
#16 0x000055555555515c in main ()

jedbrown avatar Aug 15 '22 19:08 jedbrown

Use --enable-g=dbg to add -g option to the compiler.

hzhou avatar Aug 15 '22 19:08 hzhou

I'm not following why it still swallows. I added --enable-g=dbg and added -O -g to MPICHLIB_CFLAGS. (-O0 conflicts with fortify, and thus fails.)

(gdb) bt
#0  0x00007ffff5316653 in uct_base_iface_t_init () from /opt/mpich/lib/libuct.so.0
#1  0x00007ffff4f832c4 in ?? () from /opt/mpich/lib/ucx/libuct_rocm.so.0
#2  0x00007ffff4f838f8 in ?? () from /opt/mpich/lib/ucx/libuct_rocm.so.0
#3  0x00007ffff5314661 in uct_iface_open () from /opt/mpich/lib/libuct.so.0
#4  0x00007ffff5383e6e in ucp_worker_iface_open () from /opt/mpich/lib/libucp.so.0
#5  0x00007ffff53844d1 in ?? () from /opt/mpich/lib/libucp.so.0
#6  0x00007ffff53868e9 in ucp_worker_create () from /opt/mpich/lib/libucp.so.0
#7  0x00007ffff5a2ed8e in ?? () from /opt/mpich/lib/libmpi.so.0
#8  0x00007ffff5a2f63f in ?? () from /opt/mpich/lib/libmpi.so.0
#9  0x00007ffff5a1265a in ?? () from /opt/mpich/lib/libmpi.so.0
#10 0x00007ffff5993e47 in ?? () from /opt/mpich/lib/libmpi.so.0
#11 0x00007ffff5995731 in ?? () from /opt/mpich/lib/libmpi.so.0
#12 0x00007ffff5992e1d in ?? () from /opt/mpich/lib/libmpi.so.0
#13 0x00007ffff59b9656 in ?? () from /opt/mpich/lib/libmpi.so.0
#14 0x00007ffff59b9c2b in ?? () from /opt/mpich/lib/libmpi.so.0
#15 0x00007ffff5829ef5 in PMPI_Init () from /opt/mpich/lib/libmpi.so.0
#16 0x000055555555515c in main () at initnull.c:5

I don't know why the extra -O2 gets slapped on at the end, but that wouldn't prevent basic debugging in the trace. Are my other configure flags somehow conflicting?

$ mpichversion
MPICH Version:          4.1a1
MPICH Release date:     Fri May  6 15:51:18 CDT 2022
MPICH Device:           ch4:ucx
MPICH configure:        --prefix=/opt/mpich --with-device=ch4:ucx --with-hwloc --without-java --enable-error-checking=runtime --enable-error-messages=all --enable-g=dbg CC=gcc CXX=g++ FC=gfortran
MPICH CC:       gcc  -march=native -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection -Wno-error=array-bounds -O -g  -O2
MPICH CXX:      g++  -march=native -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection -Wp,-D_GLIBCXX_ASSERTIONS -O -g -O2
MPICH F77:      gfortran   -g -O2
MPICH FC:       gfortran   -g -O2
MPICH Custom Information:
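On the trailing -O2: GCC honours the last -O option on the command line, so a quick way to check what the reported flags amount to is to walk them in order. The sketch below does that for the MPICH CC line above (whitespace collapsed); it only inspects the flag string and says nothing about how the embedded UCX libraries were compiled.

```shell
#!/bin/sh
# C flags as reported by mpichversion above, with runs of spaces collapsed.
cflags='-march=native -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -Wno-error=array-bounds -O -g -O2'

effective_opt=""
has_debug=no
for flag in $cflags; do
    case "$flag" in
        -O*) effective_opt=$flag ;;  # a later -O option overrides earlier ones
        -g*) has_debug=yes ;;        # any -g variant requests debug info
    esac
done

echo "effective optimisation: ${effective_opt:-none}, debug info: $has_debug"
```

For this flag string the last -O option is -O2 and -g is present, which is consistent with the remark that the extra -O2 by itself should not prevent basic debugging.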

jedbrown avatar Aug 17 '22 18:08 jedbrown

Nevertheless, both traces consistently point to the UCX stack during Init. We may have better luck escalating to the UCX folks.

hzhou avatar Aug 17 '22 18:08 hzhou

I'm kind of sad to see mpich-4.1 ship without a fixed UCX version, meaning that standard builds will fail anywhere with binutils-2.39.
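A build script could guard against this with a version check along these lines. It is a rough sketch: it assumes the version is the last field on the first line of `ld --version` (the exact format varies between distributions) and that GNU `sort -V` is available. The 2.39 threshold is the one named above.

```shell
#!/bin/sh
# version_ge A B: succeed when version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: the first line of `ld --version` ends with the version number,
# e.g. "GNU ld (GNU Binutils) 2.39".
binutils_version=$(ld --version 2>/dev/null | head -n 1 | awk '{print $NF}')

if [ -n "$binutils_version" ] && version_ge "$binutils_version" 2.39; then
    echo "binutils $binutils_version: consider --without-bfd or the PTR patch"
else
    echo "binutils ${binutils_version:-not found}: older than 2.39 or not detected"
fi
```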

jedbrown avatar Jan 29 '23 15:01 jedbrown

Ahh, we forgot about this ticket. Thanks for the reminder. We'll make sure to have it fixed and backport it to 4.1.1. cc @raffenet

hzhou avatar Jan 29 '23 16:01 hzhou

Thank you! If that'll come soon, we can just update PETSc to use it, otherwise we'll add the workaround.

jedbrown avatar Jan 29 '23 16:01 jedbrown

@raffenet, I think this could be closed now?

prj- avatar Mar 21 '23 13:03 prj-

Yes. Thanks for the reminder! Fix was included in main and 4.1.x branches. Released in 4.1.1.

raffenet avatar Mar 21 '23 14:03 raffenet