mpich-4.1a1 build failure with ch4:ucx, gcc-12
I just tested an update to the Arch Linux package and got this build failure. I can't reduce it at the moment, but let me know if you can't reproduce it and I'll try reducing.
In file included from ../../../../../modules/ucx/src/ucs/debug/debug.c:27:
../../../../../modules/ucx/src/ucs/debug/debug.c: In function ‘load_file’:
../../../../../modules/ucx/src/ucs/debug/debug.c:277:53: error: ‘PTR’ undeclared (first use in this function)
277 | symcount = bfd_read_minisymbols(file->abfd, 0, (PTR)&file->syms, &size);
| ^~~
../../../../../modules/ucx/src/ucs/debug/debug.c:277:53: note: each undeclared identifier is reported only once for each function it appears in
../configure --prefix=/opt/mpich \
--with-device=ch4:ucx \
--with-hwloc \
--without-java \
--enable-error-checking=runtime \
--enable-error-messages=all \
--enable-g=meminit \
CC=gcc CXX=g++ FC=gfortran \
FFLAGS=-fallow-argument-mismatch \
FCFLAGS=-fallow-argument-mismatch
Is there an archlinux package for ucx and could you try using that as a dependency? I believe mpich will use external ucx if detected.
FFLAGS=-fallow-argument-mismatch
FCFLAGS=-fallow-argument-mismatch
Also could you try without those flags? We have updated our fortran binding in 4.1 so that those flags are no longer necessary.
There is. It didn't work with mpich when I first tried it (a year or two ago), but I'll try again.
I believe PTR is just void *. It is probably an internal macro defined in one of the binutils headers that has gone away in recent releases. Maybe PTR was defined as a more specific pointer type, but if we need a cast here, void * will do.
Solution 1: add the configure option --without-bfd. UCX loses the ability to print a detailed backtrace on segfaults (which I never find helpful anyway).
Solution 2: sed -i -e 's/\<PTR\>/void */' modules/ucx/src/ucs/debug/debug.c
https://github.com/openucx/ucx/pull/8450
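A minimal sketch of Solution 2, run against a scratch file instead of the real source so the substitution can be checked before touching the tree. It assumes GNU sed (the \< and \> word boundaries are a GNU extension). The scratch file, the patch_ptr helper, and the PTRDIFF line are made up for illustration, and the g flag is an addition so that every occurrence on a line is replaced.

```shell
#!/bin/sh
# Hypothetical helper: print the given file with the bare identifier PTR
# replaced by "void *". Word boundaries keep longer names such as PTRDIFF intact.
patch_ptr() {
    sed -e 's/\<PTR\>/void */g' "$1"
}

# Scratch input: the first line is copied from the compiler error above,
# the second is an invented line that must NOT be changed.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
symcount = bfd_read_minisymbols(file->abfd, 0, (PTR)&file->syms, &size);
ptrdiff_t PTRDIFF = 0;
EOF

patched=$(patch_ptr "$tmp")
printf '%s\n' "$patched"
rm -f "$tmp"
```

Once the output looks right, the same expression can be applied in place with sed -i to modules/ucx/src/ucs/debug/debug.c, as in Solution 2.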
Thanks. In a fresh build of mpich-4.1a1 with this ucx patch, I get SEGV. This is on an AMD laptop with ROCm tools installed, and I guess it's either ucx or rocm misconfiguration, though I haven't seen other observable issues. Do you have suggestions for how to debug efficiently?
$ make CC=/opt/mpich/bin/mpicc CFLAGS='-Wall -Wextra' -B initnull
/opt/mpich/bin/mpicc -Wall -Wextra initnull.c -o initnull
$ ./initnull
[kichatna:1828195:0:1828195] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:1828195) ====
0 0x0000000000038a40 __sigaction() ???:0
1 0x0000000000013e3b uct_base_iface_t_init() ???:0
2 0x0000000000006a16 uct_rocm_copy_ep_get_short() ???:0
3 0x0000000000006b5c uct_rocm_copy_ep_get_short() ???:0
4 0x0000000000011ce9 uct_iface_open() ???:0
5 0x000000000003e57f ucp_worker_iface_open() ???:0
6 0x000000000003f0ba ucp_worker_iface_cleanup() ???:0
7 0x0000000000040f16 ucp_worker_create() ???:0
8 0x000000000037b6ae MPII_Grequest_set_lang_f77() ???:0
9 0x000000000037bf5f MPII_Grequest_set_lang_f77() ???:0
10 0x000000000035a2ca MPII_Grequest_set_lang_f77() ???:0
11 0x00000000002d4187 MPII_Op_set_cxx() ???:0
12 0x00000000002d5ca1 MPII_Op_set_cxx() ???:0
13 0x00000000002d2add MPII_Op_set_cxx() ???:0
14 0x00000000002fba36 MPIR_Err_create_code() ???:0
15 0x00000000002fbf7b MPIR_Err_create_code() ???:0
16 0x000000000014b015 MPI_Init() ???:0
17 0x000000000000115c main() ???:0
18 0x00000000000232d0 __libc_init_first() ???:0
19 0x000000000002338a __libc_start_main() ???:0
20 0x0000000000001075 _start() /build/glibc/src/glibc/csu/../sysdeps/x86_64/start.S:115
=================================
Segmentation fault (core dumped)
The backtrace doesn't make sense -- how would MPIR_Err_create_code -> MPII_Op_set_cxx -> MPII_Grequest_set_lang_f77 -> ucp_worker_create? That is why I never find UCX's backtrace useful. Simply try gdb ./initnull and see if you can get a better backtrace.
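For reference, a non-interactive way to capture the same information is to let gdb run the program in batch mode and print the backtrace when it stops. This is only a sketch: bt_on_crash is a hypothetical wrapper name, and it assumes gdb is installed and the binary path is correct.

```shell
#!/bin/sh
# Hypothetical wrapper: run a program under gdb in batch mode and print a
# backtrace when it stops (for example on SIGSEGV).
bt_on_crash() {
    prog=$1
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found in PATH" >&2
        return 127
    fi
    if [ ! -x "$prog" ]; then
        echo "no executable at $prog" >&2
        return 2
    fi
    # -batch exits after the commands finish; --args passes the program and
    # any remaining arguments through to the debuggee.
    gdb -batch -ex run -ex bt --args "$@"
}
```

Usage would be bt_on_crash ./initnull. Frames still show as ?? for any library built without debug info.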
Here's a gdb trace. I'd have to rebuild with better debugging symbols to make it nicer.
#0 0x00007ffff52dee3b in uct_base_iface_t_init () from /opt/mpich/lib/libuct.so.0
#1 0x00007ffff4f44a16 in ?? () from /opt/mpich/lib/ucx/libuct_rocm.so.0
#2 0x00007ffff4f44b5c in ?? () from /opt/mpich/lib/ucx/libuct_rocm.so.0
#3 0x00007ffff52dcce9 in uct_iface_open () from /opt/mpich/lib/libuct.so.0
#4 0x00007ffff535057f in ucp_worker_iface_open () from /opt/mpich/lib/libucp.so.0
#5 0x00007ffff53510ba in ?? () from /opt/mpich/lib/libucp.so.0
#6 0x00007ffff5352f16 in ucp_worker_create () from /opt/mpich/lib/libucp.so.0
#7 0x00007ffff5a276ae in ?? () from /opt/mpich/lib/libmpi.so.0
#8 0x00007ffff5a27f5f in ?? () from /opt/mpich/lib/libmpi.so.0
#9 0x00007ffff5a062ca in ?? () from /opt/mpich/lib/libmpi.so.0
#10 0x00007ffff5980187 in ?? () from /opt/mpich/lib/libmpi.so.0
#11 0x00007ffff5981ca1 in ?? () from /opt/mpich/lib/libmpi.so.0
#12 0x00007ffff597eadd in ?? () from /opt/mpich/lib/libmpi.so.0
#13 0x00007ffff59a7a36 in ?? () from /opt/mpich/lib/libmpi.so.0
#14 0x00007ffff59a7f7b in ?? () from /opt/mpich/lib/libmpi.so.0
#15 0x00007ffff57f7015 in PMPI_Init () from /opt/mpich/lib/libmpi.so.0
#16 0x000055555555515c in main ()
Use --enable-g=dbg to add -g option to the compiler.
I'm not following why it still swallows. I added --enable-g=dbg and added -O -g to MPICHLIB_CFLAGS. (-O0 conflicts with fortify, and thus fails.)
(gdb) bt
#0 0x00007ffff5316653 in uct_base_iface_t_init () from /opt/mpich/lib/libuct.so.0
#1 0x00007ffff4f832c4 in ?? () from /opt/mpich/lib/ucx/libuct_rocm.so.0
#2 0x00007ffff4f838f8 in ?? () from /opt/mpich/lib/ucx/libuct_rocm.so.0
#3 0x00007ffff5314661 in uct_iface_open () from /opt/mpich/lib/libuct.so.0
#4 0x00007ffff5383e6e in ucp_worker_iface_open () from /opt/mpich/lib/libucp.so.0
#5 0x00007ffff53844d1 in ?? () from /opt/mpich/lib/libucp.so.0
#6 0x00007ffff53868e9 in ucp_worker_create () from /opt/mpich/lib/libucp.so.0
#7 0x00007ffff5a2ed8e in ?? () from /opt/mpich/lib/libmpi.so.0
#8 0x00007ffff5a2f63f in ?? () from /opt/mpich/lib/libmpi.so.0
#9 0x00007ffff5a1265a in ?? () from /opt/mpich/lib/libmpi.so.0
#10 0x00007ffff5993e47 in ?? () from /opt/mpich/lib/libmpi.so.0
#11 0x00007ffff5995731 in ?? () from /opt/mpich/lib/libmpi.so.0
#12 0x00007ffff5992e1d in ?? () from /opt/mpich/lib/libmpi.so.0
#13 0x00007ffff59b9656 in ?? () from /opt/mpich/lib/libmpi.so.0
#14 0x00007ffff59b9c2b in ?? () from /opt/mpich/lib/libmpi.so.0
#15 0x00007ffff5829ef5 in PMPI_Init () from /opt/mpich/lib/libmpi.so.0
#16 0x000055555555515c in main () at initnull.c:5
I don't know why the extra -O2 gets slapped on at the end, but that wouldn't prevent basic debugging in the trace. Are my other configure flags somehow conflicting?
$ mpichversion
MPICH Version: 4.1a1
MPICH Release date: Fri May 6 15:51:18 CDT 2022
MPICH Device: ch4:ucx
MPICH configure: --prefix=/opt/mpich --with-device=ch4:ucx --with-hwloc --without-java --enable-error-checking=runtime --enable-error-messages=all --enable-g=dbg CC=gcc CXX=g++ FC=gfortran
MPICH CC: gcc -march=native -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -Wno-error=array-bounds -O -g -O2
MPICH CXX: g++ -march=native -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -Wp,-D_GLIBCXX_ASSERTIONS -O -g -O2
MPICH F77: gfortran -g -O2
MPICH FC: gfortran -g -O2
MPICH Custom Information:
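On the trailing -O2: when GCC is given several -O options it uses the last one, so the CC line above builds at -O2, and -g stays in effect whichever level wins. A small sketch that picks the effective level out of a flag string (effective_opt is a hypothetical helper, and the flag string is copied from the MPICH CC line above):

```shell
#!/bin/sh
# Hypothetical helper: print the last -O* option in a whitespace-separated
# flag string, which is the one GCC actually applies.
effective_opt() {
    last=""
    for flag in $1; do            # intentionally unquoted: split on whitespace
        case "$flag" in
            -O*) last=$flag ;;
        esac
    done
    printf '%s\n' "$last"
}

cc_flags='gcc -march=native -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -Wno-error=array-bounds -O -g -O2'
effective_opt "$cc_flags"         # prints -O2
```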
Nevertheless, the traces consistently point to the UCX stack during Init. We may have better luck escalating this to the UCX folks.
I'm kind of sad to see mpich-4.1 ship without a fixed UCX version, meaning that standard builds will fail anywhere with binutils-2.39.
Ahh, we forgot about this ticket. Thanks for the reminder. We'll make sure to have it fixed and backport it to 4.1.1. cc @raffenet
Thank you! If that'll come soon, we can just update PETSc to use it, otherwise we'll add the workaround.
@raffenet, I think this could be closed now?
Yes. Thanks for the reminder! Fix was included in main and 4.1.x branches. Released in 4.1.1.