Crash on MPI_Init with Open MPI 5.0.0, Intel Fortran, and -init=snan
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI 5.0.0 and 4.1.5
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Both were installed from tarball. 4.1.5 was installed via:
mkdir build-intel-2021.6.0-SLES15 && cd build-intel-2021.6.0-SLES15
../configure --disable-wrapper-rpath --disable-wrapper-runpath --with-pmi --with-slurm \
  --enable-mpi1-compatibility --with-pmix --without-verbs \
  CC=icc CXX=icpc FC=ifort \
  --prefix=/discover/swdev/gmao_SIteam/MPI/openmpi/4.1.5-SLES15/intel-2021.6.0 |& tee configure.intel-2021.6.0.log
and 5.0.0 was installed with:
mkdir build-intel-2021.6.0-SLES15 && cd build-intel-2021.6.0-SLES15
../configure --disable-wrapper-rpath --disable-wrapper-runpath --with-slurm \
  --enable-mpi1-compatibility --with-pmix \
  CC=icc CXX=icpc FC=ifort \
  --prefix=/discover/swdev/gmao_SIteam/MPI/openmpi/5.0.0-SLES15/intel-2021.6.0 |& tee configure.intel-2021.6.0.log
Please describe the system on which you are running
- Operating system/version: SLES 15 SP4
- Computer hardware: AMD Milan cluster
- Network type: Infiniband
Details of the problem
Given this program:
program a
use mpi
implicit none
integer :: ierror
call mpi_init(ierror)
call MPI_Finalize(ierror)
end program
we seem to be able to trigger a crash with Open MPI 5.0.0 when using -init=snan. For example:
$ mpifort -V && mpirun -V
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
mpirun (Open MPI) 5.0.0
Report bugs to https://www.open-mpi.org/community/help/
$ mpifort -g -O0 -init=snan -traceback -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe
forrtl: error (65): floating invalid
Image PC Routine Line Source
just_init.exe 000000000040A40B Unknown Unknown Unknown
libpthread-2.31.s 000014ACCCEA5910 Unknown Unknown Unknown
libxml2.so.2.9.14 000014ACC70DE92F xmlXPathInit Unknown Unknown
libxml2.so.2.9.14 000014ACC7093C03 Unknown Unknown Unknown
libxml2.so.2.9.14 000014ACC708EE3B xmlCheckVersion Unknown Unknown
libhwloc.so.15.6. 000014ACCAB49862 Unknown Unknown Unknown
libhwloc.so.15.6. 000014ACCAB3E32B Unknown Unknown Unknown
libhwloc.so.15.6. 000014ACCAB31327 Unknown Unknown Unknown
libopen-pal.so.80 000014ACCC03E3B9 opal_hwloc_base_g Unknown Unknown
libopen-pal.so.80 000014ACCC0222D0 Unknown Unknown Unknown
libopen-pal.so.80 000014ACCBFAD447 mca_base_framewor Unknown Unknown
libopen-pal.so.80 000014ACCC01BF04 Unknown Unknown Unknown
libopen-pal.so.80 000014ACCBFAE5FB mca_base_framewor Unknown Unknown
libopen-pal.so.80 000014ACCBFAE5FB mca_base_framewor Unknown Unknown
libmpi.so.40.40.0 000014ACCD0C45D9 Unknown Unknown Unknown
libmpi.so.40.40.0 000014ACCD0C4476 ompi_mpi_instance Unknown Unknown
libmpi.so.40.40.0 000014ACCD0B61C0 ompi_mpi_init Unknown Unknown
libmpi.so.40.40.0 000014ACCD0EF62D MPI_Init Unknown Unknown
libmpi_mpifh.so.4 000014ACCD49F9B7 PMPI_Init_f08 Unknown Unknown
just_init.exe 000000000040952E MAIN__ 5 just_init.F90
just_init.exe 00000000004094E2 Unknown Unknown Unknown
libc-2.31.so 000014ACCCCC824D __libc_start_main Unknown Unknown
just_init.exe 00000000004093FA Unknown Unknown Unknown
However, the same thing with 4.1.5 works:
$ mpifort -V && mpirun -V
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
mpirun (Open MPI) 4.1.5
Report bugs to http://www.open-mpi.org/community/help/
$ mpifort -g -O0 -init=snan -traceback -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe
$
If I don't use -init=snan, all is well with Open MPI 5.0.0:
$ mpirun -V
mpirun (Open MPI) 5.0.0
Report bugs to https://www.open-mpi.org/community/help/
$ mpifort -g -O0 -traceback -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe
$
I also tried Intel MPI 2021.10.0 and that works:
$ mpiifort -V && mpirun -V
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
Intel(R) MPI Library for Linux* OS, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
Copyright 2003-2023, Intel Corporation.
$ mpiifort -g -O0 -init=snan -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe
$
Just for completeness, I built Open MPI 5.0.0 using ifx instead of ifort (and icx and icpx), and if I use ifx instead of ifort with Open MPI 5.0.0 I get a crash:
$ mpifort -V && mpirun -V
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230721
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
mpirun (Open MPI) 5.0.0
Report bugs to https://www.open-mpi.org/community/help/
mathomp4@borgl161 ~/MPITests main ?3
$ mpifort -g -O0 -init=snan -traceback -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe
forrtl: error (65): floating invalid
Image PC Routine Line Source
libpthread-2.31.s 0000150210F21910 Unknown Unknown Unknown
libxml2.so.2.9.14 000015020C67392F xmlXPathInit Unknown Unknown
libxml2.so.2.9.14 000015020C628C03 Unknown Unknown Unknown
libxml2.so.2.9.14 000015020C623E3B xmlCheckVersion Unknown Unknown
libhwloc.so.15.6. 000015020EBA4862 Unknown Unknown Unknown
libhwloc.so.15.6. 000015020EB9932B Unknown Unknown Unknown
libhwloc.so.15.6. 000015020EB8C327 Unknown Unknown Unknown
libopen-pal.so.80 00001502100C2E72 opal_hwloc_base_g Unknown Unknown
libopen-pal.so.80 00001502100AAF49 Unknown Unknown Unknown
libopen-pal.so.80 000015021003804A mca_base_framewor Unknown Unknown
libopen-pal.so.80 00001502100A5ED6 Unknown Unknown Unknown
libopen-pal.so.80 0000150210038C61 mca_base_framewor Unknown Unknown
libopen-pal.so.80 0000150210038C61 mca_base_framewor Unknown Unknown
libmpi.so.40.40.0 0000150211536B12 ompi_mpi_instance Unknown Unknown
libmpi.so.40.40.0 000015021152A8EA ompi_mpi_init Unknown Unknown
libmpi.so.40.40.0 000015021156A070 MPI_Init Unknown Unknown
libmpi_mpifh.so.4 00001502118CF3F8 PMPI_Init_f08 Unknown Unknown
just_init.exe 0000000000409ACF a 5 just_init.F90
just_init.exe 0000000000409A8D Unknown Unknown Unknown
libc-2.31.so 0000150210D4224D __libc_start_main Unknown Unknown
just_init.exe 00000000004099BA Unknown Unknown Unknown
Indeed, this uses ifx 2023.2.0 (instead of ifort 2021.6.0) so it's the latest Intel compiler that I have access to.
The thing is, as far as I know, -init=snan only initializes real and complex:
[no]snan Determines whether the compiler initializes to signaling NaN all uninitialized variables of intrinsic type REAL or COMPLEX that are saved, local,
automatic, or allocated variables.
and my code has a single integer and no reals! It's like somehow I'm...corrupting the code with that flag? 🤷🏼
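A plausible reading (borne out later in this thread): -init=snan implies -fpe0, which unmasks floating-point exception traps for the whole process, so an invalid operation anywhere in the process (here, inside libxml2 pulled in through hwloc) traps even though the Fortran source has no REALs. A minimal sketch of how to test that, using the same reproducer and flags as above but swapping -init=snan for -fpe0:
# If the crash is about trap masking rather than REAL initialization,
# -fpe0 alone should reproduce it.
$ mpifort -g -O0 -fpe0 -traceback -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe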
Interesting. I couldn't get past the configure stage if I set FCFLAGS="-init=snan".
Oh. I didn't configure Open MPI with that flag (though...it should work?); I built it "normally", and just the example program showed the issue.
I hit another problem: if one is building from a clone of ompi, the variant of openpmix in the main branch doesn't compile with the Intel classic compiler, owing to picky compiler options being enabled by default. One doesn't see this when building from a release tarball.
Pass along the warnings and we can address them. You can disable the picky option on the configure line, if you like.
Now I have something built with -init=snan and I can't reproduce the issue. Curiously, my executables have no dependency on libxml2.so. Can you provide the output of ldd just_init.exe to see which libhwloc you are using? The same output from the MPICH-compiled variant would also be useful.
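A minimal sketch of that check (the libhwloc path below is a guess; substitute whatever the first command reports):
# Which libhwloc does the reproducer resolve to, and does that library link libxml2 directly?
$ ldd ./just_init.exe | grep -E 'hwloc|xml2'
$ ldd /usr/lib64/libhwloc.so.15 | grep xml2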
Note this exception is occurring in a call within libxml2.so to a function in the pthread library, so I'm not sure this is something we can fix in Open MPI. You may want to try configuring Open MPI with --with-hwloc=internal. The internal hwloc package is built without --disable-plugin-dlopen set, so the xml2 dependency in that case is in a plugin, which doesn't seem to get loaded in my environment.
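After rebuilding with the internal hwloc, one way to see whether the XML plugin is actually loaded at run time is hwloc's HWLOC_PLUGINS_VERBOSE environment variable (a sketch; the exact output depends on the hwloc version and whether plugins were built at all):
# Ask hwloc to report plugin loading while running the reproducer.
$ mpirun -x HWLOC_PLUGINS_VERBOSE=1 -np 1 ./just_init.exe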
@rhc54 don't worry about PMIx and the Intel classic compilers. Users aren't supposed to be using these compilers anyway, as they (especially icc/icpc) are past end of life. Indeed, as I compiled OMPI for this issue I got a slew of warnings about using an end-of-life compiler. I'll check with the oneAPI compilers, and if they cause a problem I'll open an issue on openpmix.
Have you had a chance to try the --with-hwloc=internal configure option to see if that is a workaround for this issue?
Yes. I tried with --with-hwloc=internal --with-pmix=internal --with-libevent=internal and it still failed.
FWIW I am unable to reproduce the issue on a RHEL8 system with
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230622
$ ../../src/ompi-v5.0.x/configure --prefix=$HOME/local/ompi-v5.0.x-isan --with-hwloc=internal --with-pmix=internal --with-libevent=internal --disable-wrapper-rpath --disable-wrapper-runpath CC=icx CXX=icpx FC=ifx && make -j 24 install
$ ~/local/ompi-v5.0.x-isan/bin/mpifort -g -O0 -init=snan -traceback -o inifini inifini.F90
~/local/ompi-v5.0.x-isan/bin/mpirun -np 1 --mca btl_tcp_if_include ib0 env LD_LIBRARY_PATH=$HOME/local/ompi-v5.0.x-isan/lib:$LD_LIBRARY_PATH ./inifini
The internal hwloc does not depend on libxml2, so even if it fails in your environment, the stack trace should be different. Also, in my environment, I am unable to get it to work unless I manually set LD_LIBRARY_PATH to point to the Open MPI libraries.
Out of curiosity, are you able to reproduce the issue if you run in singleton mode (e.g. ./just_init.exe)?
Can you please post the stack trace with Open MPI built with --with-hwloc=internal?
@ggouaillardet I grabbed 5.0.1 and did:
ml comp/gcc/12.3.0 comp/intel/2023.2.1-ifx
mkdir build-intel-2023.2.1-ifx-SLES15 && cd build-intel-2023.2.1-ifx-SLES15
../configure --disable-wrapper-rpath --disable-wrapper-runpath --with-slurm \
--with-pmix=internal --with-hwloc=internal --with-libevent=internal \
CC=icx CXX=icpx FC=ifx \
--prefix=/discover/swdev/gmao_SIteam/MPI/openmpi/5.0.1-SLES15/intel-2023.2.1-ifx |& tee configure.intel-2023.2.1-ifx.log
mv config.log config.intel-2023.2.1-ifx.log
make -j6 |& tee make.intel-2023.2.1-ifx.log
make install |& tee makeinstall.intel-2023.2.1-ifx.log
make check |& tee makecheck.intel-2023.2.1-ifx.log
And:
$ mpifort -g -O0 -init=snan -traceback -o just_init.exe just_init.F90 && mpirun -np 1 ./just_init.exe
forrtl: error (65): floating invalid
Image PC Routine Line Source
libpthread-2.31.s 00001456760D9910 Unknown Unknown Unknown
libxml2.so.2.9.14 00001456711E392F xmlXPathInit Unknown Unknown
libxml2.so.2.9.14 0000145671198C03 Unknown Unknown Unknown
libxml2.so.2.9.14 0000145671193E3B xmlCheckVersion Unknown Unknown
hwloc_xml_libxml. 000014567624141D Unknown Unknown Unknown
libhwloc.so.15.5. 000014567415A687 Unknown Unknown Unknown
libhwloc.so.15.5. 0000145674147EF3 Unknown Unknown Unknown
libopen-pal.so.80 000014567527AF32 opal_hwloc_base_g Unknown Unknown
libopen-pal.so.80 0000145675262EB9 Unknown Unknown Unknown
libopen-pal.so.80 00001456751EFFCA mca_base_framewor Unknown Unknown
libopen-pal.so.80 000014567525DE46 Unknown Unknown Unknown
libopen-pal.so.80 00001456751F0BE1 mca_base_framewor Unknown Unknown
libopen-pal.so.80 00001456751F0BE1 mca_base_framewor Unknown Unknown
libmpi.so.40.40.1 00001456766EE542 ompi_mpi_instance Unknown Unknown
libmpi.so.40.40.1 00001456766E231A ompi_mpi_init Unknown Unknown
libmpi.so.40.40.1 0000145676721AA0 MPI_Init Unknown Unknown
libmpi_mpifh.so.4 0000145676A873F8 PMPI_Init_f08 Unknown Unknown
just_init.exe 0000000000409ACF a 5 just_init.F90
just_init.exe 0000000000409A8D Unknown Unknown Unknown
libc-2.31.so 0000145675EFA24D __libc_start_main Unknown Unknown
just_init.exe 00000000004099BA Unknown Unknown Unknown
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 44562 on node borgl113 exited on
signal 6 (Aborted).
--------------------------------------------------------------------------
Now, if I don't use mpirun:
$ mpifort -g -O0 -init=snan -traceback -o just_init.exe just_init.F90 && ./just_init.exe
$ echo $?
0
If you tell me the right options, I can build Open MPI with debugging symbols, etc.
I still do not get why libxml2.so gets pulled in. Can you please run mpirun -np 1 ldd ./just_init.exe to make sure the right Open MPI library is being picked up?
In order to build Open MPI with debug, you can simply configure Open MPI with --enable-debug.
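For example, a sketch based on the 5.0.1 configure line earlier in this thread (the prefix is a placeholder; adjust compilers and options for your site):
$ ../configure --enable-debug --with-pmix=internal --with-hwloc=internal --with-libevent=internal \
    --disable-wrapper-rpath --disable-wrapper-runpath --with-slurm \
    CC=icx CXX=icpx FC=ifx \
    --prefix=/path/to/openmpi-5.0.1-debug
$ make -j6 && make install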
$ mpirun -np 1 ldd ./just_init.exe
linux-vdso.so.1 (0x00007ffe50591000)
libmpi_usempif08.so.40 => /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.1-SLES15/intel-2023.2.1-ifx/lib/libmpi_usempif08.so.40 (0x000014c7e4850000)
libmpi_usempi_ignore_tkr.so.40 => /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.1-SLES15/intel-2023.2.1-ifx/lib/libmpi_usempi_ignore_tkr.so.40 (0x000014c7e4843000)
libmpi_mpifh.so.40 => /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.1-SLES15/intel-2023.2.1-ifx/lib/libmpi_mpifh.so.40 (0x000014c7e47ce000)
libmpi.so.40 => /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.1-SLES15/intel-2023.2.1-ifx/lib/libmpi.so.40 (0x000014c7e43e0000)
libimf.so => /usr/local/intel/oneapi/2021/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libimf.so (0x000014c7e3ff6000)
libm.so.6 => /lib64/libm.so.6 (0x000014c7e3e8b000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000014c7e3e67000)
libdl.so.2 => /lib64/libdl.so.2 (0x000014c7e3e62000)
libc.so.6 => /lib64/libc.so.6 (0x000014c7e3c69000)
libgcc_s.so.1 => /usr/local/other/gcc/12.3.0/lib64/libgcc_s.so.1 (0x000014c7e3c4a000)
libhcoll.so.1 => /opt/mellanox/hcoll/lib/libhcoll.so.1 (0x000014c7e3910000)
libocoms.so.0 => /opt/mellanox/hcoll/lib/libocoms.so.0 (0x000014c7e36ba000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x000014c7e349a000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000014c7e327a000)
libgpfs.so => /usr/lib64/libgpfs.so (0x000014c7e3062000)
libopen-pal.so.80 => /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.1-SLES15/intel-2023.2.1-ifx/lib/libopen-pal.so.80 (0x000014c7e2f57000)
libucp.so.0 => /usr/lib64/libucp.so.0 (0x000014c7e2c8e000)
libucs.so.0 => /usr/lib64/libucs.so.0 (0x000014c7e2624000)
libucm.so.0 => /usr/lib64/libucm.so.0 (0x000014c7e240b000)
libuct.so.0 => /usr/lib64/libuct.so.0 (0x000014c7e21d1000)
librt.so.1 => /lib64/librt.so.1 (0x000014c7e21c5000)
libpmix.so.2 => /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.1-SLES15/intel-2023.2.1-ifx/lib/libpmix.so.2 (0x000014c7e1f77000)
libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x000014c7e1f6c000)
libutil.so.1 => /lib64/libutil.so.1 (0x000014c7e1f68000)
libevent_core-2.1.so.7 => /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.1-SLES15/intel-2023.2.1-ifx/lib/libevent_core-2.1.so.7 (0x000014c7e1f34000)
libevent_pthreads-2.1.so.7 => /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.1-SLES15/intel-2023.2.1-ifx/lib/libevent_pthreads-2.1.so.7 (0x000014c7e1f30000)
libhwloc.so.15 => /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.1-SLES15/intel-2023.2.1-ifx/lib/libhwloc.so.15 (0x000014c7e1ecd000)
libifport.so.5 => /usr/local/intel/oneapi/2021/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libifport.so.5 (0x000014c7e1ea3000)
libifcoremt.so.5 => /usr/local/intel/oneapi/2021/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libifcoremt.so.5 (0x000014c7e1d29000)
libintlc.so.5 => /usr/local/intel/oneapi/2021/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libintlc.so.5 (0x000014c7e1cb1000)
libsvml.so => /usr/local/intel/oneapi/2021/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libsvml.so (0x000014c7e0682000)
libirng.so => /usr/local/intel/oneapi/2021/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libirng.so (0x000014c7e0369000)
/lib64/ld-linux-x86-64.so.2 (0x000014c7e488b000)
libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x000014c7e0147000)
libnl-route-3.so.200 => /usr/lib64/libnl-route-3.so.200 (0x000014c7dfed1000)
libz.so.1 => /lib64/libz.so.1 (0x000014c7dfcba000)
I'll do a debug build now...
I do see in the configure output:
**** libxml2 configuration
checking for LIBXML2... yes
checking for libxml/parser.h... yes
checking for xmlNewDoc... yes
checking for final LIBXML2 support... yes
**** end of libxml2 configuration
Also, the debug traceback looks identical to the non-debug one. No additional line or source output. 🤷🏼
Thanks, yes, libxml2.so is detected at configure time, but as you can see in the ldd output, libhwloc.so does not depend on it. I guess it gets pulled in indirectly by an hwloc plugin.
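The hwloc_xml_libxml frame in the traceback above points the same way. A sketch of how to confirm it (the plugin path is an assumption based on the install prefix above; internal hwloc plugins typically land under lib/hwloc):
# The XML backend plugin, rather than libhwloc itself, should carry the libxml2 dependency.
$ ldd /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.1-SLES15/intel-2023.2.1-ifx/lib/hwloc/hwloc_xml_libxml.so | grep xml2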
I will try again to reproduce the issue.
BTW, are you running from a SLURM job/allocation? Is there a free SuSE distro that is kind of similar to SLES15?
Yes, I am running from a SLURM allocation. On our cluster at least, MPI jobs aren't allowed on head nodes (or shouldn't be allowed if somehow you manage to do so).
And I believe OpenSUSE Leap 15 is the consumer equivalent of SLES 15, so OpenSUSE Leap 15.4 would be the closest match.
I am getting the same error as @mathomp4 .
(Please note that I work on the same HPC system as Matt.)
My program is not compiled with -init=snan, but it is compiled with -fpe0. According to the ifort man page:
Setting the option [Q]init snan implicitly sets the option fpe 0.
I compiled Matt's test program replacing -init=snan with -fpe0 and got a similar-looking traceback.
I removed the -fpe0 from my program, and it is now running.
I hope this helps.
Related to #12400. See the suggested workaround posted in that issue for a possible resolution to this one.
@hppritcha Indeed, I'm going to try this with Intel tomorrow (this afternoon at work got a bit...explody). My hope is that this is a fix for our system! :)
Huzzah! Note to @hppritcha and @jvgeiger: if I add --disable-libxml2, things work for Intel 2021.6 as well (at least for my small reproducer). I'm doing full builds of my stack with GCC and Intel to make sure things work in full.
Thanks!
Is there anything for Open MPI to fix here? Or is --disable-libxml2 good enough/perfect for fixing this integration issue?
I don't think so. The --disable-libxml2 gets passed to the hwloc configure, which disables compilation of the xml2-dependent parts of hwloc. It was libxml2 where the FPE was getting thrown, hence my suggestion to try this approach.
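For reference, a sketch of the workaround applied to the 5.0.1 configure line from earlier in the thread (prefix and compilers are site-specific; the key addition is --disable-libxml2, which is forwarded to the internal hwloc's configure):
$ ../configure --disable-libxml2 \
    --with-pmix=internal --with-hwloc=internal --with-libevent=internal \
    --disable-wrapper-rpath --disable-wrapper-runpath --with-slurm \
    CC=icx CXX=icpx FC=ifx \
    --prefix=/path/to/openmpi-5.0.1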
I think we are good. Both GCC and Intel seem to be good now with --disable-libxml2. I'll close this and #12400.