invalid memory reference during builds for s390x
Expected behavior
All builds for s390x should be successful for openSUSE Tumbleweed.
Actual behavior
The arpack-ng MPI packages are failing because of an invalid memory reference.
Error messages
arpack-ng:openmpi1
[ 79s] 9/9 Test #9: issue46_tst ......................***Exception: SegFault 0.02 sec
[ 79s]
[ 79s] Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
[ 79s]
[ 79s] Backtrace for this error:
[ 79s] #0 0x3ff9cfa378b in ???
[ 79s] #1 0x3ff9cfa2719 in ???
[ 79s] #2 0x3ff9e0fe487 in ???
[ 79s] #3 0x3ff9c97fc08 in ???
[ 79s] #4 0x3ff9c987391 in ???
[ 79s] #5 0x3ff9c992ae5 in ???
[ 79s] #6 0x3ff9c972197 in ???
[ 79s] #7 0x3ff9ca4b41d in ???
[ 79s] #8 0x3ff9ca699c7 in ???
[ 79s] #9 0x3ff9cf5ac99 in ???
[ 79s] #10 0x2aa2b1027ed in issue46
[ 79s] at /home/abuild/rpmbuild/BUILD/arpack-ng-3.8.0/PARPACK/TESTS/MPI/issue46.f:15
[ 79s] #11 0x2aa2b1013bf in main
[ 79s] at /home/abuild/rpmbuild/BUILD/arpack-ng-3.8.0/PARPACK/TESTS/MPI/issue46.f:32
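For orientation, the frames at issue46.f:15 and issue46.f:32 point into the test's MPI setup. Only the MPI_COMM_RANK/MPI_COMM_SIZE calls below are confirmed (by the compiler output quoted later in this thread); the rest of this free-form sketch is an assumption about the surrounding code:

```fortran
! Sketch of the start of PARPACK/TESTS/MPI/issue46.f; only the two
! MPI_COMM_* calls are confirmed by the logs in this issue, the
! remaining lines are assumptions.
program issue46_sketch
  implicit none
  include 'mpif.h'
  integer :: myid, nprocs, ierr
  call MPI_INIT(ierr)                              ! crash reported near here (line 15)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)   ! issue46.f:16 in the logs
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr) ! issue46.f:17 in the logs
  ! ... test body elided ...
  call MPI_FINALIZE(ierr)
end program issue46_sketch
```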
arpack-ng:openmpi2
[ 73s] 8/9 Test #9: issue46_tst ......................***Failed 0.01 sec
[ 73s] [s390zl28:02838] *** Process received signal ***
[ 73s] [s390zl28:02838] Signal: Segmentation fault (11)
[ 73s] [s390zl28:02838] Signal code: Address not mapped (1)
[ 73s] [s390zl28:02838] Failing at address: 0xfffffffffffff000
[ 73s] [s390zl28:02838] [ 0] linux-vdso64.so.1(__kernel_rt_sigreturn+0x0)[0x3ff9287e490]
[ 73s] [s390zl28:02838] [ 1] /usr/lib64/mpi/gcc/openmpi2/lib64/libopen-pal.so.20(+0x8d408)[0x3ff9240d408]
[ 73s] [s390zl28:02838] [ 2] /usr/lib64/mpi/gcc/openmpi2/lib64/libopen-pal.so.20(+0x94bd8)[0x3ff92414bd8]
[ 73s] [s390zl28:02838] [ 3] /usr/lib64/mpi/gcc/openmpi2/lib64/libopen-pal.so.20(opal_hwloc1112_hwloc_topology_load+0xd6)[0x3ff92423d36]
[ 73s] [s390zl28:02838] [ 4] /usr/lib64/mpi/gcc/openmpi2/lib64/libopen-pal.so.20(opal_hwloc_base_get_topology+0x78)[0x3ff923fe268]
[ 73s] [s390zl28:02838] [ 5] /usr/lib64/mpi/gcc/openmpi2/lib64/openmpi/mca_ess_hnp.so(+0x5380)[0x3ff92205380]
[ 73s] [s390zl28:02838] [ 6] /usr/lib64/mpi/gcc/openmpi2/lib64/libopen-rte.so.20(orte_init+0x25c)[0x3ff9271a29c]
[ 73s] [s390zl28:02838] [ 7] /usr/lib64/mpi/gcc/openmpi2/lib64/libopen-rte.so.20(orte_daemon+0x1ce)[0x3ff92739ba6]
[ 73s] [s390zl28:02838] [ 8] /lib64/libc.so.6(+0x33926)[0x3ff924b3926]
[ 73s] [s390zl28:02838] [ 9] /lib64/libc.so.6(__libc_start_main+0xa0)[0x3ff924b3a08]
[ 73s] [s390zl28:02838] [10] orted(+0x928)[0x2aa30b80928]
[ 73s] [s390zl28:02838] *** End of error message ***
[ 73s] [s390zl28:02836] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 582
[ 73s] [s390zl28:02836] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 166
arpack-ng:openmpi3
[ 118s] 9/9 Test #9: issue46_tst ......................***Failed 0.17 sec
[ 118s] [s390zp25:02863] *** Process received signal ***
[ 118s] [s390zp25:02863] Signal: Segmentation fault (11)
[ 118s] [s390zp25:02863] Signal code: Address not mapped (1)
[ 118s] [s390zp25:02863] Failing at address: 0xfffffffffffff000
[ 118s] [s390zp25:02863] [ 0] linux-vdso64.so.1(__kernel_rt_sigreturn+0x0)[0x3ffb38fe490]
[ 118s] [s390zp25:02863] [ 1] /usr/lib64/mpi/gcc/openmpi3/lib64/libopen-pal.so.40(+0x98666)[0x3ffb3498666]
[ 118s] [s390zp25:02863] [ 2] /usr/lib64/mpi/gcc/openmpi3/lib64/libopen-pal.so.40(+0xa051a)[0x3ffb34a051a]
[ 118s] [s390zp25:02863] [ 3] /usr/lib64/mpi/gcc/openmpi3/lib64/libopen-pal.so.40(opal_hwloc1117_hwloc_topology_load+0xf8)[0x3ffb34b0640]
[ 118s] [s390zp25:02863] [ 4] /usr/lib64/mpi/gcc/openmpi3/lib64/libopen-pal.so.40(opal_hwloc_base_get_topology+0x4c8)[0x3ffb348b438]
[ 118s] [s390zp25:02863] [ 5] /usr/lib64/mpi/gcc/openmpi3/lib64/openmpi/mca_ess_hnp.so(+0x5af4)[0x3ffb3105af4]
[ 118s] [s390zp25:02863] [ 6] /usr/lib64/mpi/gcc/openmpi3/lib64/libopen-rte.so.40(orte_init+0x2ce)[0x3ffb380f3ce]
[ 118s] [s390zp25:02863] [ 7] /usr/lib64/mpi/gcc/openmpi3/lib64/libopen-rte.so.40(orte_daemon+0x25e)[0x3ffb37be476]
[ 118s] [s390zp25:02863] [ 8] /lib64/libc.so.6(+0x33926)[0x3ffb3533926]
[ 118s] [s390zp25:02863] [ 9] /lib64/libc.so.6(__libc_start_main+0xa0)[0x3ffb3533a08]
[ 118s] [s390zp25:02863] [10] orted(+0x928)[0x2aa16900928]
[ 118s] [s390zp25:02863] *** End of error message ***
[ 118s] [s390zp25:02861] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 532
[ 118s] [s390zp25:02861] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 166
Where/how to reproduce the problem
- arpack-ng: 3.8.0
- OS: openSUSE Tumbleweed for the s390x architecture
- compiler: gcc-c++-11-6.1 openmpi3-3.1.6-4.1 cmake-3.23.0-1.1 gcc-fortran-11-6.1
- environment:
  export 'FFLAGS=-O2 -g -m64 -fmessage-length=0 -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -fPIC'
  export 'FCFLAGS=-O2 -g -m64 -fmessage-length=0 -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -fPIC'
  export 'CFLAGS=-O2 -g -m64 -fmessage-length=0 -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -fPIC'
  export 'CXXFLAGS=-O2 -g -m64 -fmessage-length=0 -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -fPIC'
  export LD_LIBRARY_PATH=/usr/lib64/mpi/gcc/openmpi3/lib64
  export CC=/usr/lib64/mpi/gcc/openmpi3/bin/mpicc
  export CXX=/usr/lib64/mpi/gcc/openmpi3/bin/mpic++
  export F77=/usr/lib64/mpi/gcc/openmpi3/bin/mpif77
  export MPIF77=/usr/lib64/mpi/gcc/openmpi3/bin/mpif77
- configure: /usr/bin/cmake /home/abuild/rpmbuild/BUILD/arpack-ng-3.8.0/. '-GUnix Makefiles' -DCMAKE_INSTALL_PREFIX:PATH=/usr -DINCLUDE_INSTALL_DIR:PATH=/usr/include -DLIB_INSTALL_DIR:PATH=/usr/lib64 -DSYSCONF_INSTALL_DIR:PATH=/etc -DSHARE_INSTALL_PREFIX:PATH=/usr/share -DCMAKE_INSTALL_LIBDIR:PATH=lib64 -DCMAKE_INSTALL_LIBEXECDIR=/usr/libexec -DCMAKE_BUILD_TYPE=RelWithDebInfo '-DCMAKE_C_FLAGS=-O2 -g -m64 -fmessage-length=0 -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -fPIC -DNDEBUG' '-DCMAKE_CXX_FLAGS=-O2 -g -m64 -fmessage-length=0 -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -fPIC -DNDEBUG' '-DCMAKE_Fortran_FLAGS=-O2 -g -m64 -fmessage-length=0 -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -fPIC -DNDEBUG' '-DCMAKE_EXE_LINKER_FLAGS=-flto=auto -Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' '-DCMAKE_MODULE_LINKER_FLAGS=-flto=auto -Wl,--as-needed' '-DCMAKE_SHARED_LINKER_FLAGS=-flto=auto -Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' -DLIB_SUFFIX=64 -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DBUILD_SHARED_LIBS:BOOL=ON -DBUILD_STATIC_LIBS:BOOL=OFF -DCMAKE_COLOR_MAKEFILE:BOOL=OFF -DCMAKE_INSTALL_DO_STRIP:BOOL=OFF -DCMAKE_MODULES_INSTALL_DIR=/usr/lib64/cmake/parpack-openmpi3 -DCMAKE_INSTALL_PREFIX:PATH=/usr/lib64/mpi/gcc/openmpi3 -DCMAKE_INSTALL_LIBDIR:PATH=/usr/lib64/mpi/gcc/openmpi3/lib64 -DCMAKE_SKIP_RPATH:BOOL=OFF -DCMAKE_SKIP_INSTALL_RPATH:BOOL=ON -DCMAKE_CXX_COMPILER_VERSION=11.2.1 -DMPI:BOOL=ON -DPYTHON3:BOOL=OFF
- configuration summary:
[ 66s] -- Configuration summary for arpack-ng-3.8.0:
[ 66s] -- prefix: /usr/lib64/mpi/gcc/openmpi3
[ 66s] -- MPI: ON
[ 66s] -- ICB: OFF
[ 66s] -- INTERFACE64: 0
[ 66s] -- FC: /usr/bin/gfortran
[ 66s] -- FCFLAGS: -O2 -g -DNDEBUG -O2 -g -m64 -fmessage-length=0 -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -fPIC -DNDEBUG -cpp -ffixed-line-length-none
[ 66s] -- MPIFC:
[ 66s] -- compile: /usr/lib64/mpi/gcc/openmpi3/include
[ 66s] -- compile: /usr/lib64/mpi/gcc/openmpi3/lib64
[ 66s] -- link: /usr/lib64/mpi/gcc/openmpi3/lib64/libmpi_usempif08.so
[ 66s] -- link: /usr/lib64/mpi/gcc/openmpi3/lib64/libmpi_usempi_ignore_tkr.so
[ 66s] -- link: /usr/lib64/mpi/gcc/openmpi3/lib64/libmpi_mpifh.so
[ 66s] -- link: /usr/lib64/mpi/gcc/openmpi3/lib64/libmpi.so
[ 66s] -- BLAS:
[ 66s] -- link: /usr/lib64/libopenblas.so
[ 66s] -- LAPACK:
[ 66s] -- link: -lm
[ 66s] -- link: -ldl
[ 66s] -- link: BLAS::BLAS
[ 66s] -- Configuring done
Steps to reproduce the problem
- Build the arpack-ng openmpi modules for s390x on openSUSE Tumbleweed.
- arpack-ng:openmpi1 through arpack-ng:openmpi4 are failing.
- The reason is a segmentation fault caused by an invalid memory reference.
Error message
Identical to the arpack-ng:openmpi3 backtrace shown above.
Notes, remarks
Feel free to propose a PR. We can merge it when CI is back (CI down for now).
Back when CI was handled with Travis (commit 51b299cba0cde3c27817309eb48a3d7f890967e9), the openSUSE build was OK.
Is s390x this arch: https://en.wikipedia.org/wiki/IBM_System/390? If so, it seems to be a 32-bit arch, but you compile with -m64.
IBM provides both s390 and s390x. s390 is 32-bit; s390x, the architecture we chose, is 64-bit. We build only for the 64-bit mainframe architecture. This is a good explanation of the addressing modes: https://www.ibm.com/docs/en/cics-ts/5.6?topic=basics-24-bit-31-bit-64-bit-addressing
The statement "64-bit architecture, uses 64-bit storage addresses and 64-bit integer arithmetic and logical instructions" comes from these slides: KIT Z Architecture lecture
That is a nice addition with deeper insight into the memory management. I will see whether I can find the issue this week based on these two tutorials/lecture slides.
linux-vdso64.so is in use. The vdso(7) manpage says the following about the s390x architecture: "The table below lists the symbols exported by the vDSO."
symbol                   version
──────────────────────────────────────
__kernel_clock_getres    LINUX_2.6.29
__kernel_clock_gettime   LINUX_2.6.29
__kernel_gettimeofday    LINUX_2.6.29
"64-bit architecture, uses 64-bit storage addresses and 64-bit integer arithmetic and logical instructions"
If s390x uses 64-bit integers, you may need to set INTERFACE64=1
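For context: with gfortran, INTERFACE64=1 typically means arpack-ng is compiled with -fdefault-integer-8, which promotes every plain INTEGER to 8 bytes while explicitly-kinded integers are unchanged. A minimal sketch of the effect (illustrative only):

```fortran
! Effect of -fdefault-integer-8 (what INTERFACE64=1 typically enables
! for gfortran): plain INTEGER becomes kind 8, explicit kinds do not.
program kinds
  implicit none
  integer    :: n  ! kind 8 under -fdefault-integer-8, otherwise kind 4
  integer(4) :: m  ! always kind 4
  print *, 'default integer kind:', kind(n), ' explicit kind:', kind(m)
end program kinds
```

This matters because the Open MPI Fortran interfaces here still expect 4-byte default integers, which leads directly to the mismatch shown in the next comment.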
I have set INTERFACE64=1 in our spec file.
Now I get a new error message:
[ 21s] /home/abuild/rpmbuild/BUILD/arpack-ng-3.8.0/PARPACK/TESTS/MPI/issue46.f:16:26:
[ 21s]
[ 21s] 16 | call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
[ 21s] | 1
[ 21s] ......
[ 21s] 113 | call MPI_COMM_RANK( comm, myid, ierr )
[ 21s] | 2
[ 21s] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(8)/INTEGER(4)).
[ 21s] /home/abuild/rpmbuild/BUILD/arpack-ng-3.8.0/PARPACK/TESTS/MPI/issue46.f:17:26:
[ 21s]
[ 21s] 17 | call MPI_COMM_SIZE( MPI_COMM_WORLD, nprocs, ierr )
[ 21s] | 1
[ 21s] ......
[ 21s] 114 | call MPI_COMM_SIZE( comm, nprocs, ierr )
[ 21s] | 2
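gfortran 10 and later report exactly this diagnostic when one implicitly-interfaced procedure is called with different integer kinds in the same file; here -fdefault-integer-8 promotes some actual arguments (such as MPI_COMM_WORLD from mpif.h) to INTEGER(8) while others stay INTEGER(4). A minimal sketch that reproduces this class of error (hypothetical helper name; it is expected to fail compilation, which is the point):

```fortran
! Compile with: gfortran -fdefault-integer-8 -c sketch.f90
! gfortran 10+ rejects this with the same "Type mismatch between
! actual argument at (1) and actual argument at (2)" error as above.
subroutine demo
  implicit none
  integer    :: a  ! promoted to INTEGER(8) by -fdefault-integer-8
  integer(4) :: b  ! stays INTEGER(4)
  call helper(a)   ! actual argument at (1)
  call helper(b)   ! actual argument at (2) -> INTEGER(8)/INTEGER(4)
end subroutine demo
```

On gfortran 10+, -fallow-argument-mismatch downgrades this error to a warning, but the robust fix is to use one consistent integer kind for all MPI arguments in the tests.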
Not even surprised! :D Years ago, I tried to PR a patch for this problem, but the CI was failing for reasons I never understood?!... I guess at the time, the CI boxes were old and shipped with an openmpi version (module mpi_f08) that didn't fully implement the 2008 Fortran standard...
I should retrieve the commit and PR it soon: hopefully CI won't break this time, and it may fix this problem too...
@skriesch: if #368 does not break the CI, try to check out the PR branch and test whether it fixes your issue.
The patch was not compatible with version 3.8.0. Therefore, I tested it against the master branch. We got new error messages (and more).