mpich test failure on s390x
Describe the bug
I'm working on a Fedora package for dbcsr. I'm getting test failures with mpich on s390x.
To Reproduce
/usr/bin/ctest --test-dir redhat-linux-build-mpich --output-on-failure --force-new-ctest-process -j3
Internal ctest changing into directory: /builddir/build/BUILD/dbcsr-2.6.0/redhat-linux-build-mpich
Test project /builddir/build/BUILD/dbcsr-2.6.0/redhat-linux-build-mpich
Start 1: dbcsr_perf:inputs/test_H2O.perf
Start 2: dbcsr_perf:inputs/test_rect1_dense.perf
Start 3: dbcsr_perf:inputs/test_rect1_sparse.perf
1/19 Test #3: dbcsr_perf:inputs/test_rect1_sparse.perf ..............***Failed 2.10 sec
DBCSR| CPU Multiplication driver BLAS (D)
DBCSR| Multrec recursion limit 512 (D)
DBCSR| Multiplication stack size 1000 (D)
DBCSR| Maximum elements for images UNLIMITED (D)
DBCSR| Multiplicative factor virtual images 1 (D)
DBCSR| Use multiplication densification T (D)
DBCSR| Multiplication size stacks 3 (D)
DBCSR| Use memory pool for CPU allocation F (D)
DBCSR| Number of 3D layers SINGLE (D)
DBCSR| Use MPI memory allocation F (D)
DBCSR| Use RMA algorithm F (U)
DBCSR| Use Communication thread T (D)
DBCSR| Communication thread load 100 (D)
DBCSR| MPI: My process id 0
DBCSR| MPI: Number of processes 2
DBCSR| OMP: Current number of threads 2
DBCSR| OMP: Max number of threads 2
DBCSR| Split modifier for TAS multiplication algorithm 1.0E+00 (D)
numthreads 2
numnodes 2
matrix_sizes 5000 1000 1000
sparsities 0.90000000000000002 0.90000000000000002 0.90000000000000002
trans NN
symmetries NNN
type 3
alpha_in 1.0000000000000000 0.0000000000000000
beta_in 1.0000000000000000 0.0000000000000000
limits 1 5000 1 1000 1 1000
retain_sparsity F
nrep 10
bs_m 1 5
bs_n 1 5
bs_k 1 5
*******************************************************************************
* MPI error 5843983 in mpi_barrier @ mp_sync : Other MPI error, error stack:
* internal_Barrier(84).......................:
*   MPI_Barrier(comm=0x84000001) failed
* MPID_Barrier(167)..........................:
* MPIDI_Barrier_allcomm_composition_json(132):
* MPIDI_POSIX_mpi_bcast(219).................:
* MPIDI_POSIX_mpi_bcast_release_gather(132)..:
* MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not
*   match across processes in the collective routine: Received 0 but
*   expected 1
* dbcsr_mpiwrap.F:1186
*******************************************************************************
===== Routine Calling Stack =====
4 mp_sync
3 perf_multiply
2 dbcsr_perf_multiply_low
1 dbcsr_performance_driver
Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
STOP 1
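For what it's worth, the MPICH message above is its generic consistency check that all ranks agree on the payload size of a collective. A hypothetical standalone program (not DBCSR code, and not necessarily what happens inside mp_sync) that trips the same class of error looks like this:

```fortran
! Hypothetical reproducer for the *class* of error MPICH reports:
! two ranks disagreeing on the count of a collective. Not DBCSR code.
program size_mismatch
  use mpi_f08
  implicit none
  integer :: rank, ierr
  integer :: buf(1)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) then
     ! Root broadcasts one integer ...
     call MPI_Bcast(buf, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  else
     ! ... but the other rank expects zero: "Received 0 but expected 1"
     call MPI_Bcast(buf, 0, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  end if

  call MPI_Finalize(ierr)
end program size_mismatch
```

In the log the mismatch is reported from inside MPI_Barrier itself (release_gather uses an internal bcast), which suggests the disagreement arises inside MPICH's shared-memory collective path rather than from DBCSR passing different counts.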
I don't see test failures with openmpi. One difference is that mpich is being built with -DUSE_MPI_F08=ON.
Environment:
- Operating system & version: Fedora Rawhide
- Compiler vendor & version: gcc 13.2.1
- Build environment (make or cmake): cmake
- Configuration of DBCSR (cmake flags): /usr/bin/cmake -S . -B redhat-linux-build-mpich -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_INSTALL_DO_STRIP:BOOL=OFF -DCMAKE_INSTALL_PREFIX:PATH=/usr -DINCLUDE_INSTALL_DIR:PATH=/usr/include -DLIB_INSTALL_DIR:PATH=/usr/lib64 -DSYSCONF_INSTALL_DIR:PATH=/etc -DSHARE_INSTALL_PREFIX:PATH=/usr/share -DLIB_SUFFIX=64 -DBUILD_SHARED_LIBS:BOOL=ON -DCMAKE_INSTALL_Fortran_MODULES=/usr/lib64/gfortran/modules/mpich -DUSE_MPI_F08=ON -DCMAKE_PREFIX_PATH:PATH=/usr/lib64/mpich -DCMAKE_INSTALL_PREFIX:PATH=/usr/lib64/mpich -DCMAKE_INSTALL_LIBDIR:PATH=lib
- MPI implementation and version: mpich 4.1.2
- If CUDA is being used, CUDA version and GPU architecture: no CUDA
- BLAS/LAPACK implementation and version: flexiblas 3.3.1 -> openblas 0.3.21
I've realized that we are not testing with MPI_F08 in our CI; however, we did test it here https://github.com/cp2k/dbcsr/issues/661#issuecomment-1621787249 and it worked. The only difference was GCC 13.1. I will add the test to the CI. In the meantime, I see some actions here:
- Could you build with F08 and OpenMPI?
- Any chance you can use GCC 13.1 and mpich with F08 in DBCSR?
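For context, -DUSE_MPI_F08=ON selects the Fortran 2008 mpi_f08 bindings instead of the older mpi module. A minimal sketch of what changes at a call site (a hypothetical standalone program, not DBCSR's actual wrapper code):

```fortran
! Hypothetical illustration of the interface USE_MPI_F08 toggles.
program sync_example
  use mpi_f08          ! with -DUSE_MPI_F08=ON; 'use mpi' otherwise
  implicit none
  integer :: ierr

  call MPI_Init(ierr)
  ! Under mpi_f08, MPI_COMM_WORLD is a derived type(MPI_Comm) and the
  ! routines have explicit interfaces with compile-time argument
  ! checking; under 'use mpi' it is a plain integer handle.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  call MPI_Finalize(ierr)
end program sync_example
```

Since the bindings differ only in how arguments are passed to the same MPI library, a failure that appears only with mpi_f08 on one architecture could point at either the compiler's handling of the F08 interfaces or the MPI implementation itself.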
I've enabled -DUSE_MPI_F08=ON for the openmpi builds as well. Scratch builds are here (available for a week or two):
F40 - gcc 13.2.1, mpich 4.1.2 - https://koji.fedoraproject.org/koji/taskinfo?taskID=110306721
Tests are still failing.
We are stuck with the compiler version shipped in the distribution, which is 13.2.1 in all current Fedora releases.
Interestingly though, the tests are succeeding in F38:
https://koji.fedoraproject.org/koji/taskinfo?taskID=110306885
which is with mpich 4.0.3. So maybe it's more of an mpich issue than a DBCSR one, though mpich's own basic test suite is passing.
Also different: openblas 0.3.21 -> 0.3.25