
mpich test failure on s390x

opoplawski opened this issue (open, 2 comments)

Describe the bug

I'm working on a Fedora package for DBCSR. I'm getting test failures with mpich on s390x.

To Reproduce

/usr/bin/ctest --test-dir redhat-linux-build-mpich --output-on-failure --force-new-ctest-process -j3
Internal ctest changing into directory: /builddir/build/BUILD/dbcsr-2.6.0/redhat-linux-build-mpich
Test project /builddir/build/BUILD/dbcsr-2.6.0/redhat-linux-build-mpich
      Start  1: dbcsr_perf:inputs/test_H2O.perf
      Start  2: dbcsr_perf:inputs/test_rect1_dense.perf
      Start  3: dbcsr_perf:inputs/test_rect1_sparse.perf
 1/19 Test  #3: dbcsr_perf:inputs/test_rect1_sparse.perf ..............***Failed    2.10 sec
 DBCSR| CPU Multiplication driver                                           BLAS (D)
 DBCSR| Multrec recursion limit                                              512 (D)
 DBCSR| Multiplication stack size                                           1000 (D)
 DBCSR| Maximum elements for images                                    UNLIMITED (D)
 DBCSR| Multiplicative factor virtual images                                   1 (D)
 DBCSR| Use multiplication densification                                       T (D)
 DBCSR| Multiplication size stacks                                             3 (D)
 DBCSR| Use memory pool for CPU allocation                                     F (D)
 DBCSR| Number of 3D layers                                               SINGLE (D)
 DBCSR| Use MPI memory allocation                                              F (D)
 DBCSR| Use RMA algorithm                                                      F (U)
 DBCSR| Use Communication thread                                               T (D)
 DBCSR| Communication thread load                                            100 (D)
 DBCSR| MPI: My process id                                                     0
 DBCSR| MPI: Number of processes                                               2
 DBCSR| OMP: Current number of threads                                         2
 DBCSR| OMP: Max number of threads                                             2
 DBCSR| Split modifier for TAS multiplication algorithm                  1.0E+00 (D)
 numthreads           2
 numnodes           2
 matrix_sizes        5000        1000        1000
 sparsities  0.90000000000000002       0.90000000000000002       0.90000000000000002     
 trans NN
 symmetries NNN
 type            3
 alpha_in   1.0000000000000000        0.0000000000000000     
 beta_in   1.0000000000000000        0.0000000000000000     
 limits           1        5000           1        1000           1        1000
 retain_sparsity F
 nrep          10
 bs_m           1           5
 bs_n           1           5
 bs_k           1           5
 MPI error 5843983 in mpi_barrier @ mp_sync : Other MPI error, error stack:
 internal_Barrier(84).......................: MPI_Barrier(comm=0x84000001) failed
 MPID_Barrier(167)..........................:
 MPIDI_Barrier_allcomm_composition_json(132):
 MPIDI_POSIX_mpi_bcast(219).................:
 MPIDI_POSIX_mpi_bcast_release_gather(132)..:
 MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
 dbcsr_mpiwrap.F:1186
 ===== Routine Calling Stack ===== 
            4 mp_sync
            3 perf_multiply
            2 dbcsr_perf_multiply_low
            1 dbcsr_performance_driver
Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
STOP 1
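For narrowing this down on the builder, the failing case can be rerun in isolation rather than through the full suite. A sketch (the `-R` regex matches the ctest label above; the direct `mpirun` line assumes the default binary and input layout of the build tree, so adjust paths as needed):

```shell
# Rerun only the failing perf test, with verbose output
ctest --test-dir redhat-linux-build-mpich -R 'test_rect1_sparse' \
      --output-on-failure --verbose

# Or invoke the perf driver directly under mpich
# (binary and input paths are assumptions about the build layout)
cd redhat-linux-build-mpich
mpirun -np 2 tests/dbcsr_perf ../tests/inputs/test_rect1_sparse.perf
```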

I don't see test failures with openmpi. One difference is that mpich is being built with -DUSE_MPI_F08=ON.
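One quick way to tell whether the F08 binding itself is the trigger is a second mpich build with the binding switched off, run against the same suite. A sketch (only the relevant flags are shown; the build directory name is made up, and the remaining flags stay as listed under Environment below):

```shell
# Configure a second mpich build using the legacy "mpi" module
# instead of "mpi_f08", leaving everything else unchanged
cmake -S . -B redhat-linux-build-mpich-nof08 \
      -DUSE_MPI_F08=OFF \
      -DCMAKE_PREFIX_PATH:PATH=/usr/lib64/mpich
cmake --build redhat-linux-build-mpich-nof08 -j
ctest --test-dir redhat-linux-build-mpich-nof08 --output-on-failure
```

If the suite passes with `-DUSE_MPI_F08=OFF` on the same toolchain, that points at the F08 interface path rather than s390x or mpich in general.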

Environment:

  • Operating system & version: Fedora Rawhide
  • Compiler vendor & version: gcc 13.2.1
  • Build environment (make or cmake): cmake
  • Configuration of DBCSR (either the cmake flags or the Makefile.inc):
    /usr/bin/cmake -S . -B redhat-linux-build-mpich \
      -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG \
      -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG \
      -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG \
      -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
      -DCMAKE_INSTALL_DO_STRIP:BOOL=OFF \
      -DCMAKE_INSTALL_PREFIX:PATH=/usr \
      -DINCLUDE_INSTALL_DIR:PATH=/usr/include \
      -DLIB_INSTALL_DIR:PATH=/usr/lib64 \
      -DSYSCONF_INSTALL_DIR:PATH=/etc \
      -DSHARE_INSTALL_PREFIX:PATH=/usr/share \
      -DLIB_SUFFIX=64 \
      -DBUILD_SHARED_LIBS:BOOL=ON \
      -DCMAKE_INSTALL_Fortran_MODULES=/usr/lib64/gfortran/modules/mpich \
      -DUSE_MPI_F08=ON \
      -DCMAKE_PREFIX_PATH:PATH=/usr/lib64/mpich \
      -DCMAKE_INSTALL_PREFIX:PATH=/usr/lib64/mpich \
      -DCMAKE_INSTALL_LIBDIR:PATH=lib
  • MPI implementation and version: mpich 4.1.2
  • If CUDA is being used, CUDA version and GPU architecture: no CUDA
  • BLAS/LAPACK implementation and version: flexiblas 3.3.1 -> openblas 0.3.21

opoplawski commented Sep 11 '23

I've realized that we are not testing with MPI_F08 in our CI. However, we did run such a test here https://github.com/cp2k/dbcsr/issues/661#issuecomment-1621787249 and it worked; the only difference was GCC 13.1. I will add the test to the CI. In the meantime, two things to try:

  1. Could you build with F08 and OpenMPI?
  2. Any chance you can use GCC 13.1 and mpich with F08 in DBCSR?

alazzaro commented Dec 12 '23

I've enabled -DUSE_MPI_F08=ON for the openmpi builds as well. Scratch builds are here (available for a week or two):

F40 - gcc 13.2.1 mpich 4.1.2 - https://koji.fedoraproject.org/koji/taskinfo?taskID=110306721

Tests are still failing.

We are stuck with the compiler version shipped in the distribution, which is 13.2.1 in all current Fedora releases.

Interestingly, though, the tests are succeeding on F38:

https://koji.fedoraproject.org/koji/taskinfo?taskID=110306885

which uses mpich 4.0.3. So maybe it's more of an mpich issue than a DBCSR one, though mpich's own basic test suite passes.

Also different: openblas 0.3.21 -> 0.3.25

opoplawski commented Dec 14 '23