Number of threads has changed! when running with cp2k.sdbg and more than 64 OMP threads

Open dev-zero opened this issue 3 years ago • 5 comments

Describe the bug

We get the following error message when running certain tests with a fresh cp2k.sdbg:

 *******************************************************************************
 *   ___                                                                       *
 *  /   \                                                                      *
 * [ABORT]                                                                     *
 *  \___/                     Number of threads has changed!                   *
 *    |                                                                        *
 *  O/|                                                                        *
 * /| |                                                                        *
 * / \                                         dbcsr_iterator_operations.F:179 *
 *******************************************************************************

To Reproduce

Steps to reproduce the behavior:

  1. Built with the command: make ARCH=local VERSION=sdbg with the arch file from the toolchain
  2. Run like this: cd tests/QS/regtest-ri-rpa-rse ; ../../../exe/local/cp2k.sdbg Cubic_RPA_RSE_H2.inp
  3. On the architecture/host/platform: openSUSE LEAP 15.2, GCC 7.5.0, OpenMPI 3.1.6; system-provided libopenblas_openmp; rest from toolchain
  4. See error

Setting OMP_NUM_THREADS=64 solves the issue.
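
For scripted runs, the workaround can be applied before launching the binary. The following is only a minimal sketch: it assumes the effective limit is the 64 threads observed in this report, and `capped_omp_env` and `MAX_THREADS` are hypothetical names that are not part of CP2K or DBCSR.

```python
import os

# Assumption: 64 is the limit observed in this report, not a documented constant.
MAX_THREADS = 64


def capped_omp_env(requested, base_env=None, cap=MAX_THREADS):
    """Return a copy of the environment with OMP_NUM_THREADS clamped to `cap`."""
    if requested < 1:
        raise ValueError("thread count must be at least 1")
    # Copy so the caller's environment (or os.environ) is never modified.
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(min(requested, cap))
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting cp2k.sdbg, so a request for 72 threads is silently reduced to 64.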

dev-zero avatar Jan 08 '21 18:01 dev-zero

Reproducible with DBCSR itself, configured with: cmake -DTEST_MPI_RANKS=1 -DTEST_OMP_THREADS=72 ..:

$ make test
Running tests...
Test project /data/tiziano/cp2k/exts/dbcsr/build
      Start  1: dbcsr_perf:inputs/test_H2O.perf
 1/17 Test  #1: dbcsr_perf:inputs/test_H2O.perf .......................   Passed   72.55 sec
      Start  2: dbcsr_perf:inputs/test_rect1_dense.perf
 2/17 Test  #2: dbcsr_perf:inputs/test_rect1_dense.perf ...............   Passed    2.56 sec
      Start  3: dbcsr_perf:inputs/test_rect1_sparse.perf
 3/17 Test  #3: dbcsr_perf:inputs/test_rect1_sparse.perf ..............   Passed   10.91 sec
      Start  4: dbcsr_perf:inputs/test_rect2_dense.perf
 4/17 Test  #4: dbcsr_perf:inputs/test_rect2_dense.perf ...............   Passed    2.49 sec
      Start  5: dbcsr_perf:inputs/test_rect2_sparse.perf
 5/17 Test  #5: dbcsr_perf:inputs/test_rect2_sparse.perf ..............   Passed   10.36 sec
      Start  6: dbcsr_perf:inputs/test_singleblock.perf
 6/17 Test  #6: dbcsr_perf:inputs/test_singleblock.perf ...............   Passed    0.85 sec
      Start  7: dbcsr_perf:inputs/test_square_dense.perf
 7/17 Test  #7: dbcsr_perf:inputs/test_square_dense.perf ..............   Passed    1.09 sec
      Start  8: dbcsr_perf:inputs/test_square_sparse.perf
 8/17 Test  #8: dbcsr_perf:inputs/test_square_sparse.perf .............   Passed    3.45 sec
      Start  9: dbcsr_perf:inputs/test_square_sparse_bigblocks.perf
 9/17 Test  #9: dbcsr_perf:inputs/test_square_sparse_bigblocks.perf ...   Passed    1.62 sec
      Start 10: dbcsr_unittest1
10/17 Test #10: dbcsr_unittest1 .......................................   Passed  1372.54 sec
      Start 11: dbcsr_unittest2
11/17 Test #11: dbcsr_unittest2 .......................................   Passed  236.76 sec
      Start 12: dbcsr_unittest3
12/17 Test #12: dbcsr_unittest3 .......................................   Passed  308.31 sec
      Start 13: dbcsr_unittest4
13/17 Test #13: dbcsr_unittest4 .......................................   Passed    0.89 sec
      Start 14: dbcsr_tensor_unittest
14/17 Test #14: dbcsr_tensor_unittest .................................***Failed    4.51 sec
      Start 15: dbcsr_tas_unittest
15/17 Test #15: dbcsr_tas_unittest ....................................   Passed    3.59 sec
      Start 16: dbcsr_test_csr_conversions
16/17 Test #16: dbcsr_test_csr_conversions ............................   Passed   10.47 sec
      Start 17: dbcsr_tensor_test
17/17 Test #17: dbcsr_tensor_test .....................................   Passed    0.73 sec

94% tests passed, 1 tests failed out of 17

Total Test time (real) = 2043.68 sec

The following tests FAILED:
	14 - dbcsr_tensor_unittest (Failed)
Errors while running CTest
make: *** [Makefile:124: test] Error 8

and Testing/Temporary/LastTest.log shows for the relevant test:

[...]
--------------------------------------------------------------------------------
TAS MATRIX MULTIPLICATION DONE
--------------------------------------------------------------------------------
 GLOBAL INFO OF (14|25)
   block dimensions:      4     5    11     3
   full dimensions:       25      32      83      28
   process grid dimensions:      1     1     1     1

 DISTRIBUTION OF (14|25)
              Number of non-zero blocks:                                      26
              Percentage of non-zero blocks:                                3.94
              Average number of blocks per CPU:                               26
              Maximum number of blocks per CPU:                               26
              Average number of matrix elements per CPU:                   64680
              Maximum number of matrix elements per CPU:                   64680

 *******************************************************************************
 *   ___                                                                       *
 *  /   \                                                                      *
 * [ABORT]                                                                     *
 *  \___/                     Number of threads has changed!                   *
 *    |                                                                        *
 *  O/|                                                                        *
 * /| |                                                                        *
 * / \                                         dbcsr_iterator_operations.F:179 *
 *******************************************************************************


 ===== Routine Calling Stack ===== 

            4 dbcsr_iterator_start
            3 dbcsr_filter_anytype
            2 dbcsr_t_contract
            1 dbcsr_t_total
[...]

dev-zero avatar Jan 08 '21 19:01 dev-zero

It seems that the number of OMP threads gets capped to 64 at some point.

dev-zero avatar Jan 13 '21 09:01 dev-zero

Maybe this is caused by the NUM_THREADS=64 in install_openblas.sh?

oschuett avatar Jan 15 '21 09:01 oschuett

Could be; it should be easy to verify by testing against other BLAS implementations (ref-lapack, mkl, libsci). If it is indeed OpenBLAS, the question is what we should do. The DBCSR-only test above was run with a system-provided OpenBLAS on an openSUSE system.

We can:

  • explicitly check OpenBLAS for number of threads when linking against OpenBLAS at initialization
  • implicitly check by calling a BLAS routine at initialization
  • reinit the first time this happens and then restrict future allocs to that lower number
  • leave it to the user to provide a stable OMP env before calling into DBCSR
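
As a rough illustration of the third option, here is a toy model in Python, not DBCSR code. The class `ThreadBudget` and its methods are hypothetical, and the assumption is that the library records a thread count at initialization and later compares it with what the runtime actually provides.

```python
class ThreadBudget:
    """Toy model: adopt a lower thread count once, then stick to it."""

    def __init__(self, initial):
        self.limit = initial          # thread count assumed at initialization
        self.reinitialized = False    # whether the one-time reinit has happened

    def observe(self, actual):
        """Compare the runtime's thread count with the recorded limit."""
        if actual == self.limit:
            return self.limit
        if actual < self.limit and not self.reinitialized:
            # First mismatch: reinitialize with the lower count.
            self.limit = actual
            self.reinitialized = True
            return self.limit
        # Any further change is still treated as an error.
        raise RuntimeError("Number of threads has changed!")

    def request(self, wanted):
        """Restrict future allocations to the recorded limit."""
        return min(wanted, self.limit)
```

In this model, starting with 72 threads and then observing 64 lowers the limit to 64 once; later requests for 72 are cut to 64, and any subsequent change still aborts.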

dev-zero avatar Jan 15 '21 10:01 dev-zero

Could now reproduce this on a new (Apple silicon) MacBook Air. The only way around was to set OMP_NUM_THREADS=1. Sorry, this was actually related to not building with FFTW3, see cp2k/cp2k#1315

dev-zero avatar Jan 18 '21 12:01 dev-zero