dbcsr
Number of threads has changed! when running with cp2k.sdbg and more than 64 OMP threads
Describe the bug
We get the following error message when running certain tests with a fresh cp2k.sdbg:
*******************************************************************************
* ___ *
* / \ *
* [ABORT] *
* \___/ Number of threads has changed! *
* | *
* O/| *
* /| | *
* / \ dbcsr_iterator_operations.F:179 *
*******************************************************************************
To Reproduce
Steps to reproduce the behavior:
- Built with the command:
make ARCH=local VERSION=sdbg
with the arch file from the toolchain
- Run like this:
cd tests/QS/regtest-ri-rpa-rse ; ../../../exe/local/cp2k.sdbg Cubic_RPA_RSE_H2.inp
- On the architecture/host/platform: openSUSE LEAP 15.2, GCC 7.5.0, OpenMPI 3.1.6; system-provided libopenblas_openmp; rest from toolchain
- See error
Setting OMP_NUM_THREADS=64 solves the issue.
Reproducible with DBCSR itself, configured with:
cmake -DTEST_MPI_RANKS=1 -DTEST_OMP_THREADS=72 ..
Running the tests then gives:
$ make test
Running tests...
Test project /data/tiziano/cp2k/exts/dbcsr/build
Start 1: dbcsr_perf:inputs/test_H2O.perf
1/17 Test #1: dbcsr_perf:inputs/test_H2O.perf ....................... Passed 72.55 sec
Start 2: dbcsr_perf:inputs/test_rect1_dense.perf
2/17 Test #2: dbcsr_perf:inputs/test_rect1_dense.perf ............... Passed 2.56 sec
Start 3: dbcsr_perf:inputs/test_rect1_sparse.perf
3/17 Test #3: dbcsr_perf:inputs/test_rect1_sparse.perf .............. Passed 10.91 sec
Start 4: dbcsr_perf:inputs/test_rect2_dense.perf
4/17 Test #4: dbcsr_perf:inputs/test_rect2_dense.perf ............... Passed 2.49 sec
Start 5: dbcsr_perf:inputs/test_rect2_sparse.perf
5/17 Test #5: dbcsr_perf:inputs/test_rect2_sparse.perf .............. Passed 10.36 sec
Start 6: dbcsr_perf:inputs/test_singleblock.perf
6/17 Test #6: dbcsr_perf:inputs/test_singleblock.perf ............... Passed 0.85 sec
Start 7: dbcsr_perf:inputs/test_square_dense.perf
7/17 Test #7: dbcsr_perf:inputs/test_square_dense.perf .............. Passed 1.09 sec
Start 8: dbcsr_perf:inputs/test_square_sparse.perf
8/17 Test #8: dbcsr_perf:inputs/test_square_sparse.perf ............. Passed 3.45 sec
Start 9: dbcsr_perf:inputs/test_square_sparse_bigblocks.perf
9/17 Test #9: dbcsr_perf:inputs/test_square_sparse_bigblocks.perf ... Passed 1.62 sec
Start 10: dbcsr_unittest1
10/17 Test #10: dbcsr_unittest1 ....................................... Passed 1372.54 sec
Start 11: dbcsr_unittest2
11/17 Test #11: dbcsr_unittest2 ....................................... Passed 236.76 sec
Start 12: dbcsr_unittest3
12/17 Test #12: dbcsr_unittest3 ....................................... Passed 308.31 sec
Start 13: dbcsr_unittest4
13/17 Test #13: dbcsr_unittest4 ....................................... Passed 0.89 sec
Start 14: dbcsr_tensor_unittest
14/17 Test #14: dbcsr_tensor_unittest .................................***Failed 4.51 sec
Start 15: dbcsr_tas_unittest
15/17 Test #15: dbcsr_tas_unittest .................................... Passed 3.59 sec
Start 16: dbcsr_test_csr_conversions
16/17 Test #16: dbcsr_test_csr_conversions ............................ Passed 10.47 sec
Start 17: dbcsr_tensor_test
17/17 Test #17: dbcsr_tensor_test ..................................... Passed 0.73 sec
94% tests passed, 1 tests failed out of 17
Total Test time (real) = 2043.68 sec
The following tests FAILED:
14 - dbcsr_tensor_unittest (Failed)
Errors while running CTest
make: *** [Makefile:124: test] Error 8
and Testing/Temporary/LastTest.log shows for the relevant test:
[...]
--------------------------------------------------------------------------------
TAS MATRIX MULTIPLICATION DONE
--------------------------------------------------------------------------------
GLOBAL INFO OF (14|25)
block dimensions: 4 5 11 3
full dimensions: 25 32 83 28
process grid dimensions: 1 1 1 1
DISTRIBUTION OF (14|25)
Number of non-zero blocks: 26
Percentage of non-zero blocks: 3.94
Average number of blocks per CPU: 26
Maximum number of blocks per CPU: 26
Average number of matrix elements per CPU: 64680
Maximum number of matrix elements per CPU: 64680
*******************************************************************************
* ___ *
* / \ *
* [ABORT] *
* \___/ Number of threads has changed! *
* | *
* O/| *
* /| | *
* / \ dbcsr_iterator_operations.F:179 *
*******************************************************************************
===== Routine Calling Stack =====
4 dbcsr_iterator_start
3 dbcsr_filter_anytype
2 dbcsr_t_contract
1 dbcsr_t_total
[...]
It seems that the number of OMP threads gets capped to 64 at some point.
Maybe this is caused by the NUM_THREADS=64 in install_openblas.sh?
Could be, should be easy to verify (ref-lapack, mkl, libsci). If it is indeed OpenBLAS, the question is what we should do. The DBCSR-only test above was with a system-provided OpenBLAS on an openSUSE-system.
We can:
- explicitly check OpenBLAS for number of threads when linking against OpenBLAS at initialization
- implicitly check by calling a BLAS routine at initialization
- reinit the first time this happens and then restrict future allocs to that lower number
- leave it to the user to provide a stable OMP env before calling into DBCSR
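For the last option, a minimal sketch of what a user-side wrapper could look like, based only on the workaround reported above (OMP_NUM_THREADS=64). The function name cap_omp_threads is hypothetical, and the limit of 64 is an assumption taken from this report; other OpenBLAS builds may have a different cap or none at all. It also assumes OMP_NUM_THREADS holds a single integer, not a comma-separated nesting list.

```shell
#!/bin/sh
# Hypothetical helper: print the thread count to use, i.e. the requested
# count capped at a limit. The default limit of 64 comes from the report
# above and is an assumption, not a documented OpenBLAS or DBCSR value.
cap_omp_threads() {
    limit="${1:-64}"
    # Fall back to the number of online processors if OMP_NUM_THREADS is unset.
    requested="${OMP_NUM_THREADS:-$(getconf _NPROCESSORS_ONLN)}"
    if [ "$requested" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$requested"
    fi
}

# Fix the thread count once, before the program starts.
OMP_NUM_THREADS="$(cap_omp_threads 64)"
export OMP_NUM_THREADS

# Then run as before, for example:
# ../../../exe/local/cp2k.sdbg Cubic_RPA_RSE_H2.inp
```

This only sidesteps the abort by never requesting more threads than the suspected cap; it does not address why the thread count changes inside the library.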
Could now reproduce this on a new (Apple silicon) MacBook Air. The only way around was to set OMP_NUM_THREADS=1.
Sorry, this was actually related to not building with FFTW3, see cp2k/cp2k#1315.