DLA-Future
Some unit tests are too slow
Anything over 5-10s is IMO too much. The waiting time may discourage developers from running unit tests frequently enough. On my laptop with an Intel i7-8550U (8) @ 4.000GHz, the following tests take too long to execute:
43/51 Test #43: test_reduction_to_band ........... Passed 78.29 sec
44/51 Test #44: test_bt_reduction_to_band ........ Passed 55.41 sec
45/51 Test #45: test_gen_to_std .................. Passed 17.93 sec
46/51 Test #46: test_cholesky .................... Passed 15.64 sec
47/51 Test #47: test_compute_t_factor ............ Passed 28.92 sec
49/51 Test #49: test_multiplication_triangular ... Passed 110.95 sec
51/51 Test #51: test_triangular .................. Passed 349.46 sec
I agree that the tests take very long to finish. There may be multiple reasons for it, but I wonder if one of them is simply oversubscription. I think most of those tests run with 6 ranks, and probably an unconstrained number of threads. Just for comparison, could you try running e.g. test_triangular with --pika:threads=2 --pika:bind=none? There would still be oversubscription, but not as much, so I would expect the test to finish faster.
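To put rough numbers on the oversubscription, here is a minimal sketch of the arithmetic. It assumes that an unconstrained rank starts one worker thread per hardware thread (the actual default depends on how pika is configured), and it takes the 8 hardware threads of the laptop above and the 6 ranks mentioned in this comment. The function name is made up for illustration.

```python
def oversubscription(ranks: int, threads_per_rank: int, hardware_threads: int) -> float:
    """Ratio of worker threads to hardware threads; above 1.0 means oversubscribed."""
    return ranks * threads_per_rank / hardware_threads


HARDWARE_THREADS = 8  # laptop from the report above
RANKS = 6             # rank count assumed for most of the slow tests

# Assumption: without limits, each rank starts one worker per hardware thread.
unconstrained = oversubscription(RANKS, HARDWARE_THREADS, HARDWARE_THREADS)

# With --pika:threads=2, each rank starts two workers.
limited = oversubscription(RANKS, 2, HARDWARE_THREADS)

print(unconstrained, limited)  # 6.0 1.5
```

Under these assumptions the flag lowers the load from 6 worker threads per hardware thread to 1.5, so some oversubscription remains.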
Yes, indeed, that helped. The unit tests ran more than twice as fast. For example:
test_compute_t_factor ~ 10s
test_triangular ~ 150s
but even so, some tests still take a while to finish.
Do we have a way to restrict a unit test to only use e.g. 4 ranks (for your case of 8 cores), or similar?
The problem with the triangular solver and multiplication is the 24 different cases that have to be tested (left/right, upper/lower, no-trans/trans/conj-trans, non-unit/unit diagonal) on different grids. (Note: distributed triangular multiplication doesn't support the transposed and conjugate-transposed cases yet, and is therefore faster.)
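For reference, the count of 24 comes from the product of the four options. The sketch below enumerates them; the option labels are illustrative names, not the identifiers used in the DLA-Future test sources, and the number of grids is left out because it is not stated here.

```python
from itertools import product

# Illustrative labels for the four BLAS-style options.
SIDES = ("left", "right")
UPLOS = ("upper", "lower")
OPS = ("no-trans", "trans", "conj-trans")
DIAGS = ("non-unit", "unit")

# Every combination tested for the triangular solver.
all_cases = list(product(SIDES, UPLOS, OPS, DIAGS))

# Distributed multiplication currently covers only the non-transposed cases.
no_trans_cases = [case for case in all_cases if case[2] == "no-trans"]

print(len(all_cases), len(no_trans_cases))  # 24 8
```

Each of these cases is then repeated for every grid and matrix size in the test, which is where the run time adds up.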
Implementing a CMake flag to reduce the 6-rank tests to 4 ranks should be easy, but it would remove the most important test: a non-square grid with non-trivial communicators in both dimensions.
This issue can be linked with #557. My idea is to split some of the tests (BLAS/LAPACK/DLAF algorithms) into two parts:
- a unit test with few representative fast tests (few grids and selected tests)
- a more intensive test with more cases (all grids and larger matrices)
Other possible TODOs:
- separate the local and the distributed tests of the dlaf algorithms (currently the local tests are executed simultaneously on all ranks)
- improve the CMake logic that sets up the tests, to add --pika:threads=? --pika:bind=none when Slurm is not available.
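One possible way to choose the thread count is sketched below, purely as an illustration. The heuristic (hardware threads divided evenly among ranks, at least one each) and the function name are assumptions, not something decided in this issue, and the real logic would live in CMake rather than Python.

```python
def pika_args(ranks: int, hardware_threads: int) -> list[str]:
    """Build the extra test arguments for a run without Slurm.

    Hypothetical heuristic: share the hardware threads evenly among the
    ranks, but never go below one worker thread per rank.
    """
    threads = max(1, hardware_threads // ranks)
    return [f"--pika:threads={threads}", "--pika:bind=none"]


print(pika_args(6, 8))  # ['--pika:threads=1', '--pika:bind=none']
print(pika_args(4, 8))  # ['--pika:threads=2', '--pika:bind=none']
```

With this rule a 6-rank test on 8 hardware threads is never oversubscribed, at the cost of leaving two hardware threads idle.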