
Some unit tests are too slow

Open teonnik opened this issue 2 years ago • 4 comments

Anything over 5-10s is IMO too much. The waiting time may discourage developers from running unit tests frequently enough. On my laptop with an Intel i7-8550U (8) @ 4.000GHz, the following tests take too long to execute:

43/51 Test #43: test_reduction_to_band ...........   Passed   78.29 sec
44/51 Test #44: test_bt_reduction_to_band ........   Passed   55.41 sec
45/51 Test #45: test_gen_to_std ..................   Passed   17.93 sec
46/51 Test #46: test_cholesky ....................   Passed   15.64 sec
47/51 Test #47: test_compute_t_factor ............   Passed   28.92 sec
49/51 Test #49: test_multiplication_triangular ...   Passed  110.95 sec
51/51 Test #51: test_triangular ..................   Passed  349.46 sec

teonnik avatar Jun 12 '22 07:06 teonnik

I agree that the tests take very long to finish. There may be multiple reasons for it, but I wonder if one of them is simply oversubscription. I think most of those tests run with 6 ranks, and probably with an unconstrained number of threads. Just for comparison, could you try running e.g. test_triangular with --pika:threads=2 --pika:bind=none? There would still be oversubscription, but less of it, so I would expect the test to finish faster.
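To put a rough number on the oversubscription, here is a back-of-the-envelope sketch. It assumes that, without --pika:threads, each rank starts one worker thread per hardware thread (8 on the laptop above); that default is an assumption for illustration, not something checked against pika. The helper name is made up.

```python
def oversubscription(ranks: int, threads_per_rank: int, hw_threads: int) -> float:
    """Return how many software threads compete for each hardware thread."""
    return ranks * threads_per_rank / hw_threads


# Assumption: an unconstrained rank starts 8 worker threads on this machine.
unconstrained = oversubscription(ranks=6, threads_per_rank=8, hw_threads=8)

# With --pika:threads=2 every rank is limited to 2 worker threads.
constrained = oversubscription(ranks=6, threads_per_rank=2, hw_threads=8)

print(f"unconstrained: {unconstrained:.1f}x, with --pika:threads=2: {constrained:.1f}x")
# prints "unconstrained: 6.0x, with --pika:threads=2: 1.5x"
```

Under that assumption the machine goes from 6x to 1.5x oversubscribed, which would be consistent with a noticeable speedup.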

msimberg avatar Jun 13 '22 07:06 msimberg

Yes, indeed, that helped. The unit tests ran more than twice as fast. For example:

test_compute_t_factor ~ 10s
test_triangular ~ 150s

but even so, some tests still take a while to finish.

teonnik avatar Jun 13 '22 10:06 teonnik

Do we have a way to restrict a unit test to only use e.g. 4 ranks (for your case of 8 cores), or similar?

msimberg avatar Jun 13 '22 10:06 msimberg

The problem with the triangular solver and multiplication is the 24 different cases that have to be tested (left/right, upper/lower, no-trans/trans/conj-trans, unit/non-unit diagonal), each on different grids. (Note: distributed triangular multiplication doesn't support the transposed and conjugate-transposed cases yet, so it is faster.)
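For reference, the count of 24 comes from the product of the four options. A small sketch of the enumeration follows; the option labels are illustrative names, not identifiers from the DLA-Future test code.

```python
from itertools import product

# Illustrative labels for the four options of a triangular solve/multiplication.
sides = ("left", "right")
uplos = ("upper", "lower")
ops = ("no-trans", "trans", "conj-trans")
diags = ("unit", "non-unit")

# Every combination is one test case: 2 * 2 * 3 * 2 = 24.
cases = list(product(sides, uplos, ops, diags))
print(len(cases))  # prints 24

# Distributed multiplication only covers the no-trans cases for now.
distributed_mult_cases = [case for case in cases if case[2] == "no-trans"]
print(len(distributed_mult_cases))  # prints 8
```

Each of these cases is then repeated for every grid and matrix size, which is where the run time adds up.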

Implementing a CMake flag to reduce the 6-rank tests to 4 ranks should be easy, but it would remove the most important test: a non-square grid with non-trivial communicators in both dimensions.

This issue can be linked with #557. My idea is to split some of the tests (blas/lapack/dlaf algorithms) into two parts:

  • a unit test with few representative fast tests (few grids and selected tests)
  • a more intensive test with more cases (all grids and larger matrices)

Other possible TODOs:

  • separate the local and the distributed tests of the dlaf algorithms (currently the local tests are executed simultaneously on all ranks)
  • improve the CMake logic that sets up the tests so that it adds --pika:threads=? --pika:bind=none when Slurm is not available.
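The last point could follow logic along these lines. This is only a sketch of the decision, written in Python rather than CMake. The function name, the check for `srun` on the PATH as a stand-in for "Slurm is available", and the cores-divided-by-ranks heuristic for the thread count are all assumptions; the actual value for --pika:threads is still open.

```python
import os
import shutil
from typing import List, Optional


def extra_test_args(ranks: int,
                    cores: Optional[int] = None,
                    slurm_available: Optional[bool] = None) -> List[str]:
    """Return extra command-line arguments for a test that runs with `ranks` ranks."""
    if slurm_available is None:
        # Assumption: Slurm counts as available if `srun` is found on the PATH.
        slurm_available = shutil.which("srun") is not None
    if slurm_available:
        # With Slurm the resources per rank are left to the scheduler.
        return []
    if cores is None:
        cores = os.cpu_count() or 1
    # Hypothetical heuristic: split the cores evenly, at least one thread per rank.
    threads = max(1, cores // ranks)
    return [f"--pika:threads={threads}", "--pika:bind=none"]


print(extra_test_args(ranks=6, cores=8, slurm_available=False))
# prints ['--pika:threads=1', '--pika:bind=none']
```

With 8 cores this heuristic gives 1 thread per rank for the 6-rank tests and 2 threads per rank for 4-rank tests.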

rasolca avatar Jun 14 '22 13:06 rasolca