Unifying kernel tests
At the moment, all of our non-reference kernel tests are basically the same (except for the templated tests, but I will get to those): we generate some random matrices or load one of the ani matrices, execute each of our kernels, possibly in multiple configurations, on reference and on the executor under test, and compare the results for equality or near-equality.
This approach means a lot of code duplication, a lot of potential for inconsistencies and a lot of work in case we need to change an internal kernel interface.
Using a very hacky and inaccurate bash one-liner, I looked at the fraction of differing lines between CUDA and HIP, as well as between OMP and CUDA/HIP, for each test:
```
> for f in cuda/test/**/*.c*; do echo $f; ff=${f##cuda/}; fff=${ff%%.c*}; c1=$(diff -y --suppress-common-lines $f hip/${fff}.* | grep "^" | wc -l); c2=$(cat $f | wc -l); echo "$c1 / $c2" | bc -l; done
cuda/test/base/exception_helpers.cu                  .18072289156626506024
cuda/test/base/math.cu                               .07228915662650602409
cuda/test/components/cooperative_groups.cu           .33716475095785440613
cuda/test/components/merging.cu                      .13103448275862068965
cuda/test/components/prefix_sum.cu                   .03191489361702127659
cuda/test/components/searching.cu                    .06530612244897959183
cuda/test/components/sorting.cu                      .12056737588652482269
cuda/test/factorization/par_ilu_kernels.cpp          .26011560693641618497
cuda/test/matrix/coo_kernels.cpp                     .04580152671755725190
cuda/test/matrix/csr_kernels.cpp                     .11893583724569640062
cuda/test/matrix/dense_kernels.cpp                   .12231759656652360515
cuda/test/matrix/ell_kernels.cpp                     .04624277456647398843
cuda/test/matrix/hybrid_kernels.cpp                  .07272727272727272727
cuda/test/matrix/sellp_kernels.cpp                   .06060606060606060606
cuda/test/preconditioner/jacobi_kernels.cpp          .13129973474801061007
cuda/test/solver/bicg_kernels.cpp                    .12885154061624649859
cuda/test/solver/bicgstab_kernels.cpp                .16713091922005571030
cuda/test/solver/cg_kernels.cpp                      .15073529411764705882
cuda/test/solver/cgs_kernels.cpp                     .17191977077363896848
cuda/test/solver/fcg_kernels.cpp                     .14736842105263157894
cuda/test/solver/gmres_kernels.cpp                   .15107913669064748201
cuda/test/solver/ir_kernels.cpp                      .13768115942028985507
cuda/test/solver/lower_trs_kernels.cpp               .11695906432748538011
cuda/test/solver/upper_trs_kernels.cpp               .12209302325581395348
cuda/test/stop/criterion_kernels.cpp                 .07619047619047619047
cuda/test/stop/residual_norm_reduction_kernels.cpp   .06164383561643835616
cuda/test/utils/assertions_test.cpp                  .09638554216867469879
```
```
> for f in omp/test/**/*.cpp; do echo $f; ff=${f##omp/}; fff=${ff%%.cpp}; c1=$(diff -y --suppress-common-lines $f cuda/${fff}.* | grep "^" | wc -l); c2=$(cat $f | wc -l); echo "$c1 / $c2" | bc -l; done
omp/test/components/prefix_sum.cpp                   .05376344086021505376
omp/test/factorization/par_ilu_kernels.cpp           .30769230769230769230
omp/test/matrix/coo_kernels.cpp                      .24444444444444444444
omp/test/matrix/csr_kernels.cpp                      .65886939571150097465
omp/test/matrix/dense_kernels.cpp                    .56441717791411042944
omp/test/matrix/ell_kernels.cpp                      .72398190045248868778
omp/test/matrix/hybrid_kernels.cpp                   .07305936073059360730
omp/test/matrix/sellp_kernels.cpp                    .47950819672131147540
omp/test/preconditioner/jacobi_kernels.cpp           .08366533864541832669
omp/test/solver/bicg_kernels.cpp                     .20000000000000000000
omp/test/solver/bicgstab_kernels.cpp                 .15449438202247191011
omp/test/solver/cg_kernels.cpp                       .11808118081180811808
omp/test/solver/cgs_kernels.cpp                      .14942528735632183908
omp/test/solver/fcg_kernels.cpp                      .11619718309859154929
omp/test/solver/gmres_kernels.cpp                    .13357400722021660649
omp/test/solver/ir_kernels.cpp                       .12213740458015267175
omp/test/solver/lower_trs_kernels.cpp                .44221105527638190954
omp/test/solver/upper_trs_kernels.cpp                .45959595959595959595
omp/test/stop/criterion_kernels.cpp                  .11111111111111111111
omp/test/stop/residual_norm_reduction_kernels.cpp    .40983606557377049180
```
Thus I would like to discuss whether you consider it sensible to unify the kernel tests into a single code structure, and if so, what the best approach to implementing this would be.
I'm thinking of a wrapper that amounts to something like
```cpp
ref->run(make_operation(...));
for (auto exec : execs) {
    try {
        exec->run(make_operation(...));
    } catch (NotImplemented...) {
    }
    ASSERT_*
}
```
This would also cut down on the test runtime (as I assume the reference implementations are often our bottleneck) and provide additional benefits, like a simple enumeration of missing kernels. It would also allow us to extend our GPU tests to the templated setting, while still avoiding issues with missing 64-bit support in cuSPARSE/hipSPARSE.
I would of course still keep the individual test folders, since there are certain components etc. that are specific to the executors and should only be tested there.
In the current test compilation, the HIP tests have a linking problem with the Ginkgo static library: we use the GCC compiler but add the HIP linker. This might cause some problems when combining the tests.
Another approach we could always take is to add some macros to reduce the duplication and generate some of the boilerplate code that most kernel tests share. It would probably not help with compilation time, but it would ease the writing of tests.