quda
Add fine-grained parallelism + matrix tiling to computeCoarseClover
The routine computeCoarseClover (https://github.com/lattice/quda/blob/develop/include/kernels/coarse_op_kernel.cuh#L1014) exposes very little parallelism as implemented. This makes autotuning a bit of a nightmare, and could be a blocker in use cases where it is desirable not to coarsen the preconditioned operator.
Just to note that computeCoarseClover already has fine-grained parallelism, and that #1050 significantly improves the performance of this kernel, although it does not yet reformulate it using the matrix-tiling abstraction.
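For anyone unfamiliar with the idea, here is a rough host-side sketch of what a matrix-tiled formulation could look like. It assumes the inner work can be viewed as a small dense product of the form C += A^† B. The `Matrix` struct, `accumulate_tile`, and `tiled_adjoint_multiply` are hypothetical names for illustration only and are not QUDA's tiling abstraction or API. The point is that each output tile is an independent unit of work, so on the GPU the two tile loops could be mapped to threads or thread blocks, with the tile size left as a tunable parameter.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical helper (not part of QUDA): row-major dense complex matrix.
struct Matrix {
  std::size_t rows, cols;
  std::vector<cplx> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
  cplx &operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  const cplx &operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Accumulate one T x T output tile of C += A^dagger * B.
// A is k x m, B is k x n, C is m x n. Tiles on the boundary are clipped.
template <std::size_t T>
void accumulate_tile(Matrix &C, const Matrix &A, const Matrix &B, std::size_t ti, std::size_t tj)
{
  cplx acc[T][T] = {}; // tile-local accumulator (registers on a GPU)
  const std::size_t i0 = ti * T, j0 = tj * T;

  for (std::size_t k = 0; k < A.rows; k++)
    for (std::size_t i = 0; i < T && i0 + i < C.rows; i++)
      for (std::size_t j = 0; j < T && j0 + j < C.cols; j++)
        acc[i][j] += std::conj(A(k, i0 + i)) * B(k, j0 + j);

  // Write the finished tile back once.
  for (std::size_t i = 0; i < T && i0 + i < C.rows; i++)
    for (std::size_t j = 0; j < T && j0 + j < C.cols; j++)
      C(i0 + i, j0 + j) += acc[i][j];
}

// Loop over all output tiles. Each (ti, tj) pair is independent of the others,
// which is the fine-grained parallelism a kernel launch could exploit.
template <std::size_t T>
void tiled_adjoint_multiply(Matrix &C, const Matrix &A, const Matrix &B)
{
  const std::size_t tiles_m = (C.rows + T - 1) / T;
  const std::size_t tiles_n = (C.cols + T - 1) / T;
  for (std::size_t ti = 0; ti < tiles_m; ti++)
    for (std::size_t tj = 0; tj < tiles_n; tj++)
      accumulate_tile<T>(C, A, B, ti, tj);
}
```

Calling `tiled_adjoint_multiply<2>(C, A, B)` on a 5 x 5 output, for example, produces nine independent tiles, the last row and column of which are clipped. This says nothing about how the real kernel should lay out its data or map tiles to threads; that would be for the actual reformulation to decide.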