quda Add fine-grained parallelism + matrix tiling to computeCoarseClover

Open weinbe2 opened this issue 5 years ago • 1 comments

The routine computeCoarseClover (https://github.com/lattice/quda/blob/develop/include/kernels/coarse_op_kernel.cuh#L1014) leaves a huge amount of parallelism unexploited as implemented. This turns into a bit of a nightmare when autotuning, and it could be a blocker in use cases where it is desirable not to coarsen the preconditioned op.

Jul 25 '20 19:07 weinbe2

Just to note that computeCoarseClover already has fine-grained parallelism, and that #1050 significantly improves the performance of this kernel, although it does not yet reformulate it using the matrix-tiling abstraction.

Aug 21 '20 00:08 maddyscientist
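
For readers unfamiliar with the term, the following is a generic sketch of what matrix tiling means, written as plain host-side C++. It is not QUDA's matrix-tiling abstraction and not the computeCoarseClover kernel; the names `tiled_matmul` and `naive_matmul` and the tile size `T` are hypothetical and chosen only for illustration. The point it tries to show is that each output tile is computed independently, which is what would let a GPU kernel hand tiles to separate threads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of QUDA: C += A * B for n x n row-major
// matrices, processed in T x T tiles. Each (ti, tj) output tile depends
// only on a row-band of A and a column-band of B, so tiles can be computed
// independently (e.g. one tile per thread on a GPU).
template <std::size_t T>
void tiled_matmul(const std::vector<double> &A, const std::vector<double> &B,
                  std::vector<double> &C, std::size_t n)
{
  static_assert(T > 0, "tile size must be positive");
  for (std::size_t ti = 0; ti < n; ti += T) {
    for (std::size_t tj = 0; tj < n; tj += T) {
      double acc[T][T] = {}; // per-tile accumulator (registers on a GPU)
      for (std::size_t tk = 0; tk < n; tk += T) {
        // Multiply one T x T tile of A by one T x T tile of B.
        // The bounds checks handle n not being a multiple of T.
        for (std::size_t i = 0; i < T && ti + i < n; i++)
          for (std::size_t j = 0; j < T && tj + j < n; j++)
            for (std::size_t k = 0; k < T && tk + k < n; k++)
              acc[i][j] += A[(ti + i) * n + (tk + k)] * B[(tk + k) * n + (tj + j)];
      }
      // Write the finished tile back to C.
      for (std::size_t i = 0; i < T && ti + i < n; i++)
        for (std::size_t j = 0; j < T && tj + j < n; j++)
          C[(ti + i) * n + (tj + j)] += acc[i][j];
    }
  }
}

// Untiled reference with the same semantics (C += A * B), for comparison.
void naive_matmul(const std::vector<double> &A, const std::vector<double> &B,
                  std::vector<double> &C, std::size_t n)
{
  for (std::size_t i = 0; i < n; i++)
    for (std::size_t j = 0; j < n; j++)
      for (std::size_t k = 0; k < n; k++)
        C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Calling `tiled_matmul<2>(A, B, C, n)` and `naive_matmul(A, B, R, n)` on the same inputs should give matching results; the tile size only changes how the work is partitioned, not what is computed.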
