trust-code Optimize `op_conv_vef

This PR optimizes the large convective kernel in `src/VEF/Operateurs/Op_Conv/Op_Conv_VEF_Face.cpp` by replacing temporary arrays in local memory with Kokkos views in scratch memory.

Mar 28 '25 11:03 pzehner

Thanks Paul, I will have a look on Monday.

Mar 28 '25 16:03 pledac

For the Op_Conv_VEF_Face kernel, we see a speedup between 18% and 34% (Nvidia A6000) across our GPU test cases. On H100, the speedup drops to between 6% and 14%.

And strangely, it seems slower on A100...

I merged your code into a local branch here because the pattern is very interesting and because, thanks to your work, we now know that a local static array does not use registers but global memory, which is replaced here by faster scratch memory. The code can switch between the two implementations (with and without scratch memory) via a TRUST_USE_SCRATCH_MEMORY environment variable, for testing.
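A minimal sketch of how such a runtime switch could be read, for illustration only. The helper names `parse_flag` and `use_scratch_memory` are hypothetical, and the accepted values (anything other than unset, empty, or "0" enables scratch memory) are an assumption, not necessarily what the branch does.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: interprets the raw value of an environment variable.
// Assumption: unset, empty, or "0" means "off"; any other value means "on".
inline bool parse_flag(const char* value)
{
  if (value == nullptr) return false;       // variable not set
  if (value[0] == '\0') return false;       // set but empty
  return std::strcmp(value, "0") != 0;      // "0" disables, anything else enables
}

// Hypothetical helper: true when the scratch-memory implementation is requested.
inline bool use_scratch_memory()
{
  return parse_flag(std::getenv("TRUST_USE_SCRATCH_MEMORY"));
}
```

The call site would then branch once, outside the kernel, between the two implementations, so both can be timed on the same test case.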

Mar 28 '25 17:03 pledac

I am adding Adrien and Rémi to discuss the benefit/complexity ratio introduced by using scratch memory. To give an idea, a 30% speedup is the probable gain from using the right layout on this kernel. What bothers me, for example, is the warp size, set here to 32. Is this value GPU specific? Is it the same on AMD, and what about future GPU cards in 10 years? Kokkos provides performance portability and, in my limited understanding, the developer should not have to care about this value.
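To make the concern concrete, here is a small arithmetic sketch of how a team size tied to the warp width changes the launch parameters. Everything in it is hypothetical (the `LaunchConfig` struct, `make_launch_config`, and the figure of 12 temporaries per thread); it only assumes one work item per thread and a fixed number of `double` temporaries per thread.

```cpp
#include <cstddef>

// Hypothetical launch parameters derived from a team size.
struct LaunchConfig {
  std::size_t league_size;    // number of teams needed to cover all items
  std::size_t scratch_bytes;  // scratch memory needed per team, in bytes
};

// Assumptions: one work item per thread, `doubles_per_thread` temporaries
// of type double per thread, and team_size > 0.
constexpr LaunchConfig make_launch_config(std::size_t n_items,
                                          std::size_t team_size,
                                          std::size_t doubles_per_thread)
{
  return LaunchConfig{
    (n_items + team_size - 1) / team_size,           // round up
    team_size * doubles_per_thread * sizeof(double)  // per-team footprint
  };
}
```

With 1000 items and 12 temporaries per thread, a team size of 32 gives 32 teams and 3072 bytes of scratch per team, while a team size of 64 (the wavefront width on AMD CDNA GPUs such as the MI250X) gives 16 teams and 6144 bytes per team. Kokkos `TeamPolicy` also accepts `Kokkos::AUTO` as the team size, which might be a way to avoid hardcoding the value.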

According to Hari's tests, scratch memory through hierarchical memory is not worthwhile on other kernels (such as the diffusion one).

Mar 28 '25 17:03 pledac

On the AMD MI250X, the slowdown with scratch memory is between 7% and 25% (warp size 64?).

Mar 29 '25 12:03 pledac

Optimize `op_conv_vef_face` kernel