gmres_device_solve performance bottlenecks
During gmres_device_solve there are idle gaps in GPU utilization due to (IIUC):
- global reductions (following glsc3_reduce_kernel) and
- a small CPU kernel: https://github.com/ExtremeFLOW/neko/blob/1753fa9e89bd83e52a704523acfa103c0fb0cbc3/src/krylov/bcknd/device/gmres_device.F90#L435
Both of these are preceded by D2H and followed by H2D copies, some of which also operate on pageable host memory (at least the latter does), which risks blocking the host, since asynchronous copies involving pageable memory silently fall back to synchronous transfers.
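For reference, the round trip looks roughly like the sketch below, assuming the CUDA backend; `global_dot_roundtrip`, `d_val`, and `h_buf` are illustrative names, not neko's actual API:

```cuda
#include <cuda_runtime.h>
#include <mpi.h>

// Illustrative round trip for one inner product: the GPU sits idle while the
// host synchronizes, reduces across ranks, and copies the result back.
void global_dot_roundtrip(const double* d_val, double* d_result,
                          cudaStream_t stream)
{
  static double h_buf;  // pageable unless explicitly pinned, so the "async"
                        // copies below silently become blocking ones
  cudaMemcpyAsync(&h_buf, d_val, sizeof(double),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);                 // idle gap starts here
  MPI_Allreduce(MPI_IN_PLACE, &h_buf, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);                 // host-side global reduction
  cudaMemcpyAsync(d_result, &h_buf, sizeof(double),
                  cudaMemcpyHostToDevice, stream);
}
```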
Some of these bottlenecks could be solved fairly easily by moving the small kernel to the GPU and pinning the host memory used for the staging copies.
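A minimal sketch of both fixes, assuming CUDA (the HIP backend would mirror it). The Givens-rotation-style loop is only a stand-in for whatever the linked host code actually does, and all names are hypothetical:

```cuda
#include <cuda_runtime.h>

// (1) A tiny single-thread kernel standing in for the small CPU loop.
// Running serial work on the GPU is still a win here because it removes
// the D2H/H2D round trip entirely.
__global__ void small_update_kernel(double* h_col, const double* c,
                                    const double* s, int j)
{
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    for (int i = 0; i < j; ++i) {
      double tmp   = c[i] * h_col[i] + s[i] * h_col[i + 1];
      h_col[i + 1] = -s[i] * h_col[i] + c[i] * h_col[i + 1];
      h_col[i]     = tmp;
    }
  }
}

// (2) Allocate the scalar staging buffer once as page-locked memory so any
// remaining small copies are genuinely asynchronous; free with cudaFreeHost
// at teardown.
double* alloc_pinned_scalar(void)
{
  double* h_buf = nullptr;
  cudaMallocHost((void**)&h_buf, sizeof(double));
  return h_buf;
}
```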
Moving the entire GMRES iteration to the GPU (and fusing its kernels) would be best, but because of the global reduction this requires a GPU-resident global reduction, hence either
- SHMEM reductions, or
- device-aware MPI plus a persistent kernel that produces the partial reduction, signals the CPU that the data is ready, and then continues once the CPU signals that the data has been reduced (see the sketch below).
The latter is a more complex solution, so unless this proves to be a significant bottleneck at scale it may not be worth implementing (although it could work quite well even without SHMEM, especially on APUs and similar architectures where CPU and GPU share physical memory and the handshake is cheap).
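To make the handshake concrete, here is a minimal sketch, assuming CUDA 11+ (for libcu++ system-scope atomics), a device-aware MPI, and hardware that supports host/device atomics on pinned memory; all names (`persistent_gmres`, `host_progress`, the flag layout) are hypothetical:

```cuda
#include <cuda_runtime.h>
#include <cuda/atomic>   // libcu++, CUDA 11+
#include <mpi.h>
#include <new>

// A flag both sides can poll; it must live in pinned, mapped host memory.
using sys_flag = cuda::atomic<int, cuda::thread_scope_system>;

sys_flag* alloc_flag(void)
{
  void* p = nullptr;
  cudaMallocHost(&p, sizeof(sys_flag));
  return new (p) sys_flag(0);   // placement-new into pinned memory
}

// Single-block persistent kernel: compute the local partial reduction,
// signal the host, then spin until the host signals the result back.
__global__ void persistent_gmres(double* d_partial, sys_flag* ready,
                                 sys_flag* done, int iters)
{
  for (int it = 0; it < iters; ++it) {
    // ... compute this rank's partial reduction into *d_partial ...
    __syncthreads();
    if (threadIdx.x == 0) {
      ready->store(1, cuda::memory_order_release);           // data is ready
      while (done->load(cuda::memory_order_acquire) == 0) {} // wait for CPU
      done->store(0, cuda::memory_order_relaxed);
    }
    __syncthreads();
    // ... continue the iteration using the globally reduced *d_partial ...
  }
}

// Host side: poll, reduce the device buffer in place (device-aware MPI,
// so no D2H copy), and release the kernel.
void host_progress(double* d_partial, sys_flag* ready, sys_flag* done,
                   int iters)
{
  for (int it = 0; it < iters; ++it) {
    while (ready->load(cuda::memory_order_acquire) == 0) {}  // poll GPU flag
    ready->store(0, cuda::memory_order_relaxed);
    MPI_Allreduce(MPI_IN_PLACE, d_partial, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    done->store(1, cuda::memory_order_release);              // let GPU continue
  }
}
```

Usage would be to launch `persistent_gmres` once on a stream and then run `host_progress` on the same rank while the kernel is resident. A multi-block version would need a grid-wide sync (cooperative groups), and the usual persistent-kernel caveats about occupancy and forward progress apply; on APUs the flags and data already live in the same physical memory, which is why the scheme should be comparatively cheap there.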