gmres_device_solve performance bottlenecks
During gmres_device_solve there are idle gaps in GPU utilization due to (IIUC):
- global reductions (following glsc3_reduce_kernel) and
- a small CPU kernel: https://github.com/ExtremeFLOW/neko/blob/1753fa9e89bd83e52a704523acfa103c0fb0cbc3/src/krylov/bcknd/device/gmres_device.F90#L435
Both of these are preceded by D2H and followed by H2D copies, some of which also operate on pageable host memory (at least the latter does), which risks blocking the host, since asynchronous copies involving pageable memory silently fall back to synchronous transfers.
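For reference, the round trip looks roughly like the sketch below, assuming the CUDA backend; `global_dot_roundtrip`, `d_val`, and `h_buf` are illustrative names, not neko's actual API:

```cuda
#include <cuda_runtime.h>
#include <mpi.h>

// Illustrative round trip for one inner product: the GPU sits idle while the
// host synchronizes, reduces across ranks, and copies the result back.
void global_dot_roundtrip(const double* d_val, double* d_result,
                          cudaStream_t stream)
{
  static double h_buf;  // pageable unless explicitly pinned, so the "async"
                        // copies below silently become blocking ones
  cudaMemcpyAsync(&h_buf, d_val, sizeof(double),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);                 // idle gap starts here
  MPI_Allreduce(MPI_IN_PLACE, &h_buf, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);                 // host-side global reduction
  cudaMemcpyAsync(d_result, &h_buf, sizeof(double),
                  cudaMemcpyHostToDevice, stream);
}
```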
Some of these bottlenecks could be solved fairly easily by moving the small kernel to the GPU and pinning the host memory used for the staging copies.
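A minimal sketch of both fixes, assuming CUDA (the HIP backend would mirror it). The Givens-rotation-style loop is only a stand-in for whatever the linked host code actually does, and all names are hypothetical:

```cuda
#include <cuda_runtime.h>

// (1) A tiny single-thread kernel standing in for the small CPU loop.
// Running serial work on the GPU is still a win here because it removes
// the D2H/H2D round trip entirely.
__global__ void small_update_kernel(double* h_col, const double* c,
                                    const double* s, int j)
{
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    for (int i = 0; i < j; ++i) {
      double tmp   = c[i] * h_col[i] + s[i] * h_col[i + 1];
      h_col[i + 1] = -s[i] * h_col[i] + c[i] * h_col[i + 1];
      h_col[i]     = tmp;
    }
  }
}

// (2) Allocate the scalar staging buffer once as page-locked memory so any
// remaining small copies are genuinely asynchronous; free with cudaFreeHost
// at teardown.
double* alloc_pinned_scalar(void)
{
  double* h_buf = nullptr;
  cudaMallocHost((void**)&h_buf, sizeof(double));
  return h_buf;
}
```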
Moving the entire GMRES iteration to the GPU (and fusing its kernels) would be best, but because of the global reduction this requires a GPU-resident global reduction, hence either
- SHMEM reductions, or
- device-aware MPI plus a persistent kernel that produces the partial reduction, signals the CPU that the data is ready, and then continues once the CPU signals that the data has been reduced (see the sketch below).
The latter is a more complex solution, so unless this proves to be a significant bottleneck at scale it may not be worth implementing (although it could work quite well even without SHMEM, especially on APUs and similar architectures where CPU and GPU share physical memory and the handshake is cheap).
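To make the handshake concrete, here is a minimal sketch, assuming CUDA 11+ (for libcu++ system-scope atomics), a device-aware MPI, and hardware that supports host/device atomics on pinned memory; all names (`persistent_gmres`, `host_progress`, the flag layout) are hypothetical:

```cuda
#include <cuda_runtime.h>
#include <cuda/atomic>   // libcu++, CUDA 11+
#include <mpi.h>
#include <new>

// A flag both sides can poll; it must live in pinned, mapped host memory.
using sys_flag = cuda::atomic<int, cuda::thread_scope_system>;

sys_flag* alloc_flag(void)
{
  void* p = nullptr;
  cudaMallocHost(&p, sizeof(sys_flag));
  return new (p) sys_flag(0);   // placement-new into pinned memory
}

// Single-block persistent kernel: compute the local partial reduction,
// signal the host, then spin until the host signals the result back.
__global__ void persistent_gmres(double* d_partial, sys_flag* ready,
                                 sys_flag* done, int iters)
{
  for (int it = 0; it < iters; ++it) {
    // ... compute this rank's partial reduction into *d_partial ...
    __syncthreads();
    if (threadIdx.x == 0) {
      ready->store(1, cuda::memory_order_release);           // data is ready
      while (done->load(cuda::memory_order_acquire) == 0) {} // wait for CPU
      done->store(0, cuda::memory_order_relaxed);
    }
    __syncthreads();
    // ... continue the iteration using the globally reduced *d_partial ...
  }
}

// Host side: poll, reduce the device buffer in place (device-aware MPI,
// so no D2H copy), and release the kernel.
void host_progress(double* d_partial, sys_flag* ready, sys_flag* done,
                   int iters)
{
  for (int it = 0; it < iters; ++it) {
    while (ready->load(cuda::memory_order_acquire) == 0) {}  // poll GPU flag
    ready->store(0, cuda::memory_order_relaxed);
    MPI_Allreduce(MPI_IN_PLACE, d_partial, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    done->store(1, cuda::memory_order_release);              // let GPU continue
  }
}
```

Usage would be to launch `persistent_gmres` once on a stream and then run `host_progress` on the same rank while the kernel is resident. A multi-block version would need a grid-wide sync (cooperative groups), and the usual persistent-kernel caveats about occupancy and forward progress apply; on APUs the flags and data already live in the same physical memory, which is why the scheme should be comparatively cheap there.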