neko
non-overlapped D2D memcopies and memsets
I see a number of memsets and device-to-device (D2D) memcopies, none of which is overlapped with compute. Based on a Leonardo TGV 256k run, up to ~3% of wall-time is spent in these.
- In `cg_device_solve`, prior to `glsc3_kernel`: two memsets (`x_d`, `p_d`) and one D2D copy (`f_d` copied to `r_d`): https://github.com/ExtremeFLOW/neko/blob/develop/src/krylov/bcknd/device/cg_device.f90#L180-L182; since these are part of the CG solve, they are likely on the critical path
- `hsmg_solve` starts with a number of D2D copies, none of which are overlapped and which can affect the subsequent CG phase
- just before the "Pressure solve" region, a D2D copy
- right at the beginning of the "Pressure solve" region, a D2D copy
- five D2D copies at the beginning of the "Fluid" region
It could be worth looking into whether these copies and memsets are needed and, if so, whether they can be overlapped with other work for some "free" performance, in particular the first two items, which likely impact the performance of the CG region.
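To illustrate the kind of overlap I have in mind for the CG setup, here is a minimal standalone CUDA sketch. It is not Neko code and does not use Neko's device layer: the kernel `dot_self` is a hypothetical stand-in for the reduction, the stream and event names are made up, and only the buffer names (`x_d`, `p_d`, `f_d`, `r_d`) are borrowed from the list above. The two memsets go on a side stream while the `f_d` to `r_d` copy and the reduction run on the main stream; an event makes the main stream wait before anything would touch `x_d` or `p_d`. Whether the operations actually run concurrently depends on the hardware and on how much of the device each one occupies.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort on any CUDA runtime error.
#define CHECK(call)                                               \
  do {                                                            \
    cudaError_t err_ = (call);                                    \
    if (err_ != cudaSuccess) {                                    \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                   cudaGetErrorString(err_), __FILE__, __LINE__); \
      std::exit(EXIT_FAILURE);                                    \
    }                                                             \
  } while (0)

// Hypothetical stand-in for the reduction: each thread writes its own
// partial sum of r[i]*r[i]; the host adds the partial sums afterwards.
__global__ void dot_self(const double *r, double *partial, int n) {
  const int gid = blockIdx.x * blockDim.x + threadIdx.x;
  double acc = 0.0;
  for (int i = gid; i < n; i += gridDim.x * blockDim.x)
    acc += r[i] * r[i];
  partial[gid] = acc;
}

int main() {
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(double);
  const int threads = 256, blocks = 64, nparts = threads * blocks;

  double *f_d, *r_d, *x_d, *p_d, *partial_d;
  CHECK(cudaMalloc(&f_d, bytes));
  CHECK(cudaMalloc(&r_d, bytes));
  CHECK(cudaMalloc(&x_d, bytes));
  CHECK(cudaMalloc(&p_d, bytes));
  CHECK(cudaMalloc(&partial_d, nparts * sizeof(double)));

  std::vector<double> f_h(n, 1.0);
  CHECK(cudaMemcpy(f_d, f_h.data(), bytes, cudaMemcpyHostToDevice));

  cudaStream_t main_s, aux_s;  // assumed names, not Neko's streams
  cudaEvent_t zeroed;
  CHECK(cudaStreamCreate(&main_s));
  CHECK(cudaStreamCreate(&aux_s));
  CHECK(cudaEventCreateWithFlags(&zeroed, cudaEventDisableTiming));

  // Side stream: zero x and p, then record an event marking completion.
  CHECK(cudaMemsetAsync(x_d, 0, bytes, aux_s));
  CHECK(cudaMemsetAsync(p_d, 0, bytes, aux_s));
  CHECK(cudaEventRecord(zeroed, aux_s));

  // Main stream: r = f, then the reduction on r. Neither touches x or p,
  // so both are free to overlap with the memsets above.
  CHECK(cudaMemcpyAsync(r_d, f_d, bytes, cudaMemcpyDeviceToDevice, main_s));
  dot_self<<<blocks, threads, 0, main_s>>>(r_d, partial_d, n);
  CHECK(cudaGetLastError());

  // Work queued on main_s after this point may safely use x_d and p_d.
  CHECK(cudaStreamWaitEvent(main_s, zeroed, 0));
  CHECK(cudaStreamSynchronize(main_s));

  // Verify on the host: r.r should equal n, and x should be all zeros.
  std::vector<double> partial_h(nparts), x_h(n);
  CHECK(cudaMemcpy(partial_h.data(), partial_d, nparts * sizeof(double),
                   cudaMemcpyDeviceToHost));
  CHECK(cudaMemcpy(x_h.data(), x_d, bytes, cudaMemcpyDeviceToHost));
  double rr = 0.0, xsum = 0.0;
  for (double v : partial_h) rr += v;
  for (double v : x_h) xsum += v;
  std::printf("r.r = %.1f, sum(x) = %.1f\n", rr, xsum);

  CHECK(cudaEventDestroy(zeroed));
  CHECK(cudaStreamDestroy(aux_s));
  CHECK(cudaStreamDestroy(main_s));
  CHECK(cudaFree(f_d)); CHECK(cudaFree(r_d)); CHECK(cudaFree(x_d));
  CHECK(cudaFree(p_d)); CHECK(cudaFree(partial_d));
  return (rr == static_cast<double>(n) && xsum == 0.0) ? 0 : 1;
}
```

The D2D copy itself still sits in front of the reduction here, since the reduction reads `r_d`. Hiding that one would need either a reduction that reads `f_d` directly or a kernel that fuses the copy with the first use, which is a separate question from the stream-level overlap shown above.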