non-overlapped D2D memcopies and memsets

Open pszi1ard opened this issue 1 year ago • 0 comments

I see a number of memsets and device-to-device (D2D) memcopies, none of which are overlapped with compute. Based on a Leonardo TGV 256k run, up to ~3% of wall-time is spent in these.

  • In cg_device_solve prior to glsc3_kernel: two memsets (x_d, p_d) and one D2D copy (f_d copied to r_d): https://github.com/ExtremeFLOW/neko/blob/develop/src/krylov/bcknd/device/cg_device.f90#L180-L182; since these are part of the CG solve, they are likely on the critical path
  • hsmg_solve starts with a number of D2D copies, none of which are overlapped and which can affect the subsequent CG phase
  • just before the "Pressure solve" region: a D2D copy
  • right at the beginning of the "Pressure solve" region: a D2D copy
  • five D2D copies at the beginning of the "Fluid" region

It could be worth looking into whether these copies/zeroing are needed and, if so, whether they can be overlapped with other work for some "free" performance, in particular the first two items, which likely impact the performance of the CG region.
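To illustrate the kind of overlap meant for the cg_device_solve case, here is a minimal standalone CUDA sketch, not neko code. The kernel `glsc3_sketch`, the stream names `compute` and `aux`, and the event `zeroed` are hypothetical stand-ins; only the buffer names x_d, p_d, f_d and r_d come from the issue. It assumes the reduction reads only r_d, so the two memsets can go to a side stream while the copy and the reduction stay on the main stream. Whether the operations actually run concurrently depends on the hardware and on how much memory bandwidth is left over.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error: %s (%s:%d)\n",                     \
              cudaGetErrorString(err_), __FILE__, __LINE__);          \
      return 1;                                                       \
    }                                                                 \
  } while (0)

// Hypothetical stand-in for a glsc3-style reduction: per-block partial
// sums of a[i]*b[i]*c[i]. Assumes blockDim.x == 256 (a power of two).
__global__ void glsc3_sketch(const double *a, const double *b,
                             const double *c, double *partial, int n) {
  __shared__ double buf[256];
  double local = 0.0;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    local += a[i] * b[i] * c[i];
  buf[threadIdx.x] = local;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) partial[blockIdx.x] = buf[0];
}

int main() {
  const int n = 1 << 20, threads = 256, blocks = 128;
  const size_t bytes = n * sizeof(double);
  double *x_d, *p_d, *r_d, *f_d, *c_d, *partial_d;
  CUDA_CHECK(cudaMalloc(&x_d, bytes));
  CUDA_CHECK(cudaMalloc(&p_d, bytes));
  CUDA_CHECK(cudaMalloc(&r_d, bytes));
  CUDA_CHECK(cudaMalloc(&f_d, bytes));
  CUDA_CHECK(cudaMalloc(&c_d, bytes));
  CUDA_CHECK(cudaMalloc(&partial_d, blocks * sizeof(double)));

  std::vector<double> ones(n, 1.0);
  CUDA_CHECK(cudaMemcpy(f_d, ones.data(), bytes, cudaMemcpyHostToDevice));
  CUDA_CHECK(cudaMemcpy(c_d, ones.data(), bytes, cudaMemcpyHostToDevice));

  cudaStream_t compute, aux;
  cudaEvent_t zeroed;
  CUDA_CHECK(cudaStreamCreate(&compute));
  CUDA_CHECK(cudaStreamCreate(&aux));
  CUDA_CHECK(cudaEventCreateWithFlags(&zeroed, cudaEventDisableTiming));

  // Side stream: zeroing that the reduction does not depend on.
  // In a real solver, aux would first have to wait on an event from
  // the compute stream if earlier work there still uses x_d or p_d.
  CUDA_CHECK(cudaMemsetAsync(x_d, 0, bytes, aux));
  CUDA_CHECK(cudaMemsetAsync(p_d, 0, bytes, aux));
  CUDA_CHECK(cudaEventRecord(zeroed, aux));

  // Main stream: the copy feeds the reduction, so it stays on the
  // critical path.
  CUDA_CHECK(cudaMemcpyAsync(r_d, f_d, bytes, cudaMemcpyDeviceToDevice,
                             compute));
  glsc3_sketch<<<blocks, threads, 0, compute>>>(r_d, c_d, r_d,
                                                partial_d, n);
  CUDA_CHECK(cudaGetLastError());

  // Join: later work on the compute stream may now touch x_d and p_d.
  CUDA_CHECK(cudaStreamWaitEvent(compute, zeroed, 0));

  std::vector<double> partial(blocks);
  CUDA_CHECK(cudaMemcpyAsync(partial.data(), partial_d,
                             blocks * sizeof(double),
                             cudaMemcpyDeviceToHost, compute));
  CUDA_CHECK(cudaStreamSynchronize(compute));

  double sum = 0.0;
  for (double v : partial) sum += v;
  printf("sum = %.1f (n = %d)\n", sum, n);

  cudaFree(x_d); cudaFree(p_d); cudaFree(r_d);
  cudaFree(f_d); cudaFree(c_d); cudaFree(partial_d);
  cudaEventDestroy(zeroed);
  cudaStreamDestroy(aux);
  cudaStreamDestroy(compute);
  return sum == static_cast<double>(n) ? 0 : 1;
}
```

Since the memsets and the reduction are both memory-bound, the gain from this kind of overlap may be small. Another option, if the copy itself turns out to be necessary, would be to fuse the f_d to r_d copy into the reduction kernel so that r_d is written in the same pass that reads f_d.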

pszi1ard · Dec 22 '23 13:12