warpx
[WIP] Profiler tags for visualizing BTD in Nsight Systems
Looking at an Nsight profile I made a while ago for HiPACE++, I noticed two things:
- Properly synchronized function/profiler names are available under Processes/[…] hipace/CUDA HW …/99.7% Context 1/82.4% Stream 17/NVTX. So they do not show up under Threads or under [All Streams], but under the NVTX row of the most heavily used stream(s).
- The automatic synchronization after an MFIter loop (and likely also a ParIter loop) can take 20-30 µs with one rank, one GPU, and one MF box. If BTD runs many of these loops with little compute load in each, this might be the reason for the slow performance.
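To get a feeling for how the per-loop synchronization could add up, here is a rough back-of-the-envelope sketch. It is not WarpX or AMReX code: `LoopCost`, `sync_fraction`, and `total_sync_us` are hypothetical helpers, and the only measured input is the 20-30 µs range quoted above. The compute time per loop and the number of loops are assumptions to be replaced with values from an actual profile.

```cpp
#include <cassert>

// Hypothetical cost model, not part of WarpX or AMReX.
// All times are in microseconds.
struct LoopCost {
    double sync_us;     // implicit synchronization after one MFIter loop
    double compute_us;  // useful kernel time inside that loop (assumed)
};

// Fraction of one loop's wall time that is spent in the synchronization.
inline double sync_fraction(const LoopCost& c) {
    assert(c.sync_us >= 0.0 && c.compute_us >= 0.0);
    const double total = c.sync_us + c.compute_us;
    return total > 0.0 ? c.sync_us / total : 0.0;
}

// Total synchronization time accumulated over n_loops loops.
inline double total_sync_us(const LoopCost& c, long n_loops) {
    assert(n_loops >= 0);
    return c.sync_us * static_cast<double>(n_loops);
}
```

For example, with an assumed 25 µs of synchronization and 5 µs of compute per loop, `sync_fraction` gives about 0.83, and 1000 such loops accumulate 25 ms of pure synchronization in `total_sync_us`. Whether BTD is actually in this regime would have to be checked against the profile.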