Slow memory management on Nvidia GPUs
If a DBCSR-heavy calculation in CP2K (LS_SCF) is profiled on NVIDIA GPUs, it turns out that DBCSR spends a large fraction (most) of its time allocating/freeing memory on the GPU (tested on H100). PM for additional data. Potentially, this may also be the case on AMD hardware.
We do test LS, namely H2O-DFT-LS. I don't see any connection with the GPU type; the data movement is GPU-agnostic. Could you post the DBCSR statistics and CP2K timers?
Specifically, for GPU data allocation we use memory pools, so I would not expect any big impact from that. I assume these are allocations of the indices, which are async, so the effect should be minimal.
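For readers unfamiliar with the approach, the sketch below shows the general idea of a device-memory pool. It is not DBCSR's actual implementation (that lives in src/data/dbcsr_mem_methods.F), just an illustration of why pooled buffers avoid repeated cudaMalloc/cudaFree calls:

```cpp
// Illustrative only: a minimal device-memory pool in the spirit of what is
// described above. Buffers are kept and reused, so cudaMalloc/cudaFree are
// only hit when a request exceeds the cached capacity.
#include <cuda_runtime.h>
#include <cstddef>

struct DevicePool {
  void*  buf      = nullptr;
  size_t capacity = 0;

  // Return a device buffer of at least `bytes`; reallocate only on growth.
  void* acquire(size_t bytes) {
    if (bytes > capacity) {
      if (buf) cudaFree(buf);   // slow path: only when the pool is too small
      cudaMalloc(&buf, bytes);
      capacity = bytes;
    }
    return buf;                 // fast path: no CUDA allocation call at all
  }

  ~DevicePool() {
    if (buf) cudaFree(buf);
  }
};
```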
I recently ran tests on a GH200 system with the OpenCL backend. The OpenCL backend can make profiling results appear in DBCSR's/CP2K's regular profile (printed at the end of execution), and the allocations were visible for both host- and GPU-backed memory. This can also depend on the node's configuration, e.g. the amount of page-lockable memory. Still, the time spent was relatively negligible compared to the total time to solution (wall time).
These are the prototypes that allow calling CP2K's/DBCSR's timer facility, for instance from the cuda_hip sources:
https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc.h#L67-L68
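For illustration, this is the kind of instrumentation those prototypes enable. The `timeset`/`timestop` helpers below are self-contained stand-ins for the real entry points declared in acc.h (check the header for the actual names and signatures), and the timed allocation is just an example:

```cpp
// Sketch only: in DBCSR the timer entry points are the prototypes declared in
// src/acc/acc.h; the timeset/timestop helpers here are stand-ins so that the
// example compiles on its own.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

static std::chrono::steady_clock::time_point t0;

static void timeset(const char* routine) {
  std::printf("start %s\n", routine);
  t0 = std::chrono::steady_clock::now();
}

static void timestop(const char* routine) {
  const double ms =
      std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count();
  std::printf("stop  %s: %.3f ms\n", routine, ms);
}

// Wrap a device allocation so that its cost shows up as a named timer region.
void* timed_device_alloc(size_t bytes) {
  timeset("device_alloc");
  void* ptr = nullptr;
  cudaMalloc(&ptr, bytes);
  timestop("device_alloc");
  return ptr;
}
```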
@alazzaro I have sent you a mail.
OK, so I've checked the slides, and my understanding is that the problem appears in the first multiplications, which is expected. We use memory pools with a resize factor of 1 (if I recall correctly, we also use memory pools on the CPU, with a resize factor of 1.2). The main function that handles the memory setup (pools and allocations) is
https://github.com/cp2k/dbcsr/blob/f4e8c38dd79dedfe63a9615359ddccfa222e08e4/src/data/dbcsr_mem_methods.F#L207
Then there is a function that ensures the buffer sizes are sufficient. For the C matrix, it is called here:
https://github.com/cp2k/dbcsr/blob/f4e8c38dd79dedfe63a9615359ddccfa222e08e4/src/mm/dbcsr_mm_cannon.F#L1199
where we also try to make an educated guess of the final size (per thread).
Now, the occupancies of the matrices increase with the multiplications, up to a given plateau. So, in the first multiplications there is a reallocation of the memory, but afterwards we reuse the memory pool and do not reallocate. The benchmark itself can therefore have a bit of overhead, but in real production runs (with many more multiplications) the effect is negligible. So the question is: do we see memory allocations for all multiplications, i.e. is the memory pool never used?
I can imagine making the resize_factor an external parameter so that we can avoid reallocations (at the cost of a larger memory footprint).
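A minimal sketch of that idea, assuming a grow-only device buffer with a configurable over-allocation factor (the real logic is in dbcsr_mem_methods.F and is more involved):

```cpp
// Sketch of the resize-factor idea discussed above. A factor of 1.0 grows
// buffers to exactly the requested size (so every growth reallocates), while
// e.g. 1.2 over-allocates so that subsequent, slightly larger requests hit
// the pool instead of triggering another cudaMalloc/cudaFree pair.
#include <cuda_runtime.h>
#include <cstddef>

struct GrowOnlyBuffer {
  void*  buf           = nullptr;
  size_t capacity      = 0;
  double resize_factor = 1.0;  // could be exposed as an external parameter

  void ensure_size(size_t bytes) {
    if (bytes <= capacity) return;  // pool hit: no reallocation
    const size_t new_cap = static_cast<size_t>(bytes * resize_factor);
    if (buf) cudaFree(buf);
    cudaMalloc(&buf, new_cap);
    capacity = new_cap;
  }
};
```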
Using cudaMallocAsync would require some refactoring, but I don't think it is worth the pain.
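For reference, the stream-ordered allocator path would look roughly like this; cudaMallocAsync/cudaFreeAsync are standard CUDA runtime calls (CUDA 11.2+), but the surrounding code is only an illustration of the refactoring it would imply, since every call site needs the stream that orders the allocation:

```cpp
// Illustration of the cudaMallocAsync path mentioned above (CUDA 11.2+):
// allocations become stream-ordered and are served from a driver-managed
// pool, but the allocating code must know which stream orders the lifetime.
#include <cuda_runtime.h>
#include <cstddef>

void scratch_on_stream(size_t bytes, cudaStream_t stream) {
  void* ptr = nullptr;
  cudaMallocAsync(&ptr, bytes, stream);  // enqueued on `stream`, no device-wide sync
  // ... launch kernels that use `ptr` on the same stream ...
  cudaFreeAsync(ptr, stream);            // memory returns to the driver's pool
}
```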
@fstein93, was the issue discovered on a GH200 system like Alps?
It was 8xH100 with 2 ranks per GPU. I did not run the tests.