
Slow memory management on Nvidia GPUs

Open fstein93 opened this issue 1 year ago • 7 comments

If a DBCSR-heavy calculation in CP2K (LS_SCF) is profiled on NVIDIA GPUs, it turns out that DBCSR spends most of its time allocating/freeing memory on the GPU (tested on H100). PM for additional data. Potentially, this may also be the case on AMD hardware.

fstein93 avatar Sep 10 '24 13:09 fstein93

We do test LS, namely H2O-DFT-LS. I don't see any connection with the GPU type; the data movement is GPU-agnostic. Could you post the DBCSR statistics and the CP2K timers?

Specifically, for GPU data allocation we use memory pools, so I would not expect any big impact from that. My assumption is that these are allocations of the indices, which are asynchronous, so the effect should be minimal.

alazzaro avatar Sep 10 '24 13:09 alazzaro

I have recently run tests on a GH200 system with the OpenCL backend. The OpenCL backend supports surfacing profiling results in DBCSR's/CP2K's regular profile (printed at the end of execution). The allocations were visible for both host- and GPU-backed memory. However, this can also depend on the node's configuration, e.g., the amount of page-lockable memory. Still, the time spent was relatively negligible compared to the total time to solution (wall time).

hfp avatar Sep 10 '24 13:09 hfp

These are the prototypes that allow calling CP2K/DBCSR's timer facility, for instance from the cuda_hip sources: https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc.h#L67-L68

hfp avatar Sep 10 '24 13:09 hfp

@alazzaro I have sent you a mail.

fstein93 avatar Sep 10 '24 13:09 fstein93

> @alazzaro I have sent you a mail.

OK, so I've checked the slides, and my understanding is that the problem appears in the first multiplications, which is expected. We do use memory pools with a resize factor of 1 (if I recall correctly, we use memory pools on the CPU too, with a resize factor of 1.2). The main function which handles the memory settings (pools and allocations) is

https://github.com/cp2k/dbcsr/blob/f4e8c38dd79dedfe63a9615359ddccfa222e08e4/src/data/dbcsr_mem_methods.F#L207

Then, there is a function to ensure that the size of the buffers is sufficient.

The place where this function is called for the C matrix is:

https://github.com/cp2k/dbcsr/blob/f4e8c38dd79dedfe63a9615359ddccfa222e08e4/src/mm/dbcsr_mm_cannon.F#L1199

where we also try to make an educated guess of the final size (per thread).

Now, the occupancies of the matrices increase with the multiplications, up to a given plateau. So, in the first multiplications there is a reallocation of the memory, but then we use the memory pool and do not reallocate. So, the benchmark itself can have a bit of overhead, but in real production runs (with many more multiplications) the effect is negligible. So the question is: do we see the memory allocations for all multiplications, i.e. is the memory pool never used?

I can imagine making the resize_factor an external parameter so that we can avoid reallocations (at the cost of a larger memory footprint).

cudaMallocAsync would require some refactoring, but I don't think it is worth the pain.

alazzaro avatar Sep 10 '24 14:09 alazzaro

@fstein93 was the issue discovered on GH200 like Alps?

hfp avatar Sep 13 '24 12:09 hfp

It was 8xH100 with 2 ranks per GPU. I did not run the tests.

fstein93 avatar Sep 13 '24 12:09 fstein93