Laghos
Excessive device memory wastage
Hi developers, our team has developed a GPU device memory profiling tool and found excessive device memory wastage in Laghos. For example, the vectors q_dt_est, q_e, e_vec, q_dx, and q_dv in class QUpdate, among others, are used only at a very early stage but are not released until the QUpdate object is reclaimed. I tried some naive optimizations, such as explicitly releasing objects that are used only early on yet stay alive for the whole application run, and got a good reduction in peak device memory (at least 35%). Could you please optimize these kinds of inefficiencies?
Thank you for taking the time to investigate memory utilization.
Most objects are allocated and computed during setup and are especially designed to remain on the device to reduce memory transfer.
When used in conjunction with Umpire, MFEM's memory management now supports temporary memory pools. Where possible, this capability might be advantageous for some of the largest vectors.
What GPU memory profiling software do you use?
Hello, thanks so much for your response. I use a GPU profiling tool developed by our team. It is still under development and will be open-sourced once it is finished. I may have been ambiguous before. We found that some objects (actual cudaMalloc allocations) are never used after a certain kernel, yet they are not released until the application ends. We can therefore free those objects right after the last kernel that accesses them, which introduces no later memory transfers but lowers the peak memory usage. For example, suppose there is an object A and a kernel sequence k1, k2, k3, and k4, where k1 and k2 access A, and k3 and k4 do not. The original application releases A after all kernel executions (after k4). But we can release A after k2, so the memory usage during k3 and k4 is reduced.
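To make the A / k1..k4 example concrete, here is a minimal host-side sketch in C++. It does not touch Laghos or CUDA: Tracker, Buffer, and run_sequence are hypothetical names, and the byte counter stands in for real cudaMalloc/cudaFree calls. It also assumes that k3 and k4 need a working buffer of their own, since that is the case where releasing A early lowers the peak.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical tracker: counts live bytes and records the high-water mark.
struct Tracker {
    std::size_t live = 0;
    std::size_t peak = 0;
    void alloc(std::size_t n) { live += n; peak = std::max(peak, live); }
    void release(std::size_t n) { assert(n <= live); live -= n; }
};

// Hypothetical buffer: a real version would call cudaMalloc in the
// constructor and cudaFree in the destructor.
class Buffer {
public:
    Buffer(Tracker &t, std::size_t n) : t_(t), n_(n) { t_.alloc(n_); }
    ~Buffer() { t_.release(n_); }
    Buffer(const Buffer &) = delete;
    Buffer &operator=(const Buffer &) = delete;
private:
    Tracker &t_;
    std::size_t n_;
};

// Simulates k1..k4 and returns the peak number of live bytes.
// k1 and k2 access A; k3 and k4 only use their own scratch buffer (assumed).
std::size_t run_sequence(bool release_early,
                         std::size_t a_bytes,
                         std::size_t scratch_bytes) {
    Tracker t;
    auto A = std::make_unique<Buffer>(t, a_bytes);
    // ... k1 and k2 would run here and read/write A ...
    if (release_early) {
        A.reset();  // free A right after its last use (after k2)
    }
    {
        Buffer scratch(t, scratch_bytes);
        // ... k3 and k4 would run here without touching A ...
    }
    return t.peak;  // A, if still alive, is freed when the function returns
}
```

With a_bytes = 100 and scratch_bytes = 60, the late-release variant peaks at 160 bytes while the early-release variant peaks at 100, because A and the scratch buffer are never alive at the same time.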
Any opinion?
I looked at the main objects that might be released, and the mesh's cached GeometricFactors could be destroyed or generated in a temporary memory area after the setup stage.
There are also specific functions that users can call to remove certain tables. I see how this may reduce peak memory consumption; however, it would require certain allocations during time stepping, something this miniapp has explicitly avoided.
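One way to picture the trade-off between early release and allocation-free time stepping is a scratch pool that is reserved once during setup and reused on every step. The sketch below is plain C++ and is not MFEM's or Umpire's API: ScratchPool and max_pool_use are hypothetical names, host memory stands in for device memory, and alignment is ignored for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump-style scratch pool: a single upfront reservation
// that temporaries borrow from, so stepping performs no new allocations.
class ScratchPool {
public:
    explicit ScratchPool(std::size_t capacity)
        : storage_(capacity), offset_(0) {}

    // Hand out n bytes from the reserved block (alignment ignored here).
    void *take(std::size_t n) {
        assert(offset_ + n <= storage_.size());
        void *p = storage_.data() + offset_;
        offset_ += n;
        return p;
    }

    // Make the whole block available again for the next step.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t offset_;
};

// Simulates `steps` time steps that each need temp_bytes of scratch space
// and returns the largest number of pool bytes in use at any moment.
std::size_t max_pool_use(int steps, std::size_t temp_bytes,
                         std::size_t capacity) {
    ScratchPool pool(capacity);  // the only allocation, done during setup
    std::size_t high_water = 0;
    for (int i = 0; i < steps; ++i) {
        pool.take(temp_bytes);   // temporary used by this step
        high_water = std::max(high_water, pool.used());
        pool.reset();            // released for reuse, not freed
    }
    return high_water;
}
```

In this sketch the footprint of the temporaries is bounded by what a single step needs, regardless of the number of steps, and the time-stepping loop itself never calls an allocator.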
Please let us know when the GPU profiling tool becomes available; I'd be happy to give it a try and see how it operates.