
Excessive device memory wastage

Open Lin-Mao opened this issue 2 years ago • 4 comments

Hi developers, our team has developed a GPU device memory profiling tool and found excessive device memory wastage in Laghos. For example, the vectors q_dt_est, q_e, e_vec, q_dx, and q_dv in class QUpdate are used only in a very early stage, but they are not released until the QUpdate object itself is destroyed, and there are other similar cases. I tried some naive optimizations, such as explicitly releasing objects that are used only in the early stage yet stay allocated for the application's entire lifetime, and this reduced the peak device memory usage by at least 35%. Could you please look into optimizing these kinds of inefficiencies?

Lin-Mao • Jun 02 '22 22:06

Thank you for taking the time to investigate memory utilization.

Most objects are allocated and computed during setup and are specifically designed to remain on the device to avoid host-device memory transfers.

When used in conjunction with Umpire, MFEM's memory management now supports temporary memory pools. Where possible, this capability could be advantageous for some of the largest vectors.

What GPU memory profiling software do you use?

camierjs • Jun 08 '22 01:06

Hello, thanks so much for your response. I use a GPU profiling tool developed by our team; it is still under development and will be open-sourced once it is finished.

I may have been ambiguous before. We found that some objects (actual cudaMalloc allocations) are never used after a certain kernel, yet they are not released until the application ends. We can free those objects right after the last kernel that accesses them, which does not introduce any later memory transfers but does lower the peak memory usage. For example, suppose there is an object A and a kernel sequence k1, k2, k3, k4, where k1 and k2 access A but k3 and k4 do not. The original application releases A only after all kernels have executed (after k4), but we can release A right after k2, which reduces the memory usage during k3 and k4.

Lin-Mao • Jun 08 '22 18:06

Any thoughts?

Lin-Mao • Jun 15 '22 18:06

I looked at the main objects that might be released, and the mesh's cached GeometricFactors could either be destroyed after the setup stage or be generated in a temporary memory area.

There are also specific functions that users can call to remove certain tables. I see how this could reduce peak memory consumption; however, it would require some allocations during time stepping, which this miniapp explicitly avoids.

Please let us know when the GPU profiling tool becomes available: I'd be happy to give it a try and see how it works.

camierjs • Jun 16 '22 21:06