mahout
[QDP] improve memory management
Summary
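We are paying a lot for cudaMalloc and cudaFree. I think we should replace the current allocation method with a staging buffer pool. The profile below shows the cost: the :GPU::Alloc range takes about 9.7 ms (21.4% in the NVTX summary), and cuMemAllocAsync accounts for 56.8% of the time in the CUDA API summary.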
NVTX Range Summary (nvtx_sum):
Time (%)  Total Time (ns)  Instances    Avg (ns)    Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Style     Range
--------  ---------------  ---------  ----------  ----------  --------  --------  -----------  --------  ------------------
    50.2         22777979          1  22777979.0  22777979.0  22777979  22777979          0.0  StartEnd  :Mahout::Encode
    21.8          9875718          1   9875718.0   9875718.0   9875718   9875718          0.0  StartEnd  :CPU::L2Norm
    21.4          9712611          1   9712611.0   9712611.0   9712611   9712611          0.0  StartEnd  :GPU::Alloc
     5.8          2631884          1   2631884.0   2631884.0   2631884   2631884          0.0  StartEnd  :GPU::H2DCopy
     0.7           340224          1    340224.0    340224.0    340224    340224          0.0  StartEnd  :GPU::KernelLaunch
     0.1            42253          1     42253.0     42253.0     42253     42253          0.0  StartEnd  :GPU::Synchronize
     0.0              336          1       336.0       336.0       336       336          0.0  StartEnd  :DLPack::Wrap
Processing [benchmark/nvtx_profile_obs.sqlite] with [/opt/nvidia/nsight-systems/2024.5.1/host-linux-x64/reports/cuda_api_sum.py]...
CUDA API Summary (cuda_api_sum):
Time (%)  Total Time (ns)  Num Calls   Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------
    56.8          6233567          2  3116783.5  3116783.5     16689   6216878    4384195.7  cuMemAllocAsync
    23.0          2523663          1  2523663.0  2523663.0   2523663   2523663          0.0  cuMemcpyHtoDAsync_v2
    15.5          1700184          2   850092.0   850092.0    123903   1576281    1026986.3  cudaMemGetInfo
     3.0           331958          1   331958.0   331958.0    331958    331958          0.0  cudaLaunchKernel
     0.9           103171          2    51585.5    51585.5     38644     64527      18302.0  cuStreamSynchronize
     0.4            48043        412      116.6       88.0        53      4443        224.1  cuGetProcAddress_v2
     0.2            19526          9     2169.6     1614.0       193      7217       2304.0  cuCtxSetCurrent
     0.0             4507          2     2253.5     2253.5      1239      3268       1434.7  cuMemFreeAsync
     0.0             1956          1     1956.0     1956.0      1956      1956          0.0  cuInit
     0.0             1489          1     1489.0     1489.0      1489      1489          0.0  cuEventCreate
     0.0              788          1      788.0      788.0       788       788          0.0  cuEventDestroy_v2