[QDP] improve memory management

Open rich7420 opened this issue 1 month ago • 0 comments

Summary

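We are paying a lot for cudaMalloc and cudaFree. In the profile below, the GPU::Alloc range takes about 9.7 ms (21.4% in the NVTX summary), and cuMemAllocAsync accounts for 56.8% of the CUDA API time. I think we should replace the current allocation approach with a staging buffer pool.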

NVTX Range Summary (nvtx_sum):

 Time (%)  Total Time (ns)  Instances   Avg (ns)    Med (ns)   Min (ns)  Max (ns)  StdDev (ns)   Style          Range       
 --------  ---------------  ---------  ----------  ----------  --------  --------  -----------  --------  ------------------
     50.2         22777979          1  22777979.0  22777979.0  22777979  22777979          0.0  StartEnd  :Mahout::Encode   
     21.8          9875718          1   9875718.0   9875718.0   9875718   9875718          0.0  StartEnd  :CPU::L2Norm      
     21.4          9712611          1   9712611.0   9712611.0   9712611   9712611          0.0  StartEnd  :GPU::Alloc       
      5.8          2631884          1   2631884.0   2631884.0   2631884   2631884          0.0  StartEnd  :GPU::H2DCopy     
      0.7           340224          1    340224.0    340224.0    340224    340224          0.0  StartEnd  :GPU::KernelLaunch
      0.1            42253          1     42253.0     42253.0     42253     42253          0.0  StartEnd  :GPU::Synchronize 
      0.0              336          1       336.0       336.0       336       336          0.0  StartEnd  :DLPack::Wrap     

Processing [benchmark/nvtx_profile_obs.sqlite] with [/opt/nvidia/nsight-systems/2024.5.1/host-linux-x64/reports/cuda_api_sum.py]... 

  CUDA API Summary (cuda_api_sum): 

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)           Name         
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------
     56.8          6233567          2  3116783.5  3116783.5     16689   6216878    4384195.7  cuMemAllocAsync       
     23.0          2523663          1  2523663.0  2523663.0   2523663   2523663          0.0  cuMemcpyHtoDAsync_v2  
     15.5          1700184          2   850092.0   850092.0    123903   1576281    1026986.3  cudaMemGetInfo        
      3.0           331958          1   331958.0   331958.0    331958    331958          0.0  cudaLaunchKernel      
      0.9           103171          2    51585.5    51585.5     38644     64527      18302.0  cuStreamSynchronize   
      0.4            48043        412      116.6       88.0        53      4443        224.1  cuGetProcAddress_v2   
      0.2            19526          9     2169.6     1614.0       193      7217       2304.0  cuCtxSetCurrent       
      0.0             4507          2     2253.5     2253.5      1239      3268       1434.7  cuMemFreeAsync        
      0.0             1956          1     1956.0     1956.0      1956      1956          0.0  cuInit                
      0.0             1489          1     1489.0     1489.0      1489      1489          0.0  cuEventCreate         
      0.0              788          1      788.0      788.0       788       788          0.0  cuEventDestroy_v2
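
To make the staging buffer pool idea from the summary concrete, here is a minimal sketch of the pooling logic only. It is not QDP code: `StagingBufferPool`, `acquire`, `release`, `drain`, and the `alloc`/`free` callbacks are all hypothetical names, and `bytearray` stands in for whatever device or pinned host allocation the real implementation would wrap. The sketch assumes buffers are bucketed by power-of-two size class, which is one common choice and not something the issue specifies.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class StagingBufferPool:
    """Hypothetical pool: reuses idle buffers by size class to avoid repeated allocator calls."""

    def __init__(self, alloc: Callable[[int], object],
                 free: Callable[[object], None], min_bytes: int = 256):
        self._alloc = alloc          # stand-in for the real allocator (assumption)
        self._free = free            # stand-in for the real deallocator (assumption)
        self._min_bytes = min_bytes
        self._idle: Dict[int, List[object]] = defaultdict(list)

    def _size_class(self, nbytes: int) -> int:
        # Round up to the next power of two, with a floor of min_bytes.
        size = self._min_bytes
        while size < nbytes:
            size *= 2
        return size

    def acquire(self, nbytes: int) -> Tuple[int, object]:
        # Reuse an idle buffer of the right class if one exists, else allocate.
        size = self._size_class(nbytes)
        bucket = self._idle[size]
        buf = bucket.pop() if bucket else self._alloc(size)
        return size, buf

    def release(self, size: int, buf: object) -> None:
        # Return the buffer to the pool instead of freeing it.
        self._idle[size].append(buf)

    def drain(self) -> None:
        # Free every idle buffer, e.g. at shutdown.
        for bucket in self._idle.values():
            while bucket:
                self._free(bucket.pop())


# Count allocator traffic for 100 acquire/release cycles of the same size.
calls = {"alloc": 0, "free": 0}

def counting_alloc(nbytes: int) -> bytearray:
    calls["alloc"] += 1
    return bytearray(nbytes)

def counting_free(buf: object) -> None:
    calls["free"] += 1

pool = StagingBufferPool(counting_alloc, counting_free)
for _ in range(100):
    size, buf = pool.acquire(1000)
    pool.release(size, buf)
pool.drain()
print(calls)  # {'alloc': 1, 'free': 1}
```

With the pool, the 100 requests reach the allocator once instead of 100 times. A real version would also need to handle stream ordering and thread safety, which this sketch leaves out.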

Dec 08 '25 03:12 rich7420