Feature requests for improving reporting mechanism
Hi, I'm planning to use onetrace as part of automation to collect profiling information about GPU-offloaded scientific applications. Since onetrace is a lightweight command-line tool, it is quite handy. However, I found the following missing features which, if added, would greatly enhance its usability from a user perspective.
- Currently, onetrace expects an executable binary as input and fails if given a shell/run-script. It would be nice to support running onetrace on run-scripts. This is a common usage pattern, since most apps ship a shell script that sets the relevant environment variables before launching the binary, and with the current limitation it is cumbersome to integrate onetrace into an automation framework.
- For device time, separate out kernel execution time and memory transfer times.
- We currently have several offload mechanisms (OpenMP offload, OpenCL, DPC++). Have the kernel section broken out by API type.
- Have a unique header tag in the log file (something like: onetrace version X.Y) to identify the log file as a onetrace-produced file for easier post-processing (a detection sketch follows the log below).
- For (2) and (3), we can look at the output of nsys, the onetrace equivalent for Nvidia. I'm attaching a sample log produced by nsys for one of our apps running on Nvidia GPUs. Ideally, it would be great if onetrace could provide the stats that nsys reports (especially device transfer cost/bandwidth stats); a sketch of the post-processing we do on these reports today follows the log.
```
Using report7.sqlite for SQL queries.

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/cudaapisum.py report7.sqlite]...

Time (%)  Total Time (ns)  Num Calls      Avg (ns)     Med (ns)   Min (ns)     Max (ns)   StdDev (ns)  Name
--------  ---------------  ---------  ------------  -----------  ---------  -----------  ------------  ---------------------
    49.7      647,951,615        401   1,615,839.4  1,780,889.0  1,291,574    1,941,593     313,742.7  cudaDeviceSynchronize
    37.4      488,480,350        103   4,742,527.7  1,471,249.0  1,459,420  140,348,963  19,251,795.7  cudaMemcpy
    12.6      164,669,246          4  41,167,311.5    666,255.0     63,206  163,273,530  81,404,642.1  cudaMalloc
     0.2        2,064,936          4     516,234.0    628,231.5     46,267      762,206     320,452.5  cudaFree
     0.1        1,608,923        501       3,211.4      2,987.0      2,721       30,239       1,554.2  cudaLaunchKernel

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/gpusum.py report7.sqlite]...

Time (%)  Total Time (ns)  Instances     Avg (ns)     Med (ns)   Min (ns)     Max (ns)   StdDev (ns)  Category     Operation
--------  ---------------  ---------  -----------  -----------  ---------  -----------  ------------  -----------  ----------------------------------------------------------
    30.1      340,984,091        103  3,310,525.2      1,696.0      1,663  140,173,958  19,465,283.5  MEMORY_OPER  [CUDA memcpy DtoH]
    17.0      192,554,900        100  1,925,549.0  1,925,416.5  1,921,656    1,929,816       1,717.0  CUDA_KERNEL  void add_kernel<double>(const T1 *, const T1 *, T1 *)
    17.0      192,464,685        100  1,924,646.9  1,924,280.5  1,921,080    1,928,312       1,778.7  CUDA_KERNEL  void triad_kernel<double>(T1 *, const T1 *, const T1 *)
    12.9      145,723,283        100  1,457,232.8  1,457,242.0  1,446,138    1,471,835       5,201.3  CUDA_KERNEL  void dot_kernel<double>(const T1 *, const T1 *, T1 *, int)
    11.5      130,498,379        100  1,304,983.8  1,305,034.5  1,301,627    1,309,051       1,463.7  CUDA_KERNEL  void mul_kernel<double>(T1 *, const T1 *)
    11.4      129,086,388        100  1,290,863.9  1,290,826.0  1,287,547    1,303,259       1,817.5  CUDA_KERNEL  void copy_kernel<double>(const T1 *, T1 *)
     0.2        1,777,689          1  1,777,689.0  1,777,689.0  1,777,689    1,777,689           0.0  CUDA_KERNEL  void init_kernel<double>(T1 *, T1 *, T1 *, T1, T1, T1)

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/gpumemsizesum.py report7.sqlite]...

Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)   Max (MB)  StdDev (MB)  Operation
----------  -----  --------  --------  --------  ---------  -----------  ------------------
 3,221.430    103    31.276     0.002     0.002  1,073.742      181.443  [CUDA memcpy DtoH]

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/gpumemtimesum.py report7.sqlite]...

Time (%)  Total Time (ns)  Count     Avg (ns)  Med (ns)  Min (ns)     Max (ns)   StdDev (ns)  Operation
--------  ---------------  -----  -----------  --------  --------  -----------  ------------  ------------------
   100.0      340,984,091    103  3,310,525.2   1,696.0     1,663  140,173,958  19,465,283.5  [CUDA memcpy DtoH]

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/openaccsum.py report7.sqlite]... SKIPPED: report7.sqlite does not contain OpenACC event data.
Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/openmpevtsum.py report7.sqlite]... SKIPPED: report7.sqlite does not contain OpenMP event data.
```
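For reference, the summary above is roughly what we reconstruct today by querying the nsys sqlite export directly, and it shows the kernel-vs-transfer split requested in (2). Below is a rough sketch of that post-processing; the `CUPTI_ACTIVITY_KIND_KERNEL` / `CUPTI_ACTIVITY_KIND_MEMCPY` table names follow the nsys export schema I have on hand and may differ across nsys versions, so treat this as illustrative rather than exact:

```python
import sqlite3
import statistics

# Rough sketch (not the onetrace implementation): derive a per-category
# device-time summary from an nsys sqlite export. Table names follow the
# nsys export schema and may vary between nsys versions.
QUERIES = {
    "CUDA kernels": "SELECT end - start FROM CUPTI_ACTIVITY_KIND_KERNEL",
    "CUDA memcpy":  "SELECT end - start FROM CUPTI_ACTIVITY_KIND_MEMCPY",
}

def summarize(db_path: str) -> None:
    con = sqlite3.connect(db_path)
    for label, query in QUERIES.items():
        durations = [row[0] for row in con.execute(query)]  # nanoseconds
        if not durations:
            continue
        stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
        print(f"{label}: total={sum(durations):,} ns, count={len(durations)}, "
              f"avg={statistics.mean(durations):,.1f} ns, "
              f"med={statistics.median(durations):,.1f} ns, "
              f"min={min(durations):,} ns, max={max(durations):,} ns, "
              f"stdev={stdev:,.1f} ns")
    con.close()

if __name__ == "__main__":
    summarize("report7.sqlite")
```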
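Point (4) is about enabling exactly this kind of automation on the onetrace side. A minimal sketch, assuming the proposed (not yet existing) `onetrace version X.Y` header line:

```python
# Minimal sketch for point (4). The "onetrace version" header tag is the
# proposal above, not an existing onetrace feature.
def is_onetrace_log(path: str) -> bool:
    """Return True if the file starts with the proposed onetrace header tag."""
    with open(path, "r", errors="replace") as f:
        return f.readline().startswith("onetrace version")
```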