Feature requests for improving the reporting mechanism

Open · vamsi-sripathi opened this issue 3 years ago • 1 comment

Hi, I'm planning to use onetrace as part of an automation framework to collect profiling information about GPU-offloaded scientific applications. Since onetrace is a lightweight command-line tool, it is quite handy. However, I found the following missing features which, if added, would greatly enhance its usability from a user perspective.

  1. Currently, onetrace expects an executable binary as input and fails if given a shell/run-script. It would be nice to have support for using onetrace with run-scripts. This is a common usage pattern, since most apps have a shell script that sets the relevant environment variables before launching the binary; with the current limitation, it is cumbersome to integrate onetrace into an automation framework. (A possible interim workaround is sketched right after the attached log below.)
  2. For device time, report kernel execution time and memory transfer time separately.
  3. We currently have several offload mechanisms - OpenMP offload, OpenCL, DPC++. Have the kernel section broken out by API type.
  4. Have a unique header tag in the log file (something like: onetrace version X.Y) to identify the log file as an onetrace-produced file for easier post-processing.
  5. For (2) and (3), we can consider the output provided by onetrace's NVIDIA equivalent, nsys, as a reference. I'm attaching a sample log produced by nsys for one of our apps running on NVIDIA GPUs. Ideally, it would be great if onetrace could provide the stats reported by nsys (especially the device transfer cost/bandwidth stats, etc.); a post-processing sketch along these lines follows the attached log.
Using report7.sqlite for SQL queries.
Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/cudaapisum.py report7.sqlite]...

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)    Max (ns)    StdDev (ns)           Name
 --------  ---------------  ---------  ------------  -----------  ---------  -----------  ------------  ---------------------
     49.7      647,951,615        401   1,615,839.4  1,780,889.0  1,291,574    1,941,593     313,742.7  cudaDeviceSynchronize
     37.4      488,480,350        103   4,742,527.7  1,471,249.0  1,459,420  140,348,963  19,251,795.7  cudaMemcpy
     12.6      164,669,246          4  41,167,311.5    666,255.0     63,206  163,273,530  81,404,642.1  cudaMalloc
      0.2        2,064,936          4     516,234.0    628,231.5     46,267      762,206     320,452.5  cudaFree
      0.1        1,608,923        501       3,211.4      2,987.0      2,721       30,239       1,554.2  cudaLaunchKernel

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/gpusum.py report7.sqlite]...

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)    Max (ns)    StdDev (ns)    Category                            Operation
 --------  ---------------  ---------  -----------  -----------  ---------  -----------  ------------  -----------  ----------------------------------------------------------
     30.1      340,984,091        103  3,310,525.2      1,696.0      1,663  140,173,958  19,465,283.5  MEMORY_OPER  [CUDA memcpy DtoH]
     17.0      192,554,900        100  1,925,549.0  1,925,416.5  1,921,656    1,929,816       1,717.0  CUDA_KERNEL  void add_kernel<double>(const T1 *, const T1 *, T1 *)
     17.0      192,464,685        100  1,924,646.9  1,924,280.5  1,921,080    1,928,312       1,778.7  CUDA_KERNEL  void triad_kernel<double>(T1 *, const T1 *, const T1 *)
     12.9      145,723,283        100  1,457,232.8  1,457,242.0  1,446,138    1,471,835       5,201.3  CUDA_KERNEL  void dot_kernel<double>(const T1 *, const T1 *, T1 *, int)
     11.5      130,498,379        100  1,304,983.8  1,305,034.5  1,301,627    1,309,051       1,463.7  CUDA_KERNEL  void mul_kernel<double>(T1 *, const T1 *)
     11.4      129,086,388        100  1,290,863.9  1,290,826.0  1,287,547    1,303,259       1,817.5  CUDA_KERNEL  void copy_kernel<double>(const T1 *, T1 *)
      0.2        1,777,689          1  1,777,689.0  1,777,689.0  1,777,689    1,777,689           0.0  CUDA_KERNEL  void init_kernel<double>(T1 *, T1 *, T1 *, T1, T1, T1)

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/gpumemsizesum.py report7.sqlite]...

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)   StdDev (MB)      Operation
 ----------  -----  --------  --------  --------  ---------  -----------  ------------------
  3,221.430    103    31.276     0.002     0.002  1,073.742      181.443  [CUDA memcpy DtoH]

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/gpumemtimesum.py report7.sqlite]...

 Time (%)  Total Time (ns)  Count   Avg (ns)    Med (ns)  Min (ns)   Max (ns)    StdDev (ns)       Operation
 --------  ---------------  -----  -----------  --------  --------  -----------  ------------  ------------------
    100.0      340,984,091    103  3,310,525.2   1,696.0     1,663  140,173,958  19,465,283.5  [CUDA memcpy DtoH]

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/openaccsum.py report7.sqlite]... SKIPPED: report7.sqlite does not contain OpenACC event data.

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/openmpevtsum.py report7.sqlite]... SKIPPED: report7.sqlite does not contain OpenMP event data.
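
Regarding item (1): until run-script support lands, one possible interim workaround on the automation side is to replicate the run-script's environment and then hand onetrace the binary directly. A minimal sketch in Python; the binary name, its arguments, and the app-specific environment variable are hypothetical, and the -d/--device-timing option is per the onetrace README:

```python
# Hypothetical workaround for item (1): reproduce the runscript's environment
# in the automation layer, then invoke onetrace on the binary itself.
import os
import subprocess

env = os.environ.copy()
env["OMP_NUM_THREADS"] = "8"         # settings the runscript would have exported
env["APP_INPUT_FILE"] = "input.dat"  # hypothetical app-specific variable

# onetrace [options] <app> <args>; -d/--device-timing reports kernel times
subprocess.run(["onetrace", "-d", "./app_binary", "100", "1000"],
               env=env, check=True)
```

This keeps onetrace pointed at the actual binary while the automation owns the environment setup.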

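To make items (2)-(5) concrete, below is a minimal post-processing sketch assuming a hypothetical future onetrace log format: a "onetrace version X.Y" first line (item 4) followed by one "<category> <kernel-or-transfer-name> <duration_ns>" record per line. Neither the format nor the script reflects onetrace's current output; it just illustrates the nsys-style per-operation summary (items 2, 3, 5) that would be great to get from onetrace natively:

```python
# Sketch: aggregate hypothetical onetrace records into an nsys-style summary.
import statistics
import sys
from collections import defaultdict

def summarize(path):
    durations = defaultdict(list)  # (category, operation name) -> [ns, ...]
    with open(path) as f:
        # Item (4): a version header identifies the file as onetrace output.
        if not f.readline().startswith("onetrace version"):
            sys.exit(f"{path}: missing onetrace header, refusing to parse")
        for line in f:
            if not line.strip():
                continue
            category, rest = line.split(maxsplit=1)  # e.g. KERNEL / MEMCPY
            name, ns = rest.rsplit(maxsplit=1)
            durations[(category, name)].append(int(ns))

    grand_total = sum(sum(v) for v in durations.values())
    print("Time (%)  Total (ns)  Count  Avg (ns)  Med (ns)  "
          "Min (ns)  Max (ns)  StdDev (ns)  Category  Operation")
    # Items (2)/(3): kernels and transfers stay in separate categories.
    for (category, name), ns in sorted(durations.items(),
                                       key=lambda kv: -sum(kv[1])):
        total = sum(ns)
        print(f"{100 * total / grand_total:8.1f}  {total:>12,}  {len(ns):>5}  "
              f"{total / len(ns):>12,.1f}  {statistics.median(ns):>12,.1f}  "
              f"{min(ns):>10,}  {max(ns):>10,}  "
              f"{statistics.pstdev(ns):>12,.1f}  {category:<10}  {name}")

if __name__ == "__main__":
    summarize(sys.argv[1])
```

As a bonus, note that the gpumemsizesum and gpumemtimesum tables above together already yield the average effective DtoH bandwidth: 3,221.430 MB / 0.341 s ≈ 9.4 GB/s. Derived stats like that are exactly what item (5) is after.
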
vamsi-sripathi · Jun 09 '22 00:06