kineto Fail to profile CUDA activities when ProfilerActivity.CPU is not enabled

Describe the issue

Profiler fails to profile CUDA, and only CPU time is printed as the result. When CUDA is set as the only profiled activity, the following message is produced:

Warning: Failed to recover relationship between all profiler and kineto events: 100 vs. 0  reassociated. (function reassociate)

Minimum code to reproduce

Case 1: set activities to only ProfilerActivity.CUDA

import torch
from torch.profiler import profile, record_function, ProfilerActivity
prof = profile(
        activities=[ProfilerActivity.CUDA],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=10, repeat=1),        
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/model'),
        profile_memory=True,
        record_shapes=True,
        with_stack=True)

prof.start()
for _ in range(100):
    y = torch.randn(1).cuda() + torch.randn(1).cuda()
    prof.step()
            
print(prof.key_averages())

prof.stop()

Case 2: set activities to [ProfilerActivity.CPU, ProfilerActivity.CUDA]

prof = profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
...

Code Result

Case 1

STAGE:2024-01-21 17:27:31 1356:1356 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-21 17:27:31 1356:1356 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-21 17:27:31 1356:1356 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
[W collection.cpp:929] Warning: Failed to recover relationship between all profiler and kineto events: 100 vs. 0  reassociated. (function reassociate)
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                 [memory]         0.00%       0.000us         0.00%       0.000us       0.000us           0 b           0 b           100
          cudaMemcpyAsync        23.86%     172.000us        23.86%     172.000us       8.600us           0 b           0 b            20
    cudaStreamSynchronize        61.17%     441.000us        61.17%     441.000us      22.050us           0 b           0 b            20
         cudaLaunchKernel        14.70%     106.000us        14.70%     106.000us      10.600us           0 b           0 b            10
    cudaDeviceSynchronize         0.28%       2.000us         0.28%       2.000us       2.000us           0 b           0 b             1
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 721.000us

Case 2

STAGE:2024-01-21 17:29:55 1376:1376 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-21 17:29:55 1376:1376 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-21 17:29:55 1376:1376 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
            ProfilerStep*         6.63%     146.000us        99.91%       2.201ms     220.100us           0 b         -80 b           0 b     -17.00 Kb            10
              aten::randn         3.31%      73.000us         5.90%     130.000us       6.500us          80 b           8 b           0 b           0 b            20
              aten::empty         1.68%      37.000us         1.68%      37.000us       1.850us          72 b          72 b           0 b           0 b            20
            aten::normal_         1.50%      33.000us         1.50%      33.000us       1.650us           0 b           0 b           0 b           0 b            20
                 aten::to        12.98%     286.000us        67.82%       1.494ms      74.700us           0 b           0 b      10.00 Kb       2.00 Kb            20
           aten::_to_copy         3.00%      66.000us        65.50%       1.443ms      72.150us           0 b           0 b      10.00 Kb           0 b            20
      aten::empty_strided         4.54%     100.000us         4.54%     100.000us       5.000us           0 b           0 b      10.00 Kb      10.00 Kb            20
              aten::copy_        21.61%     476.000us        57.97%       1.277ms      63.850us           0 b           0 b           0 b           0 b            20
          cudaMemcpyAsync         9.35%     206.000us         9.35%     206.000us      10.300us           0 b           0 b           0 b           0 b            20
    cudaStreamSynchronize        27.01%     595.000us        27.01%     595.000us      29.750us           0 b           0 b           0 b           0 b            20
                aten::add         3.99%      88.000us         8.31%     183.000us      18.300us           0 b           0 b       5.00 Kb       5.00 Kb            10
         cudaLaunchKernel         4.31%      95.000us         4.31%      95.000us       9.500us           0 b           0 b           0 b           0 b            10
    cudaDeviceSynchronize         0.09%       2.000us         0.09%       2.000us       2.000us           0 b           0 b           0 b           0 b             1
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.203ms

Versions

Collecting environment information...
PyTorch version: 2.1.0
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.8.18 (default, Sep 11 2023, 13:40:15)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080
Nvidia driver version: 546.33
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             20
On-line CPU(s) list:                0-19
Vendor ID:                          GenuineIntel
Model name:                         12th Gen Intel(R) Core(TM) i7-12700
CPU family:                         6
Model:                              151
Thread(s) per core:                 2
Core(s) per socket:                 10
Socket(s):                          1
Stepping:                           2
BogoMIPS:                           4223.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm serialize flush_l1d arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          480 KiB (10 instances)
L1i cache:                          320 KiB (10 instances)
L2 cache:                           12.5 MiB (10 instances)
L3 cache:                           25 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==2.1.0
[pip3] torch-cluster==1.6.3
[pip3] torch_geometric==2.4.0
[pip3] torch-scatter==2.1.2
[pip3] torch-sparse==0.6.18
[pip3] torch-tb-profiler==0.4.3
[pip3] torchaudio==2.1.0
[pip3] torchvision==0.16.0
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
[conda] mkl                       2023.1.0         h213fc3f_46344
[conda] mkl-service               2.4.0            py38h5eee18b_1
[conda] mkl_fft                   1.3.8            py38h5eee18b_0
[conda] mkl_random                1.2.4            py38hdb19cb5_0
[conda] numpy                     1.24.3           py38hf6e8229_1
[conda] numpy-base                1.24.3           py38h060ed82_1
[conda] pytorch                   2.1.0           py3.8_cuda11.8_cudnn8.7.0_0    pytorch
[conda] pytorch-cluster           1.6.3           py38_torch_2.1.0_cu118    pyg
[conda] pytorch-cuda              11.8                 h7e8668a_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-scatter           2.1.2           py38_torch_2.1.0_cu118    pyg
[conda] pytorch-sparse            0.6.18          py38_torch_2.1.0_cu118    pyg
[conda] torch-geometric           2.4.0                    pypi_0    pypi
[conda] torch-tb-profiler         0.4.3                    pypi_0    pypi
[conda] torchaudio                2.1.0                py38_cu118    pytorch
[conda] torchtriton               2.1.0                      py38    pytorch
[conda] torchvision               0.16.0               py38_cu118    pytorch

Jan 24 '24 15:01 chenlinchuang

To provide better context, i also tried the same code with legacy autograd profiler

Code to reproduce

import torch

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for _ in range(100):
        y = torch.randn(1).cuda() + torch.randn(1).cuda()
            
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Code Result

STAGE:2024-01-21 21:44:48 2136:2136 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-21 21:44:48 2136:2136 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                            aten::to         6.92%       2.056ms        44.29%      13.158ms      65.790us       2.962ms         9.35%      13.700ms      68.500us           200
                           aten::add         2.04%     605.000us        39.60%      11.764ms     117.640us      12.221ms        38.59%      12.221ms     122.210us           100
                      aten::_to_copy        13.34%       3.962ms        37.37%      11.102ms      55.510us       4.156ms        13.12%      10.738ms      53.690us           200
                         aten::randn        13.30%       3.952ms        14.89%       4.425ms      22.125us       3.134ms         9.90%       5.750ms      28.750us           200
                         aten::copy_         2.40%     714.000us        21.46%       6.376ms      31.880us       4.853ms        15.32%       4.853ms      24.265us           200
                 aten::empty_strided         1.67%     496.000us         2.46%     730.000us       3.650us       1.729ms         5.46%       1.729ms       8.645us           200
                       aten::normal_         1.35%     400.000us         1.35%     400.000us       2.000us       1.601ms         5.06%       1.601ms       8.005us           200
                         aten::empty         0.25%      73.000us         0.25%      73.000us       0.365us       1.015ms         3.20%       1.015ms       5.075us           200
    cudaDeviceGetStreamPriorityRange         1.19%     354.000us         1.19%     354.000us     354.000us       0.000us         0.00%       0.000us       0.000us             1
                  cudaGetDeviceCount         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us             2
------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 29.709ms
Self CUDA time total: 31.671ms

Jan 24 '24 15:01 chenlinchuang

I have the same problem.

cuda: 12.3 pytorch:2.1.2

Feb 04 '24 07:02 exitNA

The ops ProfilerStep*, aten::empty, aten::to, aten::add etc. are launched on the CPU so the profiler is working as expected when ProfilerActivity.CPU is not added. The output of the profiler is the expected behavior and not a bug.

Feb 05 '24 20:02 anupambhatnagar

kineto kineto copied to clipboard

Fail to profile CUDA activities when ProfilerActivity.CPU is not enabled

Describe the issue

Minimum code to reproduce

Code Result

Versions

Code to reproduce

Code Result

I have the same problem.

kineto
kineto copied to clipboard