kineto
kineto copied to clipboard
Fail to profile CUDA activities when ProfilerActivity.CPU is not enabled
Describe the issue
Profiler fails to profile CUDA, and only CPU time is printed as the result. When CUDA is set as the only profiled activity, the following message is produced:
Warning: Failed to recover relationship between all profiler and kineto events: 100 vs. 0 reassociated. (function reassociate)
Minimum code to reproduce
- Case 1: set activities to only ProfilerActivity.CUDA
import torch
from torch.profiler import profile, record_function, ProfilerActivity
prof = profile(
activities=[ProfilerActivity.CUDA],
schedule=torch.profiler.schedule(wait=1, warmup=1, active=10, repeat=1),
on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/model'),
profile_memory=True,
record_shapes=True,
with_stack=True)
prof.start()
for _ in range(100):
y = torch.randn(1).cuda() + torch.randn(1).cuda()
prof.step()
print(prof.key_averages())
prof.stop()
- Case 2: set activities to [ProfilerActivity.CPU, ProfilerActivity.CUDA]
prof = profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
...
Code Result
- Case 1
STAGE:2024-01-21 17:27:31 1356:1356 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-21 17:27:31 1356:1356 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-21 17:27:31 1356:1356 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
[W collection.cpp:929] Warning: Failed to recover relationship between all profiler and kineto events: 100 vs. 0 reassociated. (function reassociate)
------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
[memory] 0.00% 0.000us 0.00% 0.000us 0.000us 0 b 0 b 100
cudaMemcpyAsync 23.86% 172.000us 23.86% 172.000us 8.600us 0 b 0 b 20
cudaStreamSynchronize 61.17% 441.000us 61.17% 441.000us 22.050us 0 b 0 b 20
cudaLaunchKernel 14.70% 106.000us 14.70% 106.000us 10.600us 0 b 0 b 10
cudaDeviceSynchronize 0.28% 2.000us 0.28% 2.000us 2.000us 0 b 0 b 1
------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 721.000us
- Case 2
STAGE:2024-01-21 17:29:55 1376:1376 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-21 17:29:55 1376:1376 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-21 17:29:55 1376:1376 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
ProfilerStep* 6.63% 146.000us 99.91% 2.201ms 220.100us 0 b -80 b 0 b -17.00 Kb 10
aten::randn 3.31% 73.000us 5.90% 130.000us 6.500us 80 b 8 b 0 b 0 b 20
aten::empty 1.68% 37.000us 1.68% 37.000us 1.850us 72 b 72 b 0 b 0 b 20
aten::normal_ 1.50% 33.000us 1.50% 33.000us 1.650us 0 b 0 b 0 b 0 b 20
aten::to 12.98% 286.000us 67.82% 1.494ms 74.700us 0 b 0 b 10.00 Kb 2.00 Kb 20
aten::_to_copy 3.00% 66.000us 65.50% 1.443ms 72.150us 0 b 0 b 10.00 Kb 0 b 20
aten::empty_strided 4.54% 100.000us 4.54% 100.000us 5.000us 0 b 0 b 10.00 Kb 10.00 Kb 20
aten::copy_ 21.61% 476.000us 57.97% 1.277ms 63.850us 0 b 0 b 0 b 0 b 20
cudaMemcpyAsync 9.35% 206.000us 9.35% 206.000us 10.300us 0 b 0 b 0 b 0 b 20
cudaStreamSynchronize 27.01% 595.000us 27.01% 595.000us 29.750us 0 b 0 b 0 b 0 b 20
aten::add 3.99% 88.000us 8.31% 183.000us 18.300us 0 b 0 b 5.00 Kb 5.00 Kb 10
cudaLaunchKernel 4.31% 95.000us 4.31% 95.000us 9.500us 0 b 0 b 0 b 0 b 10
cudaDeviceSynchronize 0.09% 2.000us 0.09% 2.000us 2.000us 0 b 0 b 0 b 0 b 1
------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 2.203ms
Versions
Collecting environment information...
PyTorch version: 2.1.0
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080
Nvidia driver version: 546.33
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i7-12700
CPU family: 6
Model: 151
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 2
BogoMIPS: 4223.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm serialize flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 480 KiB (10 instances)
L1i cache: 320 KiB (10 instances)
L2 cache: 12.5 MiB (10 instances)
L3 cache: 25 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==2.1.0
[pip3] torch-cluster==1.6.3
[pip3] torch_geometric==2.4.0
[pip3] torch-scatter==2.1.2
[pip3] torch-sparse==0.6.18
[pip3] torch-tb-profiler==0.4.3
[pip3] torchaudio==2.1.0
[pip3] torchvision==0.16.0
[pip3] triton==2.1.0
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py38h5eee18b_1
[conda] mkl_fft 1.3.8 py38h5eee18b_0
[conda] mkl_random 1.2.4 py38hdb19cb5_0
[conda] numpy 1.24.3 py38hf6e8229_1
[conda] numpy-base 1.24.3 py38h060ed82_1
[conda] pytorch 2.1.0 py3.8_cuda11.8_cudnn8.7.0_0 pytorch
[conda] pytorch-cluster 1.6.3 py38_torch_2.1.0_cu118 pyg
[conda] pytorch-cuda 11.8 h7e8668a_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-scatter 2.1.2 py38_torch_2.1.0_cu118 pyg
[conda] pytorch-sparse 0.6.18 py38_torch_2.1.0_cu118 pyg
[conda] torch-geometric 2.4.0 pypi_0 pypi
[conda] torch-tb-profiler 0.4.3 pypi_0 pypi
[conda] torchaudio 2.1.0 py38_cu118 pytorch
[conda] torchtriton 2.1.0 py38 pytorch
[conda] torchvision 0.16.0 py38_cu118 pytorch
To provide better context, i also tried the same code with legacy autograd profiler
Code to reproduce
import torch
with torch.autograd.profiler.profile(use_cuda=True) as prof:
for _ in range(100):
y = torch.randn(1).cuda() + torch.randn(1).cuda()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
Code Result
STAGE:2024-01-21 21:44:48 2136:2136 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-21 21:44:48 2136:2136 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::to 6.92% 2.056ms 44.29% 13.158ms 65.790us 2.962ms 9.35% 13.700ms 68.500us 200
aten::add 2.04% 605.000us 39.60% 11.764ms 117.640us 12.221ms 38.59% 12.221ms 122.210us 100
aten::_to_copy 13.34% 3.962ms 37.37% 11.102ms 55.510us 4.156ms 13.12% 10.738ms 53.690us 200
aten::randn 13.30% 3.952ms 14.89% 4.425ms 22.125us 3.134ms 9.90% 5.750ms 28.750us 200
aten::copy_ 2.40% 714.000us 21.46% 6.376ms 31.880us 4.853ms 15.32% 4.853ms 24.265us 200
aten::empty_strided 1.67% 496.000us 2.46% 730.000us 3.650us 1.729ms 5.46% 1.729ms 8.645us 200
aten::normal_ 1.35% 400.000us 1.35% 400.000us 2.000us 1.601ms 5.06% 1.601ms 8.005us 200
aten::empty 0.25% 73.000us 0.25% 73.000us 0.365us 1.015ms 3.20% 1.015ms 5.075us 200
cudaDeviceGetStreamPriorityRange 1.19% 354.000us 1.19% 354.000us 354.000us 0.000us 0.00% 0.000us 0.000us 1
cudaGetDeviceCount 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us 2
------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 29.709ms
Self CUDA time total: 31.671ms
I have the same problem.
cuda: 12.3 pytorch:2.1.2
The ops ProfilerStep*
, aten::empty
, aten::to
, aten::add
etc. are launched on the CPU so the profiler is working as expected when ProfilerActivity.CPU
is not added. The output of the profiler is the expected behavior and not a bug.