roctracer
[Issue]: Roctracer GPU Events Have Overlapping Intervals
Problem Description
When running a very small ResNet50 model, I see GPU events on a single track (stream/queue) whose time intervals overlap. This happens most often with specific kernels such as MIOpenBatchNormBwdSpatial and batched_transpose_32x32_dword, which have kind=0x11F0 and op=0. To investigate further, I created a debug branch to inspect what roctracer returns before kineto does any processing: https://github.com/pytorch/kineto/pull/990/files
In this branch I have a debug check that prints several messages similar to the following:
Out of order activity: 1886121463888334 < 1886121463888361. Difference: 27 ns. Kernel: batched_transpose_32x32_dword last Kernel: MIOpenBatchNormFwdTrainSpatialNorml
which suggests that the intervals overlap. In this branch I only check for overlaps among events whose kind is not unknown, but there are many overlaps among the unknown-kind events as well.
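For readers who want to reproduce the check outside kineto, here is a minimal sketch of the kind of per-queue overlap test described above. It is not the code from the debug branch: the `Activity` record, its field names, and `find_overlaps` are hypothetical, and it assumes the two numbers in the log line are the current kernel's start and the previous kernel's end. All other timestamps and the third kernel are made up for illustration.

```python
from collections import namedtuple

# Hypothetical record layout; these field names are not roctracer's.
Activity = namedtuple("Activity", ["queue", "name", "start_ns", "end_ns"])


def find_overlaps(activities):
    """Return (previous, current, overlap_ns) for consecutive activities on
    the same queue where the current one starts before the previous one ends."""
    last_by_queue = {}
    overlaps = []
    # Order by queue, then by start time, so neighbours share a track.
    for act in sorted(activities, key=lambda a: (a.queue, a.start_ns)):
        prev = last_by_queue.get(act.queue)
        if prev is not None and act.start_ns < prev.end_ns:
            overlaps.append((prev, act, prev.end_ns - act.start_ns))
        last_by_queue[act.queue] = act
    return overlaps


# The 334/361 timestamps come from the log line above (assumed to be
# current start and previous end); everything else is illustrative.
events = [
    Activity(0, "MIOpenBatchNormFwdTrainSpatialNorml",
             1886121463888000, 1886121463888361),
    Activity(0, "batched_transpose_32x32_dword",
             1886121463888334, 1886121463888900),
    Activity(1, "other_kernel", 1886121463888100, 1886121463888800),
]

overlaps = find_overlaps(events)
for prev, cur, ns in overlaps:
    print(f"Overlap of {ns} ns: {cur.name} starts before {prev.name} ends")
```

With these inputs the sketch reports a single 27 ns overlap on queue 0; the event on queue 1 is ignored because overlap across different queues is expected.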
Thanks!
Operating System
CentOS Stream 9
CPU
AMD EPYC 7713
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.2.0
ROCm Component
roctracer
Steps to Reproduce
Run a model that launches the kernels specified above and check whether their GPU events overlap.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response