[Issue]: Roctracer GPU Events Have Overlapping Intervals

Open sraikund16 opened this issue 5 months ago • 2 comments

Problem Description

When running a very small ResNet50 model, I see GPU events on a single track (stream/queue) whose time intervals overlap. The overlaps show up most often in a few specific kernels, such as MIOpenBatchNormBwdSpatial and batched_transpose_32x32_dword, which have kind=0x11F0 and op=0. To investigate further, I created a debug branch that shows what roctracer returns before kineto does any processing: https://github.com/pytorch/kineto/pull/990/files

In this branch, a debug check emits several messages similar to the following: Out of order activity: 1886121463888334 < 1886121463888361. Difference: 27 ns. Kernel: batched_transpose_32x32_dword last Kernel: MIOpenBatchNormFwdTrainSpatialNorml. This suggests that the intervals overlap. The branch only checks for overlaps among events whose kind is not unknown, but the unknown-kind events show many overlaps as well.
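For anyone who wants to run a similar check on their own trace data, here is a minimal sketch of the idea in Python. It is not the code from the debug branch. It assumes the check compares each event's start time with the end time of the previous event on the same queue, and that the two numbers in the message above are the current kernel's start and the last kernel's end. The `GpuEvent` record and its field names are hypothetical, and the timestamps are synthetic values chosen to reproduce the 27 ns difference.

```python
from collections import defaultdict
from typing import NamedTuple


class GpuEvent(NamedTuple):
    # Hypothetical record layout for illustration; not roctracer's own struct.
    queue_id: int   # stream/queue the kernel ran on
    begin_ns: int   # start timestamp in nanoseconds
    end_ns: int     # end timestamp in nanoseconds
    name: str       # kernel name


def find_overlaps(events):
    """Return (previous, current, overlap_ns) for each pair of consecutive
    events on the same queue where the current event starts before the
    previous one ends."""
    by_queue = defaultdict(list)
    for ev in events:
        by_queue[ev.queue_id].append(ev)

    overlaps = []
    for queue_events in by_queue.values():
        # Order each queue by start time, then compare neighbours only.
        queue_events.sort(key=lambda ev: ev.begin_ns)
        for prev, cur in zip(queue_events, queue_events[1:]):
            if cur.begin_ns < prev.end_ns:
                overlaps.append((prev, cur, prev.end_ns - cur.begin_ns))
    return overlaps


# Synthetic events: the first two share queue 0 and overlap by 27 ns.
events = [
    GpuEvent(0, 1886121463888000, 1886121463888361,
             "MIOpenBatchNormFwdTrainSpatialNorml"),
    GpuEvent(0, 1886121463888334, 1886121463888900,
             "batched_transpose_32x32_dword"),
    GpuEvent(1, 1886121463888100, 1886121463888200, "other_kernel"),
]

for prev, cur, overlap_ns in find_overlaps(events):
    print(f"{cur.name} starts {overlap_ns} ns before {prev.name} ends")
```

Because this only compares neighbouring events, a long kernel that spans several later kernels would be reported once; tracking the running maximum end time per queue would catch those cases too.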

Thanks!

Operating System

CentOS Stream 9

CPU

AMD EPYC 7713

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.2.0

ROCm Component

roctracer

Steps to Reproduce

Run a model that uses the kernels specified above and check whether their GPU events overlap.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

sraikund16 · Sep 19 '24 16:09