[Issue]: rocprofv2 reports more kernel dispatches than rocprofv1
Problem Description
When using API tracing on vllm inference, rocprof reports fewer kernel dispatch records than rocprofv2. Which result is more likely to be correct? What are the possible reasons for the mismatch between the v1 and v2 kernel records?
Operating System
OS: NAME="Ubuntu" VERSION="20.04.6 LTS (Focal Fossa)"
CPU
CPU: model name : Hygon C86 3380 8-core Processor
GPU
AMD Instinct MI250
ROCm Version
ROCm 5.7.1
ROCm Component
rocprofiler
Steps to Reproduce
Run rocprof and rocprofv2 with hip-trace and kernel-trace on a vllm inference app.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
Hi @hgtsoi, can you provide more detailed reproduction steps? I am not sure exactly what "vllm inference app" refers to. If you are running a specific workload, can you please provide the binary (or, even better, the source repo) so we can reproduce the issue internally and diagnose it?
rocprof traces kernels through a method built into HIP, which effectively amounts to HIP reporting back to rocprof the timing of the kernels it launched. rocprofv2 with --hip-activity does the same, but rocprofv2 with --kernel-trace uses a (more robust) lower-level queue interception method in the HSA library. I suspect the "extra" kernels you are seeing in rocprofv2 come from kernel tracing and have names starting with __amd_rocclr_. These are called BLIT kernels, and HIP frequently uses them in things like the memset routines (basically, imagine a kernel which takes a GPU memory address, a fill value, and a number of bytes, and sets every byte in that range to the fill value). IIRC, hip-trace does not self-report the BLIT kernels it uses back to rocprof.
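One way to check this is to count kernel names in both outputs and see which dispatches appear only in the rocprofv2 trace. Below is a minimal Python sketch of that comparison, not an official tool. The column names ("KernelName" for rocprof, "Kernel_Name" for rocprofv2) are assumptions and should be adjusted to match the headers in your CSV files, and the helper names and sample kernel names are purely illustrative.

```python
import csv
import io
from collections import Counter

# Prefix mentioned above for the BLIT kernels that HIP launches internally.
BLIT_PREFIX = "__amd_rocclr_"


def kernel_counts(csv_text, name_column):
    """Count dispatches per kernel name in a CSV trace.

    name_column is an assumption: check the header row of your own output.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[name_column].strip() for row in reader)


def extra_dispatches(v1_counts, v2_counts):
    """Split the dispatches seen only (or more often) in v2 into BLIT and other."""
    extra = v2_counts - v1_counts  # Counter subtraction keeps positive counts only
    blit = {k: n for k, n in extra.items() if k.startswith(BLIT_PREFIX)}
    other = {k: n for k, n in extra.items() if not k.startswith(BLIT_PREFIX)}
    return blit, other


if __name__ == "__main__":
    # Tiny made-up traces; replace with open(path).read() on your real files.
    v1_csv = "KernelName\nfoo\nfoo\nbar\n"
    v2_csv = (
        "Kernel_Name\nfoo\nfoo\nbar\n"
        "__amd_rocclr_fillBufferAligned\n__amd_rocclr_fillBufferAligned\n"
    )
    blit, other = extra_dispatches(
        kernel_counts(v1_csv, "KernelName"),
        kernel_counts(v2_csv, "Kernel_Name"),
    )
    print("extra BLIT dispatches:", blit)
    print("other extra dispatches:", other)
```

If the "other" dictionary comes back empty, the mismatch is fully explained by BLIT kernels. Note that the two tools may format kernel names slightly differently (for example, with suffixes), in which case the names would need to be normalized before comparing.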
@hgtsoi Side note: if you weren’t aware, there is a new rocprofv3 released in ROCm 6.2 as a beta, which is built on top of the new rocprofiler-sdk (also released in ROCm 6.2 as a beta).
rocprofv2 never officially made it out of the beta stage. For various reasons, we completely re-designed the underlying profiling library (rocprofiler-sdk) and rocprofv3 from scratch.
I’d strongly suggest using rocprofv3 over rocprofv2 at this point. rocprofv3 is very close to feature parity, has lower overhead than v1 and v2, and is significantly better tested.
@hgtsoi closing this ticket due to inactivity. Feel free to reopen it if you still need help.