Enabling Detailed Profiling of Graph Nodes in OmniTrace
Hi, I am currently profiling vLLM with OmniTrace, and I observed that the tool captures the execution of graph kernels at a high level but does not provide detailed insight into the execution of individual graph nodes.
My goal is to obtain detailed profiling information for individual graph nodes, similar to what NVIDIA Nsight offers, where nodes can be tracked instead of just graph-level execution.
I am seeking guidance or a workaround to enable detailed profiling of graph nodes within OmniTrace. Any insights or configuration options?
Here is the command I use:
omnitrace-run -c ~/.omnitrace.cfg --enable-categories device-critical-trace device_busy device_hip device_hsa device_memory_usage python rocm_hip rocm_hsa rocm_smi rocprofiler roctracer --roctracer-hip-activity --roctracer-hip-api --roctracer-hsa-activity --roctracer-hsa-api -- python -m omnitrace -- vllm_benchmark.py
Thanks in advance.
Given that the arrows flow from the API functions to multiple kernels, it appears that you are indeed getting the individual graph node execution. The --roctracer-hsa-activity option that you have set enables that. You might want to remove the --hip-device-activity option (presumably --roctracer-hip-activity in your command) because that is the “high-level” kernel tracing option, and enabling both simultaneously might be doing funny things with the connection of the flow events. It could also contribute to why none of the kernel function names are getting resolved beyond “Kernel Execution”.
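As a rough sketch, the posted command with the HIP activity option dropped might look like the following. This assumes the option meant above is --roctracer-hip-activity, since that is the only HIP activity flag in the posted command. Every other flag is copied unchanged from the original post. The script only prints the command unless RUN=1 is set; RUN is a hypothetical variable used here for illustration, not an OmniTrace setting.

```shell
#!/bin/sh
# Sketch: original command minus --roctracer-hip-activity.
# Assumption: "--hip-device-activity" in the reply refers to this flag.
# The command is stored in the positional parameters so it can be
# printed for review and, optionally, executed exactly as printed.
set -- omnitrace-run -c "$HOME/.omnitrace.cfg" \
  --enable-categories device-critical-trace device_busy device_hip \
    device_hsa device_memory_usage python rocm_hip rocm_hsa rocm_smi \
    rocprofiler roctracer \
  --roctracer-hip-api \
  --roctracer-hsa-activity \
  --roctracer-hsa-api \
  -- python -m omnitrace -- vllm_benchmark.py

# Dry run by default: show the command that would be executed.
printf '%s ' "$@"
echo

# RUN is a hypothetical opt-in switch for this sketch only.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

If the kernel names still show up only as “Kernel Execution” after this change, comparing a trace from this command against one from the original command should at least show whether the two activity options were interfering with each other.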
Hi @OmarSayedMostafa. Do you still need assistance with this ticket? If not, please close the ticket. Thanks!
Hi @OmarSayedMostafa. Closing ticket due to lack of activity. Please feel free to re-open it if you still need assistance. Thanks!