[Unitrace] too many very long calls to zeCommandListAppendMemoryFill on Lunar Lake
Description
I am trying to profile/trace an application that uses the GPU on Lunar Lake. The trace I obtain is unreadable because of very long zeCommandListAppendMemoryFill calls: they appear to take 10 minutes, which is far longer than the whole application runs. This doesn't happen on Intel Arc B580.
An example is:
Environment
>cat /etc/issue
Ubuntu 24.10
>dpkg -l | grep intel
ii intel-fw-gpu 2024.37.5-362~22.04 all Firmware package for Intel integrated and discrete GPUs
ii intel-gpu-tools 1.29-1 amd64 tools for debugging the Intel graphics driver
ii intel-igc-cm 1.0.225.54083-1077~24.04 amd64 Intel(R) C for Metal Compiler -- CM Frontend lib
ii intel-media-va-driver-non-free:amd64 25.1.0-0ubuntu1~ppa1 amd64 VAAPI driver for the Intel GEN8+ Graphics family
ii intel-metrics-discovery 1.13.179-1077~24.04 amd64 Intel(R) Metrics Discovery Application Programming Interface --
ii intel-metrics-library 1.0.182-1077~24.04 amd64 Intel(R) Metrics Library for MDAPI (Intel(R) Metrics Discovery
ii intel-microcode 3.20250211.0ubuntu0.24.10.1 amd64 Processor microcode firmware for Intel CPUs
ii intel-ocloc 24.52.32224.14-1077~24.04 amd64 Tool for managing Intel Compute GPU device binary format
ii intel-opencl-icd 24.52.32224.14-1077~24.04 amd64 Intel graphics compute runtime for OpenCL
ii libchewing3:amd64 0.9.0-1 amd64 intelligent phonetic input method library
ii libchewing3-data 0.9.0-1 all intelligent phonetic input method library - data files
ii libdrm-intel1:amd64 2.4.122-1 amd64 Userspace interface to intel-specific kernel DRM services -- runtime
ii libze-intel-gpu1 24.52.32224.14-1077~24.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii xserver-xorg-video-intel 2:2.99.917+git20210115-1build1 amd64 X.Org X server -- Intel i8xx, i9xx display driver
> sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) Graphics 20.4.4 [1.6.32224+14]
@s-Nick Thank you for reporting this issue. We are triaging it now.
@s-Nick The device timestamps are incorrect on LNL. This issue does not reproduce on Linux or Windows + Arrow Lake.
I've confirmed that the workaround provided in https://github.com/intel/pti-gpu/pull/93 goes a long way toward solving this issue. Some functions can still appear with a much longer duration than is possible, but re-running unitrace or increasing MAX_RETRY in the workaround resolves it.
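
In case it helps anyone decide whether a run needs to be repeated, here is a rough sketch of a check for calls that last longer than the application itself. It is not part of unitrace or of the workaround. It assumes the trace was saved as Chrome trace-event JSON, where complete events have `"ph": "X"` and `ts`/`dur` values in microseconds. The function names and the `app_runtime_us` parameter are hypothetical, and the application runtime has to be measured separately.

```python
import json


def load_trace_events(path):
    """Load events from a Chrome trace-event JSON file (assumed format)."""
    with open(path) as f:
        data = json.load(f)
    # The format allows either a bare list of events or an object
    # that holds the list under the "traceEvents" key.
    return data["traceEvents"] if isinstance(data, dict) else data


def find_implausible_events(events, app_runtime_us):
    """Return complete events whose duration exceeds the application runtime.

    app_runtime_us is the wall-clock runtime of the traced application in
    microseconds, measured outside the trace (hypothetical parameter).
    """
    return [
        e for e in events
        if e.get("ph") == "X" and e.get("dur", 0) > app_runtime_us
    ]
```

For example, with an application that ran for 30 seconds, `find_implausible_events(load_trace_events("trace.json"), 30_000_000)` would list the calls whose reported duration cannot be right (`trace.json` is a placeholder file name). A non-empty result suggests re-running unitrace or raising MAX_RETRY.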