level-zero-tests
"ze_peak" freezes on DG1 with latest drm-tip kernel + drivers
Setup:
- HW: CML-S / DG1 (0x4905)
- OS: Ubuntu 22.04
- Kernel: "drm-tip" head from yesterday
- UMD: Latest releases of compute stack components, built with LLVM 12
- App: "ze_peak" from level-zero-tests head
Bug:
./ze_peak freezes with 99% CPU usage after showing:
Single Precision Compute (GFLOPS)
(I.e. the half precision and global BW tests before it worked fine.)
It can be quit with ^C, so it's not in a 100% CPU loop.
Gdb shows:
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
0x00007f6fbca28cab in sched_yield () from target:/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f6fbca28cab in sched_yield () from target:/lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f6fbc27cd63 in ?? () from target:/usr/local/lib/libze_intel_gpu.so.1
#2 0x00007f6fbc0572c2 in ?? () from target:/usr/local/lib/libze_intel_gpu.so.1
#3 0x0000564de2c87d3f in ?? ()
#4 0x0000564de2c88653 in ?? ()
#5 0x0000564de2c94ba1 in ?? ()
#6 0x0000564de2c86104 in ?? ()
#7 0x00007f6fbc949d90 in ?? () from target:/lib/x86_64-linux-gnu/libc.so.6
#8 0x00007f6fbc949e40 in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#9 0x0000564de2c862e5 in ?? ()
perf showed most of the time being spent inside libze_intel_gpu.so.1. I.e. it could be a driver issue, but I thought it better to start from the app.
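For illustration only: the backtrace ends in sched_yield() called from libze_intel_gpu.so.1, which is consistent with a wait loop that polls for completion and yields between polls. The driver frames are unresolved, so the following is a minimal sketch of that general pattern, not the driver's code. The function name `wait_with_yield` and its parameters are hypothetical.

```python
import os
import time


def wait_with_yield(is_done, timeout_s):
    """Poll is_done() until it returns True or timeout_s elapses.

    The thread yields the CPU between polls, so it stays runnable and
    shows up as ~100% CPU while still being interruptible with ^C.
    """
    # os.sched_yield() is not available on every platform; fall back to
    # a zero-length sleep where it is missing.
    yield_cpu = getattr(os, "sched_yield", lambda: time.sleep(0))
    deadline = time.monotonic() + timeout_s
    while not is_done():
        if time.monotonic() >= deadline:
            return False  # let the caller report a timeout instead of hanging
        yield_cpu()
    return True


if __name__ == "__main__":
    # A completion signal that never arrives, similar to a fence that
    # never signals: without the deadline this loop would spin forever.
    print(wait_with_yield(lambda: False, timeout_s=0.2))
```

A loop like this without the deadline check would match the symptoms described here: constant CPU usage, nothing but sched_yield() calls, and no progress.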
ze_image_copy, ze_nano and ze_pingpong work fine. ze_bandwidth gets slower and slower, and I did not wait for it to complete.
@eero-t: could you check whether it is just taking a long time? Please execute with a reduced number of iterations: -i 5
With -i 2, "Global memory bandwidth" numbers were output at 1s intervals, "Half Precision Compute" numbers at 2s intervals, "Single Precision Compute" numbers at 5-10s intervals, and "Integer Compute" numbers at 10-20s intervals.
In total it took 3.5 mins with -i 2, and 4.4 mins with -i 5.
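As a rough sanity check on those two timings, one can fit a straight line through them. This is only a sketch: it assumes run time grows linearly with the iteration count, which two data points cannot confirm, and the helper name `fit_linear` is made up for this example.

```python
def fit_linear(p1, p2):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# (iterations, total minutes) as reported above.
per_iter_min, fixed_min = fit_linear((2, 3.5), (5, 4.4))

print(f"per-iteration cost: {per_iter_min:.2f} min")  # 0.30 min
print(f"fixed overhead:     {fixed_min:.2f} min")     # 2.90 min
```

Under that assumption most of the run time is fixed cost rather than per-iteration cost, so a larger iteration count alone would not obviously explain a run that never finishes.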
What's the default iteration count? With that, I see this in dmesg:
[271298.886789] Fence expiration time out i915-0000:03:00.0:ze_peak[226719]:788!
[271298.887157] Fence expiration time out i915-0000:03:00.0:ze_peak[226719]:786!
Which may explain why it freezes.
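To correlate such messages with a particular run, the dmesg lines can be split into fields. The sketch below is based only on the two lines quoted above; the field names are my own labels, and the meaning of the trailing number is an assumption (it is kept as a string for that reason).

```python
import re

# Pattern derived from the two dmesg lines in this report; other kernel
# versions may format the message differently.
FENCE_RE = re.compile(
    r"\[\s*(?P<timestamp>\d+\.\d+)\]\s+Fence expiration time out\s+"
    r"(?P<driver>\w+)-(?P<pci>[0-9a-fA-F:.]+?):"
    r"(?P<comm>[^\[:]+)\[(?P<pid>\d+)\]:(?P<fence_id>[0-9a-fA-F]+)!"
)


def parse_fence_timeouts(lines):
    """Return one dict per 'Fence expiration time out' line found."""
    events = []
    for line in lines:
        match = FENCE_RE.search(line)
        if match:
            event = match.groupdict()
            event["pid"] = int(event["pid"])
            events.append(event)
    return events


if __name__ == "__main__":
    sample = [
        "[271298.886789] Fence expiration time out i915-0000:03:00.0:ze_peak[226719]:788!",
        "[271298.887157] Fence expiration time out i915-0000:03:00.0:ze_peak[226719]:786!",
    ]
    for event in parse_fence_timeouts(sample):
        print(event["comm"], event["pid"], event["pci"], event["fence_id"])
```

Comparing the extracted PID with the PID of the hung process confirms whether the timeouts belong to the frozen ze_peak run.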
With the default iteration count, there are no numbers shown for "Single Precision Compute" even after 40 mins, so I think that test is really frozen, especially as the numbers for the two earlier categories came with only a few seconds of delay.
The benchmark code may be missing some error checks, and warnings for those errors (e.g. when to skip a given test).
As to ze_bandwidth, that finished in a bit over 4 mins with its default options, so it is fine.
PS. Why do both of these GPU benchmarking programs constantly take 100% CPU, and need to allocate 32TB of virtual memory?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
226471 root 20 0 32,0t 55040 40752 R 100,3 0,3 1:56.89 ze_bandwidth
...
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
226719 root 20 0 32,0t 2,0g 71788 R 100,0 12,7 0:13.69 ze_peak
Latest ze_peak is still freezing in the "Single Precision" test with the following stack:
- kernel: drm-tip 6.1.0-rc5
- GuC FW: 70.5.1
- GMMlib: intel-gmmlib-22.3.1
- SPIRV-SDK: sdk-1.3.231.1/sdk-1.3.231.1 (headers/tools)
- SPIRV-LLVM: libllvmspirvlib-12-dev:amd64:12.0.0-3 (Ubuntu package)
- OpenCL-Clang: libopencl-clang-12-dev:amd64:12.0.0-3 (Ubuntu package)
- VC-intrinsics: v0.9.0
- Graphics Compiler: igc-1.0.12662.1 (IGC)
- Level-Zero API: v1.8.8
- compute-runtime: 22.43.24558
There are again these kernel driver warnings:
[859809.534534] Fence expiration time out i915-0000:03:00.0:ze_peak[438677]:788!
[859809.534952] Fence expiration time out i915-0000:03:00.0:ze_peak[438677]:786!
strace -f -p $(pidof ze_peak) shows it doing nothing but sched_yield() system calls.
perf shows its 100% CPU usage going to:
Overhead Command Shared Object Symbol
7,68% ze_peak libze_intel_gpu.so.1.3.0 [.] 0x000000000023d714
6,44% swapper [kernel.kallsyms] [k] mwait_idle_with_hints.constprop.0
3,17% ze_peak [kernel.kallsyms] [k] check_preemption_disabled
3,12% ze_peak [kernel.kallsyms] [k] preempt_count_add
2,82% ze_peak [kernel.kallsyms] [k] __schedule
2,66% ze_peak libc.so.6 [.] _help
2,56% ze_peak [vdso] [.] __vdso_clock_gettime
2,17% ze_peak [kernel.kallsyms] [k] _raw_spin_lock
2,04% ze_peak [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
1,91% ze_peak [kernel.kallsyms] [k] update_curr
1,83% ze_peak [kernel.kallsyms] [k] preempt_count_sub
1,83% ze_peak [kernel.kallsyms] [k] pick_next_task_fair
1,46% ze_peak [kernel.kallsyms] [k] sched_clock
1,44% ze_peak [kernel.kallsyms] [k] rcu_note_context_switch
1,39% ze_peak [kernel.kallsyms] [k] __entry_text_start
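To see where the time goes in aggregate, the rows above can be summed per command and shared object. This is a small sketch with a hypothetical helper, `overhead_by_object`; it assumes whitespace-separated columns and the comma decimal separator seen in this output.

```python
from collections import defaultdict


def overhead_by_object(report_lines):
    """Sum perf overhead percentages per (command, shared object)."""
    totals = defaultdict(float)
    for line in report_lines:
        fields = line.split()
        if len(fields) < 3 or not fields[0].endswith("%"):
            continue  # skip the header and any blank lines
        # The report uses a comma as decimal separator (locale dependent).
        percent = float(fields[0].rstrip("%").replace(",", "."))
        totals[(fields[1], fields[2])] += percent
    return dict(totals)


if __name__ == "__main__":
    rows = """\
Overhead Command Shared Object Symbol
7,68% ze_peak libze_intel_gpu.so.1.3.0 [.] 0x000000000023d714
6,44% swapper [kernel.kallsyms] [k] mwait_idle_with_hints.constprop.0
3,17% ze_peak [kernel.kallsyms] [k] check_preemption_disabled
3,12% ze_peak [kernel.kallsyms] [k] preempt_count_add
2,82% ze_peak [kernel.kallsyms] [k] __schedule
2,66% ze_peak libc.so.6 [.] _help
2,56% ze_peak [vdso] [.] __vdso_clock_gettime
2,17% ze_peak [kernel.kallsyms] [k] _raw_spin_lock
""".splitlines()
    totals = overhead_by_object(rows)
    for key, percent in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{percent:6.2f}%  {key[0]}  {key[1]}")
```

Applied to all of the rows listed above, the ze_peak entries under [kernel.kallsyms] add up to about 23%, versus 7.68% for the single libze_intel_gpu.so symbol. That only covers the rows shown here, but it fits a process that spends much of its time entering the scheduler through sched_yield().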