[Issue]: 100% GPU usage and high power draw when creating multiple HW queues with MES on RDNA3
Follow-up to https://github.com/RadeonOpenCompute/ROCm/issues/2625, since further debugging revealed it to be an amdgpu/amdkfd driver issue.
When running llama.cpp's server example on ROCm with an RDNA3 GPU, GPU usage is reported as 100% and high power consumption is measured at the wall outlet, even with the server idle.
Investigating further, the issue appears to be related to HIP stream usage: GPU usage first shoots up to a persistent 100% when llama.cpp tries to create its second HIP stream. If I limit llama.cpp to a single stream, GPU load behaves normally until it begins writing into GPU memory using hipMemcpy or hipMemset, at which point it jumps to 100% (EDIT: for the lifetime of the process that performed the hipMemcpy) and stays there until llama.cpp is closed.
In minimal testcases, the following scenarios all yielded 100% GPU usage, despite never executing any user code on the GPU (see the sketch after this list):
- Creating a HIP stream while another HIP stream is open. Once triggered, closing the HIP streams doesn't help. (If I open a stream and close it, then open another one, with no overlap in time between the 2 streams, the issue isn't seen.)
- Writing to GPU memory while a HIP stream is open. Once triggered, neither closing the HIP stream nor deallocating the memory previously written will cause the GPU load to come down, only killing the process helps. (If I close the stream before writing to GPU memory, the issue isn't seen, even if that memory was allocated before or during the stream's lifetime.)
- Creating a HIP stream after any GPU memory write has taken place, even if the previously written memory is freed before the stream is created. Once triggered, closing the HIP stream doesn't help.
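For illustration, here is a minimal sketch of the first scenario, in the spirit of the repro programs in the linked issue (it is not the exact repro1.cpp; the error macro and the 60-second delay are my own):

```cpp
// Sketch: create a second HIP stream while the first is still open, then
// idle so the spurious 100% GPU load can be observed in rocm-smi.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <unistd.h>

#define HIP_CHECK(expr)                                           \
    do {                                                          \
        hipError_t err_ = (expr);                                 \
        if (err_ != hipSuccess) {                                 \
            std::fprintf(stderr, "HIP error %d at line %d\n",     \
                         (int)err_, __LINE__);                    \
            return 1;                                             \
        }                                                         \
    } while (0)

int main() {
    hipStream_t s1, s2;
    HIP_CHECK(hipStreamCreate(&s1));
    HIP_CHECK(hipStreamCreate(&s2)); // overlap: GPU load jumps to 100% here
    HIP_CHECK(hipStreamDestroy(s2));
    HIP_CHECK(hipStreamDestroy(s1)); // destroying both streams doesn't help
    std::puts("streams destroyed; watch rocm-smi for the next 60 s");
    sleep(60); // the load stays pegged until the process exits
    return 0;
}
```

Built with `hipcc repro_sketch.cpp -o repro_sketch`; while it sleeps, `rocm-smi` shows the GPU pinned at 100% even though both streams have been destroyed.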
The issue is reproducible with the latest code in this repository, using ROCm 5.7.1 on a Radeon RX 7900 XT, and also on a Radeon RX 7800 XT. Setting the module option `sched_policy=2` seems to be a viable workaround, at the cost of slightly higher power consumption when the GPU is fully idle and of the local ttyX consoles becoming laggy. (`sched_policy=1` didn't help.)
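For anyone wanting to try that workaround, a sketch of how the module option can be set persistently (the modprobe.d file name is an assumption; adding `amdgpu.sched_policy=2` to the kernel command line achieves the same, and the initramfs may need regenerating before reboot):

```
# /etc/modprobe.d/amdgpu.conf
options amdgpu sched_policy=2
```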
Debugging this further, it seems that the excessive power usage starts when the offending operation (memory write or stream creation) creates a new HW queue. On RDNA3, this always goes through MES, even when mes=0 is specified in the module parameters.
Within the MES code, mes_v11_0_add_hw_queue calls mes_v11_0_submit_pkt_and_poll_completion, which in turn calls amdgpu_ring_commit. As soon as amdgpu_ring_commit returns, GPU usage spikes to 100% and remains there, drawing about 100 W of excess power.
Minimal testcases are available in https://github.com/RadeonOpenCompute/ROCm/issues/2625.
Thanks for your report. We've got someone setting up a system internally to reproduce the issue and try to isolate where this GPU usage bug is coming from.
Are there any updates? Were you successful in reproducing this?
I'll try to get an update from the dev who was assigned to repro it.
EDIT: The dev assigned to it was a bit backlogged but he's getting on reproducing it this week. Thanks for your patience!
So our dev couldn't repro it with the following config:
- OS: Ubuntu 22.04, kernel 6.2.0-37-generic
- GPU: Radeon PRO W7900 (this is a Navi31 GPU)
- Driver: ROCm 5.7.1, 6.2.4-1664922.22.04

He saw the 100% spikes but the usage went down again afterwards. Can you get the VBIOS for your card? `rocm-smi --showvbios` should be sufficient, or `dmesg | grep ATOM`.
The 100% GPU usage does go back down once the test program exits completely - but only then; that is the issue. It can be made more visible by extending the delay at the end of my minimal testcases to e.g. a minute, which produces a full minute of inappropriate 100% GPU load.
rocm-smi --showvbios from system 1 (RX 7900 XT):
========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0] : VBIOS version: 113-D70401-00
====================================================================================
=============================== End of ROCm SMI Log ================================
rocm-smi --showvbios from system 2 (RX 7800 XT):
========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0] : VBIOS version: 113-APM6767CL-100
====================================================================================
=============================== End of ROCm SMI Log ================================
Any updates on this? I am observing the same issue with llama.cpp on my RX 7900 XTX.
I'm adding my `rocm-smi --showvbios` output as well; maybe it helps. GPU0 is the 7900 XTX, GPU1 is a 6750 XT (there is no issue with this one).
========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0] : VBIOS version: 113-3E4710U-O4X
GPU[1] : VBIOS version: 113-67KA6SHD1-X01
====================================================================================
=============================== End of ROCm SMI Log ================================
> So our dev couldn't repro it with the following config:
> He saw the 100% spikes but the usage went down again afterwards.

So it sounds like you can reproduce it? I can also reproduce this issue with a newer kernel on Arch Linux (Linux 6.6.1-arch1-1) on a 7900 XT (VBIOS version: 113-D70401-00).
Just to be clear, the issue is that calling `hipStreamCreate` shouldn't max out the GPU usage. Even after calling `hipStreamDestroy`, the maxed-out usage continues. The only thing that allows the GPU to return to idle is closing the program.
I can reproduce this as well, on an ASRock 7900 XTX (reference design). I used repro1.cpp from the other thread, and I can clearly see in `rocm-smi` that the GPU usage remains at 100% after the program outputs "HIP stream destroyed, waiting 5 more secods". Only when the program finishes does the GPU usage go down again.
rocm-smi --showvbios
======================= ROCm System Management Interface =======================
==================================== VBIOS =====================================
GPU[0] : VBIOS version: 113-D7020100-102
================================================================================
============================= End of ROCm SMI Log ==============================
I now also tried the minimal testcases, both with both of my GPUs (6750 XT, 7900 XTX) in the system and with only the 7900 XTX.
I was not really able to reproduce the issue with the minimal testcase when both GPUs are in the system. @Googulator I tried HIP_VISIBLE_DEVICES=<gpu id> to select the device; I'm not sure if that works with the minimal example.
When I removed the 6750 XT, I was able to reproduce the issue with repro1.cpp.
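For reference, HIP_VISIBLE_DEVICES is the HIP runtime's device-filtering variable, so an invocation along these lines should apply to the minimal testcase as well (the binary name here is assumed):

```
# Restrict the process to the first enumerated GPU, then run the repro:
HIP_VISIBLE_DEVICES=0 ./repro1
```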
Additionally, I observed that sometimes (I could not reproduce it reliably) the GPU usage stays at 100% even after the process exits; only after a reboot does the usage go back down.
@Ori-Messinger Any progress on this one? I know you've got a bunch of issues to repro ATM
I also face the same issue on Linux archlinux-pc 6.1.67-1-lts #1 SMP PREEMPT_DYNAMIC Mon, 11 Dec 2023 12:58:39 +0000 x86_64 GNU/Linux, with:
>>> rocm-smi --showvbios
========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0] : VBIOS version: 115-C994PI0-102
====================================================================================
=============================== End of ROCm SMI Log ================================
>>> rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 3 2200G with Radeon Vega Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 3 2200G with Radeon Vega Graphics
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3500
BDFID: 0
Internal Node ID: 0
Compute Unit: 4
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 16305612(0xf8cdcc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 16305612(0xf8cdcc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16305612(0xf8cdcc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx803
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 560 Series
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26607(0x67ef)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1176
BDFID: 256
Internal Node ID: 1
Compute Unit: 14
SIMDs per CU: 4
Shader Engines: 2
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 730
SDMA engine uCode:: 58
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 4194304(0x400000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS:
Size: 4194304(0x400000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx803
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
The problem disappears when using the older linux-lts kernel.
Also, the diff between the rocminfo output on the two kernels looks like this:
11c11
< DMAbuf Support: YES
---
> DMAbuf Support: NO
49c49
< Size: 16305612(0xf8cdcc) KB
---
> Size: 16306580(0xf8d194) KB
56c56
< Size: 16305612(0xf8cdcc) KB
---
> Size: 16306580(0xf8d194) KB
63c63
< Size: 16305612(0xf8cdcc) KB
---
> Size: 16306580(0xf8d194) KB
llama.cpp sitting idle draws ~140W on a 7900 XTX ... this is unacceptable.
As a side note, power limiting overall is way too convoluted (I couldn't even set it up) compared to NVIDIA, where you just do `nvidia-smi -pl <watts>`.
This also affects W7900 cards on ROCm 6.0, kernel 6.6.8. A solution is desirable to avoid burning electricity and creating heat for no reason.
99W draw with nothing in VRAM and inference not running. radeontop reports Graphics Pipe, Clip Rectangle, and Shader Clock at 100%.
EDIT: Filed https://gitlab.freedesktop.org/drm/amd/-/issues/3080 to ensure this is also on the radar there.
I've found that, at least for llama.cpp, setting GPU_MAX_HW_QUEUES=1 in the environment works around this issue with no clear performance impact but a substantial power/thermal improvement. I still think this is a major issue that should be fixed without resorting to obscure environment variables.
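For example, an invocation along these lines (the server binary name and flags are illustrative; adjust for your llama.cpp build):

```
# Limit HIP to a single hardware queue for this process only:
GPU_MAX_HW_QUEUES=1 ./server -m ./models/model.gguf
```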
Is this commit intended to be a fix for this issue?
https://gitlab.freedesktop.org/agd5f/linux/-/commit/7e505b272c7adb68c5353944eda4befb95e83935
I haven't been able to find it in this repository, only on the Freedesktop one.
We're not sure that the patch will fix it, but it missed the ROCm 6.0 cutoff, hence its absence here. It'll be in ROCm 6.1 (unless they pick a really weird branching point). It can also be applied manually by editing the file under /usr/src/ and then rebuilding via DKMS, if you want to give it a shot.
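A rough sketch of that manual procedure, assuming a standard DKMS-based install (the module name and version vary per system; check `dkms status` for the actual values):

```
# Identify the installed amdgpu DKMS module and version:
dkms status
# Apply the patch to the sources under /usr/src/amdgpu-<version>/,
# then rebuild and reinstall the module:
sudo dkms remove amdgpu/<version> --all
sudo dkms install amdgpu/<version>
sudo reboot
```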
What is the actual development repository for the kernel driver then, if not this one? The freedesktop one appears to be a staging repository for patches ready for upstreaming, not actual development.
We're working on that currently.
Right now, the upstream repo (maintained by Alex Deucher) is for the upstream kernel, which is where most of our patches come from. The DKMS code (which is exposed here) is not upstreamable, so we've got KCL (Kernel Compatibility Layer), IPC and RDMA here but not upstream. What happens is that the patches going into amd-staging-drm-next are picked over to this DKMS-supported branch by the KCL team. They adapt all patches to work on the various kernels that we support. Then that internal branch is what's used for ROCm releases, so we can support more OSes than just Ubuntu (or whatever distro supports the latest upstream kernel)
Currently we (more accurately, I) just update the master branch to reflect the latest ROCm release branch at release time, with no real develop branch to speak of. This is primarily because development on the DKMS branch is still done internally. I'm working on seeing if we can at least mirror the internal mainline branch here on some sort of weekly cadence, instead of only dealing with updates at release time, but the process is taking a while. Lots of hoops to jump through.
@65a Thanks, it works.
We've got an RLC FW fix coming in ROCm 6.2 that should also work to address this issue.