MangoHud GPU Usage stuck at 0-1%

When launching an application, the reported GPU usage briefly spikes to >100%, then gets stuck at 0-1% while other monitoring tools report a higher percentage.

This issue does not reproduce on MangoHud 0.6.6.

Using Vulkan Examples' pbribl example:

radeontop -c:

nvtop:

sudo umr -O use_color -t:

hexdump -C /sys/class/drm/card*/device/gpu_metrics:

00000000  80 00 02 02 b5 18 57 17  12 16 89 17 5d 16 25 17  |......W.....].%.|
00000010  12 16 bb 17 12 16 d4 17  ed 17 8f 16 47 22 00 00  |............G"..|
00000020  70 d2 24 1e d1 03 00 00  12 00 64 34 43 03 ff ff  |p.$.......d4C...|
00000030  87 00 21 00 01 04 07 00  20 00 fc 00 ef 00 63 00  |..!..... .....c.|
00000040  d1 05 bd 01 ff ff 20 03  90 01 ff ff cf 05 bd 01  |...... .........|
00000050  06 00 20 03 90 01 90 01  78 05 78 05 78 05 78 05  |.. .....x.x.x.x.|
00000060  78 05 78 05 78 05 78 05  78 05 78 05 06 00 00 00  |x.x.x.x.x.x.....|
00000070  00 00 ff ff ff ff ff ff  60 00 00 00 00 00 00 00  |........`.......|
00000080

hexdump -C /sys/class/drm/card0/device/gpu_busy_percent:

00000000  38 37 0a                                          |87.|
00000003
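
As a cross-check of the two dumps above, the raw activity field can be decoded by hand. This is a minimal sketch, assuming the `gpu_metrics` v2.x layout in which `average_gfx_activity` is a little-endian 16-bit value at byte offset 0x1c. That offset is an assumption based on the dump, not something stated in this report, and `read_le16` and `row_0x10` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: reads a little-endian 16-bit value at byte offset `off`.
constexpr std::uint16_t read_le16(const unsigned char* p, std::size_t off) {
    return static_cast<std::uint16_t>(p[off] | (p[off + 1] << 8));
}

// Bytes 0x10..0x1f of the gpu_metrics dump above, copied verbatim.
constexpr unsigned char row_0x10[16] = {
    0x12, 0x16, 0xbb, 0x17, 0x12, 0x16, 0xd4, 0x17,
    0xed, 0x17, 0x8f, 0x16, 0x47, 0x22, 0x00, 0x00,
};

// ASSUMPTION: average_gfx_activity sits at file offset 0x1c (index 12 in this row).
constexpr std::uint16_t gfx_activity = read_le16(row_0x10, 0x1c - 0x10);

static_assert(gfx_activity == 8775, "bytes 47 22 decode to 0x2247");
static_assert(gfx_activity / 100 == 87, "centipercent to whole percent");
```

Under that assumption the field holds 8775, i.e. 87.75% if the unit is centipercent, which agrees with the 87 read from gpu_busy_percent.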

Side-by-Side:

System Information:

CPU: 16x AMD Ryzen 7 PRO 4750U with Radeon Graphics
RAM: 32GB 3200MHz DDR4
GPU: AMD RENOIR (LLVM 13.0.1, DRM 3.44, 5.17.5-arch1-1) / AMD RADV RENOIR
Kernel Version: 5.17.5-arch1-1
Driver: Mesa 22.0.2
MangoHud Version: 0.6.7

May 05 '22 03:05 Derppening

I also ran a git bisect between 5349226 and v0.6.7. Logs are attached below:

git bisect start
# good: [5349226fa50f98c7d3328258112f48865b96cddb] amdgpu: average load over .5s
git bisect good 5349226fa50f98c7d3328258112f48865b96cddb
# bad: [663bbd05a60c7d1e3fd352fdd8c55e96bd8af0f2] Bump to 0.6.7
git bisect bad 663bbd05a60c7d1e3fd352fdd8c55e96bd8af0f2
# bad: [f9cfdeb0804779a9957bcf956b9dfc63956a23b4] Add gpu throttling status
git bisect bad f9cfdeb0804779a9957bcf956b9dfc63956a23b4
# good: [350dca5d2196c166d090fc783a6d8da607fe789e] Dynamic width when fps_only
git bisect good 350dca5d2196c166d090fc783a6d8da607fe789e
# bad: [ae85730448f3ac7c895e5669f48aab032abb3040] Improve amdgpu polling
git bisect bad ae85730448f3ac7c895e5669f48aab032abb3040
# first bad commit: [ae85730448f3ac7c895e5669f48aab032abb3040] Improve amdgpu polling

(I started with 5349226 because previous commits were broken by #731)

May 05 '22 04:05 Derppening

Can you apply this patch to latest and post the terminal output here?

index 911d931..c7535b0 100644
--- a/src/amdgpu.cpp
+++ b/src/amdgpu.cpp
@@ -148,6 +148,7 @@ void amdgpu_metrics_polling_thread() {
 
 			// Detect and fix if the gpu load is reported in centipercent
 			if (gpu_load_needs_dividing || metrics_buffer[cur_sample_id].gpu_load_percent > 100){
+				printf("AMDGPU load assuming centipercent because we recieved: %i\n", metrics_buffer[cur_sample_id].gpu_load_percent);
 				gpu_load_needs_dividing = true;
 				metrics_buffer[cur_sample_id].gpu_load_percent /= 100;
 			}
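
The branch this patch instruments can be read as a sticky unit-detection heuristic: once any sample exceeds 100, every later sample is treated as centipercent. Below is a minimal sketch of that logic as a pure function; `LoadSample` and `normalize_load` are hypothetical names, and the real code mutates `metrics_buffer` and a flag in place instead of returning a value.

```cpp
#include <cstdint>

// Hypothetical result type: the normalized load plus the updated sticky flag.
struct LoadSample {
    unsigned percent;
    bool needs_dividing;
};

// Mirrors the condition in the diff: divide by 100 if the flag is already set
// or if this sample is above 100, and keep the flag set from then on.
constexpr LoadSample normalize_load(std::uint16_t raw, bool needs_dividing) {
    return (needs_dividing || raw > 100)
        ? LoadSample{raw / 100u, true}
        : LoadSample{raw, false};
}

static_assert(normalize_load(87, false).percent == 87, "plain percent passes through");
static_assert(normalize_load(8775, false).percent == 87, "centipercent is divided");
static_assert(normalize_load(50, true).percent == 0, "flag is sticky: 50 becomes 0");
```

The last assertion shows the failure mode of the heuristic: if the flag is ever set by a spurious value above 100, later readings that are already in whole percent are divided as well and show up as 0 or 1.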

May 05 '22 07:05 flightlessmango

Here are the logs over a 30-second-ish time period:

mangohud.log

May 05 '22 13:05 Derppening

I believe this patch should fix the issue, can you confirm?

index 911d931..f2f035f 100644
--- a/src/amdgpu.cpp
+++ b/src/amdgpu.cpp
@@ -16,8 +16,8 @@ std::string metrics_path = "";
  */
 struct amdgpu_common_metrics {
 	/* Load level: averaged across the sampling period */
-	uint8_t gpu_load_percent;
-	// uint8_t mem_load_percent;
+	uint16_t gpu_load_percent;
+	// uint16_t mem_load_percent;
 
 	/* Power usage: averaged across the sampling period */
 	float average_gfx_power_w;
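
Why the field width matters: a 16-bit reading stored in a `uint8_t` keeps only its low 8 bits. The sketch below uses 8775 as the example reading (bytes `47 22` in the gpu_metrics dump earlier in this thread, assuming that is the activity field in centipercent). The helper names are hypothetical and are not part of MangoHud.

```cpp
#include <cstdint>

// Storing a 16-bit reading in a uint8_t field keeps only the low 8 bits.
constexpr std::uint8_t store_in_u8(std::uint16_t raw) {
    return static_cast<std::uint8_t>(raw);
}

// Load shown if the truncated value then goes through the "/ 100" branch.
constexpr unsigned shown_with_u8(std::uint16_t raw) {
    return store_in_u8(raw) / 100u;
}

// Load shown if the full 16-bit value goes through the same branch.
constexpr unsigned shown_with_u16(std::uint16_t raw) {
    return raw / 100u;
}

static_assert(store_in_u8(8775) == 71, "8775 mod 256 is 71");
static_assert(shown_with_u8(8775) == 0, "truncated value divides down to 0");
static_assert(shown_with_u16(8775) == 87, "full value divides down to 87");
```

With the narrow field, a truncated sample only occasionally exceeds 100, but a single such sample is enough to set `gpu_load_needs_dividing`; after that, values like 71 are divided down to 0. This would be consistent with the brief spike followed by a reading stuck at 0-1%, although that chain of events is an inference rather than something confirmed in this thread.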

May 05 '22 14:05 flightlessmango

Can confirm this is happening to me.

May 05 '22 22:05 Etaash-mathamsetty

I believe this patch should fix the issue, can you confirm?

Yes, this patch indeed fixes the issue.

Sidenote: I still observe that when first launching the application, the GPU load percentage will go from 0 -> 3000 -> 0 -> 50 -> 75 -> ... until it reaches the actual GPU load percentage (~87% for the demo). Should I file another bug report for that issue?

May 06 '22 03:05 Derppening

I've pushed a new commit that should address that issue as well, can you confirm?

May 06 '22 08:05 flightlessmango

~~Yes, I can confirm that this issue is fixed. Thanks!~~

Edit: I realized I tested this on a system which doesn't have this issue, so I will need a bit more time to try it again. Sorry!

May 06 '22 15:05 Derppening

OK, I just tried it, and it works fine. I didn't test it extensively; I just loaded into the Rocket League main screen and saw GPU usage at around 100% with the new patch, vs. 0% with the old one.

May 06 '22 19:05 Etaash-mathamsetty

I've pushed a new commit that should address that issue as well, can you confirm?

I can confirm that both the incorrect GPU load and the initial GPU load spike are fixed. Thanks!

Sidenote (again): I see that MangoHud takes around 5-8 measurements, each higher than the last, before it stabilizes at the current GPU load. Is that normal?

May 07 '22 13:05 Derppening

One measurement is taken every 0.5 seconds? I just see one incorrect value, and then it's correct.

May 11 '22 14:05 flightlessmango

What I mean is, it gradually ramps up like: 0, 49, 58, 72, 79, 83, 86, 87, 88 before it stays at ~88% utilization. In other words, the GPU utilization metric appears to take around 3 seconds to ramp up to the expected percentage.

May 18 '22 09:05 Derppening
