powerupp
6800 crashing after applying DPM3 settings
After getting upp working I ended up finding this project and branch.
I am trying this with https://aur.archlinux.org/packages/powerupp-git/ by modifying the branch to bignavi.
The GUI opens up just fine and I can load active settings. But if I click "Apply current" I get the following errors from the kernel, and the settings don't appear to change (reading from /sys/kernel/debug/dri/$index/amdgpu_pm_info).
Dec 20 19:06:26 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: smu driver if version = 0x00000034, smu fw if version = 0x0000003b, smu fw version = 0x003a3100 (58.49.0)
Dec 20 19:06:26 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: SMU driver if version not matched
Dec 20 19:06:26 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: use vbios provided pptable
Dec 20 19:06:28 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: failed send message: TransferTableDram2Smu (19) param: 0x00000000 response 0xffffffc2
Dec 20 19:06:28 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to transfer pptable to SMC!
Dec 20 19:06:28 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to setup smc hw!
Dec 20 19:06:28 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: smu reset failed, ret = -62
If I launch a game, I get a freeze fairly quickly.
Kernel in use
Linux quasd 5.9.14-1-ck #1 SMP PREEMPT Fri, 18 Dec 2020 06:58:44 +0000 x86_64 GNU/Linux
linux-firmware in use
linux-firmware-git 20201130.7455a36-1
Is this a problem of running on 5.9?
I also tried lowering the mem clock by 1 MHz, which resulted in the following:
Dec 20 19:23:25 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: smu driver if version = 0x00000034, smu fw if version = 0x0000003b, smu fw version = 0x003a3100 (58.49.0)
Dec 20 19:23:25 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: SMU driver if version not matched
Dec 20 19:23:25 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: use vbios provided pptable
Dec 20 19:23:25 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: SMU is initialized successfully!
Probably not a big enough change to trigger anything.
Interesting to see some 6800 testing! I am not the maintainer of the Arch AUR package, which is a bit outdated, and support for the 6000 series is only available in this experimental branch. It also requires the latest version of UPP (not available on pip yet).
If you have already installed UPP via pip you could do a quick-and-hacky update by overwriting the files in the upp lib folder (for example ~/.local/lib/python3.8/site-packages/upp) with the latest files from GitHub; otherwise install it from source and make sure the upp command is available and runs the latest version. For powerupp, download the bignavi branch and run make && sudo make install.
Edit: sorry, I didn't read your initial post properly. Is UPP also from the current git repo?
Hello
and also did a bit more testing. Only the mem settings seem to be the problem. I was able to use the static voltage, Graphics card power (seems to be limited to 257, more testing to do) and Gfx clock frequency.
Sounds promising! I don't know if you attempted to set the DPM 0 frequency, that will likely cause the GPU to freak out. I would expect at least the DPM 3 frequency to be adjustable though.
I was doing it to DPM 3.
Here is how it looks to me.
Notes this far:
- Graphic card power seems to be bugged, as long as I set it to maximum allowed, I can keep increasing it. I think the default is 233, and in the screenshot it is 439.
- Can't increase gpu mhz past 2475 ( will drop down to 2d clocks )
- With above settings I am pretty much locked to 2450 mhz
is UPP also from the current git repo?
Technically it's from my own branch (based on master); I only added one comma though.
More testing done and things seem promising. I was able to confirm that the oc works.
- Just simple testing and staring at the wall. 10fps increase from stock.
- The power limit seems to be still locked to 233 even though I can increase it endlessly
- Biggest limitation is Gfx clock frequency, if this can't be changed there won't be much point putting this card on water :(
I guess the next step for my oc would be to flash a 6800 XT bios or something.
Anything you would like me to test?
Many thanks for the report! As for the power limit, it's been tricky to set with the navi 10 cards as well, only working well with certain firmware/kernel version combinations. Are the changes that you make reflected in cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap?
It's a pity if the Gfx clock is not possible to adjust; unfortunately it is not unlikely that the card is "hard limited" like the 5600 XT was.
Many thanks for the report! As for the power limit it's been tricky to set with the navi 10 cards as well, only working well with certain firmware/kernel version combinations. Are the changes that you make reflected in cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap?
it is reflected there too.
cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap
360000000
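For reference, hwmon reports power1_cap in microwatts, so the 360000000 above corresponds to 360 W. A minimal sketch of the conversion follows; the function name uw_to_watts is made up for illustration and the value is hard-coded from the output above rather than read from sysfs.

```shell
#!/bin/sh
# Hypothetical helper: convert a hwmon power value (microwatts) to whole watts.
uw_to_watts() {
    echo $(( $1 / 1000000 ))
}

# Value copied from the power1_cap output quoted in this thread.
uw_to_watts 360000000   # prints 360
```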
And regarding the memory settings, I tried again. It still crashes after a few seconds, without any settings actually changing (at least according to powerupp and /sys/kernel/debug/dri/$index/amdgpu_pm_info). I tried both lowering and increasing the DPM3 clock frequency; both seem to trigger this.
Can you try to set upp set smc_pptable/DcModeMaxFreq/0=2500 smc_pptable/DcModeMaxFreq/2=1050 --write and see if that makes any difference to setting the Gfx and DPM 3 frequencies (should raise the limits used by OD to 2500/1050 MHz)? You could also see if turning the amdgpu.ppfeaturemask=0xffffffff boot flag on or off makes any difference.
Do you know if and to what extent the memory is possible to adjust in Windows/Radeon Software?
Also try to set only the DPM 3 clock (1020 MHz in the example below) using UPP to confirm that the same thing happens and that it's not something buggy in powerupp: upp set smc_pptable/FreqTableUclk/3=1020 --write.
Hello
I already have the featuremask.
[root@quasd ~]# grep -o amdgpu.ppfeaturemask=0xffffffff /proc/cmdline
amdgpu.ppfeaturemask=0xffffffff
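The same check can be wrapped in a small function that returns an exit status instead of printing the match. This is only a sketch; has_kernel_param is a made-up name, and the optional second argument exists so the function can be pointed at a saved copy of the command line.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel parameter appears as a
# whole word in a command-line file (defaults to /proc/cmdline).
has_kernel_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    # -q: quiet, -w: whole-word match, -F: fixed string, --: end of options
    grep -qwF -- "$param" "$cmdline_file"
}
```

Calling has_kernel_param amdgpu.ppfeaturemask=0xffffffff then succeeds only when the flag is present on the running kernel's command line.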
After adjusting the voltage offset to -115 and setting the power limit to 233 (I don't know the magic strings, so I have to do that from powerupp), the command below seems ineffective.
[root@quasd ~]# upp set smc_pptable/DcModeMaxFreq/0=2560 --write
Changing smc_pptable.DcModeMaxFreq.0 from 2460 to 2560 at 0x626
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
No errors, but it seems to be ineffective. The core clock is still stuck at 2450.
For the DPM3
upp set smc_pptable/DcModeMaxFreq/2=1050 --write
seems to be also ineffective.
Other notes
- Temps are around 80c, is this some magic point where it stops boosting?
- Power usage is jumping around 200 W
- Testing with overwatch and practice range
- Setting fan speed to 100% and waiting for the card to cool down and retrying allowed me to reach 2470
- edit: typos
Also try to set only the DPM 3 clock (1020 MHz in the example below) using UPP to confirm that the same thing happens and that it's not something buggy in powerupp: upp set smc_pptable/FreqTableUclk/3=1020 --write.
[root@quasd ~]# upp set smc_pptable/FreqTableUclk/3=1020 --write
Changing smc_pptable.FreqTableUclk.3 from 1000 to 1020 at 0x584
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
[root@quasd ~]#
Setting it twice in a row makes the following happen.
[root@quasd ~]# upp set smc_pptable/FreqTableUclk/3=1020 --write
Changing smc_pptable.FreqTableUclk.3 from 1020 to 1020 at 0x584
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
Traceback (most recent call last):
File "/usr/bin/upp", line 33, in <module>
sys.exit(load_entry_point('upp==0.0.7.post2', 'console_scripts', 'upp')())
File "/usr/lib/python3.9/site-packages/upp/upp.py", line 336, in main
cli(obj={})()
File "/usr/lib/python3.9/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python3.9/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "/usr/lib/python3.9/site-packages/upp/upp.py", line 318, in set
decode._write_pp_tables_file(pp_file, pp_bytes)
File "/usr/lib/python3.9/site-packages/upp/decode.py", line 47, in _write_pp_tables_file
f.close()
OSError: [Errno 62] Timer expired
[root@quasd ~]#
On the journalctl side the following happens. The line about fan speed keeps repeating indefinitely, about 4 times per second.
Dec 20 21:53:08 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: failed send message: TransferTableDram2Smu (19) param: 0x00000000 response 0xffffffc2
Dec 20 21:53:08 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to transfer pptable to SMC!
Dec 20 21:53:08 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to setup smc hw!
Dec 20 21:53:08 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: smu reset failed, ret = -62
Dec 20 21:53:08 quasd kernel: amdgpu: manual fan speed control should be enabled first
And amdgpu_pm_info changes to the following:
[root@quasd ~]# cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Clock Gating Flags Mask: 0x38118305
Graphics Medium Grain Clock Gating: On
Graphics Medium Grain memory Light Sleep: Off
Graphics Coarse Grain Clock Gating: On
Graphics Coarse Grain memory Light Sleep: Off
Graphics Coarse Grain Tree Shader Clock Gating: Off
Graphics Coarse Grain Tree Shader Light Sleep: Off
Graphics Command Processor Light Sleep: Off
Graphics Run List Controller Light Sleep: Off
Graphics 3D Coarse Grain Clock Gating: On
Graphics 3D Coarse Grain memory Light Sleep: Off
Memory Controller Light Sleep: On
Memory Controller Medium Grain Clock Gating: On
System Direct Memory Access Light Sleep: Off
System Direct Memory Access Medium Grain Clock Gating: Off
Bus Interface Medium Grain Clock Gating: Off
Bus Interface Light Sleep: Off
Unified Video Decoder Medium Grain Clock Gating: Off
Video Compression Engine Medium Grain Clock Gating: Off
Host Data Path Light Sleep: On
Host Data Path Medium Grain Clock Gating: On
Digital Right Management Medium Grain Clock Gating: Off
Digital Right Management Light Sleep: Off
Rom Medium Grain Clock Gating: Off
Data Fabric Medium Grain Clock Gating: Off
Address Translation Hub Medium Grain Clock Gating: On
Address Translation Hub Light Sleep: On
dpm not enabled
[root@quasd ~]#
All the clock, temperature etc. information is missing. The system also becomes unresponsive occasionally. edit: removed speculation about what might be the cause
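The two symptoms in this post (the SMU failure lines in the kernel log and "dpm not enabled" in amdgpu_pm_info) can be detected with a couple of grep wrappers. This is a rough sketch: the function names are made up, and the patterns are copied only from the messages quoted in this thread, so other failure modes will not be caught.

```shell
#!/bin/sh
# Hypothetical helpers; both read text on stdin so they can be fed from
# journalctl, dmesg, or a saved file.

# Succeeds if the kernel log contains one of the SMU failure lines quoted above.
log_has_smu_failure() {
    grep -Eq 'Failed to transfer pptable to SMC|smu reset failed'
}

# Succeeds unless the amdgpu_pm_info text reports "dpm not enabled".
dpm_enabled() {
    ! grep -q 'dpm not enabled'
}
```

For example, dmesg | log_has_smu_failure checks the kernel log, and dpm_enabled < /sys/kernel/debug/dri/0/amdgpu_pm_info (as root) checks the power management state.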
- Temps are around 80c, is this some magic point where it stops boosting?
The 6800 test file that I have has a fan target temperature of 80 degrees, not sure if it's the same model that you have but seems likely that they have the same target at least. You could try to increase it, but be careful not to overheat the card: upp set smc_pptable/FanTargetTemperature=85 --write
What monitor frequency are you running at? Dual monitors? Can you try to change the frequency and maybe set to single monitor and change connection (HDMI/DP) if possible and see if it makes any difference for the memory clock setting. The repeating error messages can be caused by periodic hwmon readings from powerupp (or another application).
Main screen 240hz 1440p, secondary monitor 60hz 1080p, both DP. Tried with only 1 screen over HDMI. Memory setting still didn't work. And setting memory more than 1 results in instability/freezes.
upp set smc_pptable/FreqTableUclk/3=1020 --write
edit: also I am now running kernel 5.10.1-1 with a few patches (hopefully irrelevant). And that didn't improve the situation.
"enable_additional_cpu_optimizations-$_gcc_more_v.tar.gz::https://github.com/graysky2/kernel_gcc_patch/archive/$_gcc_more_v.tar.gz"
0015-zfs.patch
"0001-futex-patches.patch::https://raw.githubusercontent.com/Frogging-Family/linux-tkg/master/linux59-tkg/linux59-tkg-patches/0007-v5.9-fsync.patch"
0001-ZEN-Add-sysctl-and-CONFIG-to-disallow-unprivileged-C.patch
0002-Bluetooth-Fix-LL-PRivacy-BLE-device-fails-to-connect.patch
0003-Bluetooth-Fix-attempting-to-set-RPA-timeout-when-uns.patch
0004-HID-quirks-Add-Apple-Magic-Trackpad-2-to-hid_have_sp.patch
upp set smc_pptable/FanTargetTemperature=85 --write
Sadly, even with this, the max I can get is 2475 MHz
root@quasd ~# upp set smc_pptable/DcModeMaxFreq/0=2550 --write
Changing smc_pptable.DcModeMaxFreq.0 from 2475 to 2550 at 0x626
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
root@quasd ~# upp set smc_pptable/FanTargetTemperature=90 --write
Changing smc_pptable.FanTargetTemperature from 80 to 90 at 0x720
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
root@quasd ~#
Setting the mem clock on my RX5700 is also very flaky; it would only accept certain values and crash with most others. Often, a difference of just one MHz would cause the card to crash (mostly unrecoverable, needs a HW reset), and it does not matter if you increase or decrease the clock. It also does not matter if you use upp (pp_table interface) or radeon-clocks (kernel sysfs API) to change these clocks; it is something in firmware/SMU/RAM timings that makes it crash.
I had to determine a certain set of "safe" clocks by trial and error :|
edit: also I am now running kernel 5.10.1-1 with a few patches (hopefully irrelevant). And that didn't improve the situation.
As of now, the 5.10 kernel seems to at least break the ability to set the power limit on Navi 10 for some reason; I still need to do some further digging to understand why this happens and if there are any workarounds. But since you also tried 5.9, that shouldn't be the main culprit. There might still be driver/firmware issues that AMD will sort out eventually.
Are you able to lower the power limit by the way (set it to 150 W for example, using powerupp)?
I'm dropping in since I'm testing on 6800XT. PowerUPP reads all values correctly but nothing changes when trying to apply something new.
cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap doesn't show any change when applying something with PowerUPP; however, echo 293000000 | sudo tee /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap does work and sets the power limit to maximum. I also tried undervolting heavily but nothing seemed to crash the system, so that probably isn't working either.
PowerUPP does ask for permissions when applying so I wouldn't think it's a problem with permissions. Glad to help with debugging, not really familiar with the interfaces so can't do it by myself.
Thanks, can you try to run PowerUPP from terminal (powerupp) while setting the values and see if anything strange shows there?
Getting any errors with dmesg | grep amdgpu?
Main screen 240hz 1440p, secondary monitor 60hz 1080p, both DP. Tried with only 1 screen over HDMI. Memory setting still didn't work. And setting memory more than 1 results in instability/freezes.
I skimmed through some forums and it appears to be difficult to adjust the memory clock at least on high refresh rate monitors (note that the memory clocks are reported differently under Windows and Linux, I believe the values are halved under Linux).
Thanks, can you try to run PowerUPP from terminal (powerupp) while setting the values and see if anything strange shows there? Getting any errors with dmesg | grep amdgpu?
That was helpful, I ran PowerUPP in a terminal and got the following:
Sorry, user root is not allowed to execute '/bin/zsh -c /usr/bin/upp --pp-file /sys/class/drm/card0/device/pp_table set --write smc_pptable/MaxVoltageGfx=4000 smc_pptable/SocketPowerLimitAc/0=293 smc_pptable/FreqTableGfx/1=2577 smc_pptable/MemMvddVoltage/3=5400 smc_pptable/MemVddciVoltage/3=3400 smc_pptable/FreqTableUclk/3=1000 smc_pptable/MaxVoltageSoc=4600 smc_pptable/FreqTableSocclk/1=1200 smc_pptable/qStaticVoltageOffset/0/c=0.000000 smc_pptable/MemMvddVoltage/0=5000 smc_pptable/MemVddciVoltage/0=2700 smc_pptable/FreqTableUclk/0=97 smc_pptable/MemMvddVoltage/1=5400 smc_pptable/MemVddciVoltage/1=3200 smc_pptable/FreqTableUclk/1=457 smc_pptable/MemMvddVoltage/2=5400 smc_pptable/MemVddciVoltage/2=3400 smc_pptable/FreqTableUclk/2=674 smc_pptable/MinVoltageGfx=3524 smc_pptable/MinVoltageSoc=3800' as user on arch-pc.
Sorry, user root is not allowed to execute '/usr/sbin/tee /sys/class/hwmon/hwmon3/power1_cap' as root on arch-pc.
After trying the command with sudo it worked and lowered the voltage and changed the power limit. Any ideas how to get it to work? Is root not a member of some group? And it would probably be useful to show some sort of error message in the GUI if this happens.
edit: changing gfx frequency doesn't seem to work, if I raise it the gpu doesn't boost anymore. But this is probably a driver issue.
After trying the command with sudo it worked and lowered the voltage and changed the power limit. Any ideas how to get it to work? Is root not a member of some group?
It would appear that root is not allowed to issue sudo commands? Maybe try something like this?
And would probably be useful to insert some sort of error message in the GUI if this happens.
Absolutely. I just pushed a new commit to the bignavi branch, before you fix the cause let me know if this works as intended please.
edit: changing gfx frequency doesn't seem to work, if I raise it the gpu doesn't boost anymore. But this is probably a driver issue.
It seems so unfortunately, hopefully something that will get sorted out.
Are you able to lower the power limit by the way (set it to 150 W for example, using powerupp)?
Setting it to 150W seems to work just fine and it's also reflected in the power draw.
However when trying to go back to 233 W I got the following.
Dec 21 19:27:00 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: New power limit (233) is over the max allowed 172
:D
However when trying to go back to 233 W I got the following.
Dec 21 19:27:00 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: New power limit (233) is over the max allowed 172
Interesting. For navi 10 this error message only seems to appear under kernel 5.10 and when not using the amdgpu.ppfeaturemask=0xffffffff flag, do you still have it set?
Probably not important, but it is notable that, contrary to navi 10, it seems to properly calculate the max allowed (150 + 15%); with navi 10 it would have said "max allowed 150" (the actual max value set in the powerplay table). The message (at least for navi 10) is not triggered when setting the powerplay table values but when applying the value to sysfs.
Can you set the power cap manually with echo 233000000 | sudo tee /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap after getting this error? If it's possible it would seem like there's a timing issue (PowerUPP tries to set the sysfs value before the re-initialization of the powerplay table is complete), which would be fixable.
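The "max allowed 172" in the log is consistent with the 150 W table limit plus 15%, truncated to a whole watt. A small sketch of that arithmetic follows; max_power_cap is a made-up name, and the truncating integer division is an assumption about how the driver rounds that happens to match the logged value.

```shell
#!/bin/sh
# Hypothetical helper: largest accepted power cap in watts, given the limit
# from the powerplay table ($1) and the allowed overdrive percentage ($2).
# Integer division truncates; that rounding is an assumption, not taken
# from the driver source.
max_power_cap() {
    echo $(( $1 * (100 + $2) / 100 ))
}

max_power_cap 150 15   # prints 172
```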
After trying the command with sudo it worked and lowered the voltage and changed the power limit. Any ideas how to get it to work? Is root not a member of some group?
It would appear that root is not allowed to issue sudo commands? Maybe try something like this?
That was it, that is commented by default in arch sudo package. Thanks.
Interesting. For navi 10 this error message only seems to appear under kernel 5.10 and when not using the amdgpu.ppfeaturemask=0xffffffff flag, do you still have it set?
The test below is without the flag
[root@quasd ~]# cat /proc/cmdline | grep amd
[root@quasd ~]#
Can you set the power cap manually
echo 233000000 | sudo tee /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap
after getting this error? If it's possible it would seem like there's a timing issue (PowerUPP tries to set the sysfs value before the re-initialization of the powerplay table is complete), which would be fixable.
It still appears to be broken:
[root@quasd ~]# echo 233000000 | sudo tee /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap
233000000
[root@quasd ~]# cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap
150000000
[root@quasd ~]#
and in the syslog:
Dec 21 22:10:12 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: New power limit (233) is over the max allowed 150
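As the transcript above shows, tee echoes the requested value even when the cap does not change, so reading the file back is the only reliable confirmation. A minimal sketch of a write-then-verify helper follows; set_power_cap_checked is a made-up name, and the cap file path is passed in because the hwmon index differs between systems.

```shell
#!/bin/sh
# Hypothetical helper: write a power cap (in watts) to the given power1_cap
# file, read it back, and fail if the stored value differs.
set_power_cap_checked() {
    cap_file=$1
    want_uw=$(( $2 * 1000000 ))   # hwmon expects microwatts
    # The write itself may be rejected; ignore that and rely on the read-back.
    { echo "$want_uw" > "$cap_file"; } 2>/dev/null || true
    got_uw=$(cat "$cap_file")
    if [ "$got_uw" != "$want_uw" ]; then
        echo "requested $want_uw uW but cap is $got_uw uW" >&2
        return 1
    fi
}
```

Run as root, something like set_power_cap_checked /sys/class/hwmon/hwmon3/power1_cap 233 would be expected to return non-zero in the situation shown above, where the cap stays at 150000000.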