AMDGPU doesn't work correctly
Describe the bug: fan2go can't build a curve for amdgpu because the PWM value read back is not the same as the value written. fan2go retried again and again, and in the terminal I saw:
...
WARNING PWM of gpu was changed by third party! Last set PWM value was: 145 but is now: 107
WARNING PWM of gpu was changed by third party! Last set PWM value was: 145 but is now: 106
WARNING PWM of gpu was changed by third party! Last set PWM value was: 145 but is now: 106
WARNING PWM of gpu was changed by third party! Last set PWM value was: 145 but is now: 105
WARNING PWM of gpu was changed by third party! Last set PWM value was: 144 but is now: 104
WARNING PWM of gpu was changed by third party! Last set PWM value was: 144 but is now: 103
WARNING PWM of gpu was changed by third party! Last set PWM value was: 144 but is now: 103
WARNING PWM of gpu was changed by third party! Last set PWM value was: 144 but is now: 102
WARNING PWM of gpu was changed by third party! Last set PWM value was: 142 but is now: 101
WARNING PWM of gpu was changed by third party! Last set PWM value was: 142 but is now: 100
WARNING PWM of gpu was changed by third party! Last set PWM value was: 142 but is now: 99
WARNING PWM of gpu was changed by third party! Last set PWM value was: 141 but is now: 98
WARNING PWM of gpu was changed by third party! Last set PWM value was: 141 but is now: 97
WARNING PWM of gpu was changed by third party! Last set PWM value was: 141 but is now: 96
WARNING PWM of gpu was changed by third party! Last set PWM value was: 139 but is now: 95
WARNING PWM of gpu was changed by third party! Last set PWM value was: 139 but is now: 94
WARNING PWM of gpu was changed by third party! Last set PWM value was: 139 but is now: 93
WARNING PWM of gpu was changed by third party! Last set PWM value was: 139 but is now: 91
....
As I understand it, the firmware approximates the written PWM and sets a slightly different value. I believe fan2go should be able to create curves without relying on data read back from the fan.
To Reproduce: Just run any curve for an amdgpu fan.
Desktop:
- uname -a: Linux rainbow 5.18.12-gentoo #1 SMP PREEMPT_DYNAMIC Sat Jul 16 12:22:13 JST 2022 x86_64 AMD Ryzen 9 5900X 12-Core Processor AuthenticAMD GNU/Linux
- sensors -v: sensors version 3.6.0 with libsensors version 3.6.0
- fan2go version: 0.7.0
Are you sure fan2go is the only program controlling the fan when you got that output?
If so: fan2go expects a fixed mapping of "I want this much PWM" -> "You get that much PWM" from the fan driver. fan2go will even check for such a mapping on first initialization of a fan, to account for drivers which set a different PWM value than requested (mostly due to conversion errors).
If the driver of your AMD card can't provide that, fan2go will think that some other program interferes with it, or that something else must be terribly wrong.
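To make the expectation concrete, here is a minimal sketch of such a set-to-read mapping and the check built on it. It is not fan2go's actual code: `build_pwm_map`, `pwm_looks_foreign` and the toy `percent_driver` are hypothetical names, and a real tool would write and read the sysfs `pwm*` file instead of calling a function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical driver model: takes a requested PWM (0..255) and returns
 * what a later read would report. */
typedef uint8_t (*pwm_readback_fn)(uint8_t requested);

/* Record, for every requested value, the value the driver reports back. */
void build_pwm_map(pwm_readback_fn readback, uint8_t map[256])
{
    for (int requested = 0; requested < 256; requested++)
        map[requested] = readback((uint8_t)requested);
}

/* True when the current reading differs from what the map predicts for
 * the last value we set, i.e. the "changed by third party" case. */
bool pwm_looks_foreign(const uint8_t map[256], uint8_t last_set, uint8_t now)
{
    return map[last_set] != now;
}

/* Toy driver (assumption) that stores only a percentage internally,
 * so the value read back is slightly lower than the value written. */
uint8_t percent_driver(uint8_t requested)
{
    uint32_t percent = (uint32_t)requested * 100 / 255;
    return (uint8_t)(percent * 255 / 100);
}
```

With this toy driver, a request of 145 always reads back as 142, so the map absorbs the conversion error. A reading of 107 after setting 145 would still be flagged, because the map has no way to predict it.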
@markusressel do we have any guarantee from the lm-sensors kernel subsystem that the written value should be exactly the same as the value read back? Quick googling shows that extra changes by the firmware on GPUs are common. Anyway, I will open an issue on the amdgpu GitLab.
fan2go expects a fixed mapping of "I want this much PWM" -> "You get that much PWM" from the fan driver.
As you see in the WARNING message, we still have some relation between write and read values for PWM, but not direct. I think fan2go should cover such a case, especially if this is acceptable for libsensors (lm-sensors) and not a bug in the driver.
PS: It would be great to be able to build the curves DB manually.
Just putting this here: https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html#gpu-power-thermal-controls-and-monitoring
NOTE: DO NOT set the fan speed via “pwm1” and “fan[1-*]_target” interfaces at the same time. That will get the former one overridden.
I hope we don't do it, but I want to check.
do we have any guarantee from the lm-sensors kernel subsystem that the written value should be exactly the same as the value read back?
We have real world examples for the opposite (see #64) so I am not sure what such a guarantee would be worth if it existed. I just assume common sense and it "works for me (TM)" :smile: .
Quick googling shows that extra changes by the firmware on GPUs are common.
What common extra changes? What did you find and where? If those changes are predictable, then it should be fine as it is. If not, fan2go cannot account for them no matter what. :cry:
As you see in the WARNING message, we still have some relation between write and read values for PWM, but not direct.
Not necessarily. It seems like it, but it's not always the same. For example, in your log output 145 could be mapped to 107 or 106 as well as 105, with no logic as to why. There is no way fan2go can interpret this other than with a warning that something else must be interfering.
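The point can be sketched as a consistency check over (set, read) pairs: if the same written value is ever read back as two different values, no fixed mapping exists. The struct and function names below are hypothetical and only illustrate the reasoning.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pwm_sample {
    uint8_t set;   /* value written to the fan */
    uint8_t read;  /* value read back afterwards */
};

/* Returns true if every written value was always read back as the same value. */
bool mapping_is_consistent(const struct pwm_sample *samples, size_t count)
{
    int seen[256];

    for (int i = 0; i < 256; i++)
        seen[i] = -1;  /* -1 means "no reading recorded yet" */

    for (size_t i = 0; i < count; i++) {
        int previous = seen[samples[i].set];

        if (previous >= 0 && previous != samples[i].read)
            return false;  /* same input, different output */
        seen[samples[i].set] = samples[i].read;
    }
    return true;
}
```

Fed with the first four warnings from the log above (145 read back as 107, 106, 106 and 105), the check fails, which is why the readings cannot be folded into a static map.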
I hope we don't do it, but I want to check.
No. fan2go currently uses gosensors to find in- and outputs for lm-sensors based devices, and specifically creates a pwm* based file path for the pwm output:
https://github.com/markusressel/fan2go/blob/4868c6452eb5f3315c085d4a995e80de3803429c/internal/hwmon/hwmon.go#L165
- After all, why do we need to read PWM? As I understand the basics, we should establish the relation between written PWM and RPM, no?
- #64 is a huge and really good topic. I see a fundamental issue in fan2go: you expect the hardware API to be 100% correct, the same as a software API, but in most cases it is not. If you open the kernel source, most drivers have quirks and shims. As I understand it, lm-sensors writes directly to the driver, which in most cases means directly to the hardware, and the same goes for reading. So instead of fan2go -(write)-> lm-sensors API and fan2go <-(read)- lm-sensors API, we have fan2go -(write)-> lm-sensors API -> driver -> hardware, and the same chain for reads. All this shows that there is no read/write consistency (at least not 100%), and fan2go should expect that. If you are writing a driver or any application that works directly with hardware, you should expect behavior that is not fully correct. I know it's not beautiful, and if you have never worked with hardware before, it's weird, but it is what it is.
As a user, I just want some way to disable the PWM read check. For example, https://github.com/chestm007/amdgpu-fan doesn't check PWM after setting it. As far as I know, fancontrol doesn't either (it only detects the PWM at which the fan starts/stops and the relation between PWM and RPM).
I ran a few tests and checked the amdgpu code, and found the reason: amdgpu reports PWM based on the fan RPM.
First I played with PWM: if you set PWM and read the result immediately, nothing has happened yet; you need to wait a little. Sometimes you get different results. Then I found this issue: https://gitlab.freedesktop.org/drm/amd/-/issues/1164 After that I checked the code, and it's very clear:
int vega10_fan_ctrl_get_fan_speed_pwm(struct pp_hwmgr *hwmgr,
        uint32_t *speed)
{
    uint32_t current_rpm;
    uint32_t percent = 0;
    if (hwmgr->thermal_controller.fanInfo.bNoFan)
        return 0;
    if (vega10_get_current_rpm(hwmgr, &current_rpm))
        return -1;
    if (hwmgr->thermal_controller.
            advanceFanControlParameters.usMaxFanRPM != 0)
        percent = current_rpm * 255 /
            hwmgr->thermal_controller.
            advanceFanControlParameters.usMaxFanRPM;
    *speed = MIN(percent, 255);
    return 0;
}
(My Vega56 is a vega10 chip.) It looks like the driver has no access to PWM at the hardware (firmware) level and makes all changes based on RPM. At the same time, changing fan_target works much more predictably.
I tried using the vega20 function to read PWM, and it works mostly stably, losing only about ±2 in accuracy. Now I have an issue similar to #64. It now looks like this:
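The arithmetic of the vega10 read path can be reproduced outside the kernel. This is a standalone sketch of the same formula, not the driver code itself; `pwm_from_rpm` is a hypothetical name, the 3200 RPM maximum is the figure reported later in this thread, and the 1300 RPM sample is made up for illustration.

```c
#include <stdint.h>

/* Mirrors the vega10 formula: the reported "PWM" is just the measured
 * RPM scaled to 0..255 against the maximum fan RPM. */
uint32_t pwm_from_rpm(uint32_t current_rpm, uint32_t max_rpm)
{
    uint32_t pwm = 0;

    if (max_rpm != 0)
        pwm = current_rpm * 255 / max_rpm;  /* integer division truncates */

    return pwm < 255 ? pwm : 255;
}
```

For instance, `pwm_from_rpm(1300, 3200)` gives 103. The value depends only on the measured fan speed, so while the fan is still spinning up or down the reported PWM keeps drifting no matter which value was written.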
int vega20_fan_ctrl_get_fan_speed_pwm(struct pp_hwmgr *hwmgr,
        uint32_t *speed)
{
    struct amdgpu_device *adev = hwmgr->adev;
    uint32_t duty100, duty;
    uint64_t tmp64;
    duty100 = REG_GET_FIELD(RREG32_SOC15(THM, 0, mmCG_FDO_CTRL1),
                CG_FDO_CTRL1, FMAX_DUTY100);
    duty = REG_GET_FIELD(RREG32_SOC15(THM, 0, mmCG_THERMAL_STATUS),
                CG_THERMAL_STATUS, FDO_PWM_DUTY);
    if (!duty100)
        return -EINVAL;
    tmp64 = (uint64_t)duty * 255;
    do_div(tmp64, duty100);
    *speed = MIN((uint32_t)tmp64, 255);
    return 0;
}
As we can see, hardware expects 0-100, and we must interpolate to get 0-255.
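A small worked example of why a write followed by a read loses a few units. The read direction follows the function quoted above; the write direction (`duty_from_pwm`) is an assumption about how a 0..255 value would be scaled down and is not taken from the thread. Both function names are hypothetical, and `duty100 = 100` is assumed from the 0-100 range mentioned above.

```c
#include <stdint.h>

/* Assumed write path: scale a 0..255 PWM request down to the duty range. */
uint32_t duty_from_pwm(uint32_t pwm, uint32_t duty100)
{
    if (pwm > 255)
        pwm = 255;
    return (uint32_t)((uint64_t)pwm * duty100 / 255);
}

/* Read path, same arithmetic as the quoted vega20 function. */
uint32_t pwm_from_duty(uint32_t duty, uint32_t duty100)
{
    uint64_t pwm;

    if (duty100 == 0)
        return 0;  /* the kernel function returns -EINVAL here */

    pwm = (uint64_t)duty * 255 / duty100;
    return pwm < 255 ? (uint32_t)pwm : 255;
}
```

Under these assumptions, writing 145 stores a duty of 56, which reads back as 142. The loss is small and, unlike the RPM-derived value, it is the same every time, so a static set-to-read map can absorb it.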
I also created an issue on GitLab: https://gitlab.freedesktop.org/drm/amd/-/issues/2108
After my GPU function changes, and after increasing the RPM polling timeout to 3s, I got the following for my GPU:

Okay, now I have a different issue: at PWM 0, the real fan RPM is normal, but I get weird numbers from the driver (basically the max RPM, 3200). Or perhaps it returns the max RPM if the fan is stopped. UPDATE: if the current RPM is 0, the controller returns the latest valid state.
After all, why do we need to read PWM? As I understand the basics, we should establish the relation between written PWM and RPM, no?
Well, checking that the RPM makes sense is also an option; however, not all fans support a dedicated RPM reading. I chose to check the PWM to make sure nothing else (like the mainboard or some other fan controller software) was interfering with fan2go's control. It's a basic check to make sure it can work as intended.
As a user, I just want some way to disable the PWM read check. For example, https://github.com/chestm007/amdgpu-fan doesn't check PWM after setting it. As far as I know, fancontrol doesn't either (it only detects the PWM at which the fan starts/stops and the relation between PWM and RPM).
If other tools do safety checks differently (for whatever reason), and that suits your needs better than fan2go, feel free to use that software instead :smile: I wrote fan2go to fit my personal use case. I also published it for good measure, but not to make it suit other people's needs before my own :sweat_smile:
It is a warning (and NOT an error!) for a reason. If you think it's fine, you can ignore it. It serves my use case very well, so disabling this check completely would make fan2go worse for me. If you want to get rid of this check, you can also fork the repo and remove the code, although that comes with the usual caveat of upstream changes not getting pulled in automatically etc.
If you are writing a driver or any application that works directly with hardware, you should expect behavior that is not fully correct. I know it's not beautiful, and if you have never worked with hardware before, it's weird, but it is what it is.
I didn't experience any weirdness; other people do. If fan2go isn't working in those scenarios, maybe a different tool is more appropriate. I certainly don't want to add loads of workarounds to fan2go just because "most drivers have quirks and shims". It's an abstraction layer for a reason...
amdgpu reports PWM based on the fan RPM.
If you set PWM and read the result immediately, nothing has happened yet; you need to wait a little. Sometimes you get different results.
This will certainly mess up the check, as well as the initial mapping of PWM values, which is created when the fan is used for the first time. I would expect fan2go to fail at controlling the fan in a useful way.
As we can see, hardware expects 0-100, and we must interpolate to get 0-255.
To overcome such a scenario, the previously mentioned PWM map is created. If the PWM measurement isn't immediately accurate, this will result in a messed up mapping.
Maybe it would be a possibility to add support for setting the target RPM instead of the PWM, if that leads to a stable output and input. However, it wouldn't be that straightforward, because we would also need to find out the possible range (like 0..255 but for RPM).
Maybe it would be a possibility to add support for setting the target RPM instead of the PWM, if that leads to a stable output and input. However, it wouldn't be that straightforward, because we would also need to find out the possible range (like 0..255 but for RPM).
Okay, that's one option, but I believe the second option, where we detect the PWM/RPM curve without checking PWM, would be good as well and would not increase complexity for users or applications. Basically, with a few extra parameters we can cover most corner cases.
However, it wouldn't be that straightforward, because we would also need to find out the possible range (like 0..255 but for RPM).
Some fans expose RPM min/max parameters, so the range is easy to detect.
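If the limits are known, translating a 0..255 curve output into a target RPM is a linear interpolation. The sketch below assumes the limits have already been read (for amdgpu, presumably from hwmon attributes such as `fan1_min` and `fan1_max`); the function name and the sample limits of 800 and 3200 RPM are hypothetical.

```c
#include <stdint.h>

/* Map a 0..255 curve output linearly onto [rpm_min, rpm_max]. */
uint32_t target_rpm_from_pwm(uint32_t pwm, uint32_t rpm_min, uint32_t rpm_max)
{
    if (pwm > 255)
        pwm = 255;
    if (rpm_max <= rpm_min)
        return rpm_min;  /* degenerate range: nothing to interpolate */

    return rpm_min + (uint32_t)((uint64_t)pwm * (rpm_max - rpm_min) / 255);
}
```

With limits of 800 and 3200 RPM, a curve output of 128 becomes a target of 2004 RPM, and the endpoints 0 and 255 map exactly to 800 and 3200.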
If other tools do safety checks differently (for whatever reason), and that suits your needs better than fan2go, feel free to use that software instead
I migrated to fan2go precisely because there is no other software for Linux that supports such features. Basically, only fan2go supports function curves.
I wrote fan2go to fit my personal use case. I also published it for good measure, but not to make it suit other people's needs before my own.
I understand, but it usually doesn't work like that. I believe there is no sense in publishing something if you don't expect to satisfy users. In that case, users will be frustrated, and you will be frustrated because users can be impolite or disagree with you. It will be difficult to avoid that role (I have had such an experience already, and it was a really good lesson for me).
Sorry for the long silence. I finally sent the patch to the amdgpu mailing list, though I'm not sure it will succeed. https://lists.freedesktop.org/archives/amd-gfx/2022-September/083705.html
Okay, 6.1 is out with my patch, and vega10 cards now work correctly.