investigate regarding drm/amd issue 3131

Open andrew-ld opened this issue 1 year ago • 17 comments

Hi, I am the author of the issue https://gitlab.freedesktop.org/drm/amd/-/issues/3131, and I think the LACT developers should be aware of it, especially the last comments.

https://gitlab.freedesktop.org/drm/amd/-/issues/3131#note_2415553

andrew-ld avatar May 16 '24 07:05 andrew-ld

Interesting - there have already been issues with the order in which settings are applied, but lact should handle what's described in the issue fine. The current order for applying settings is:

  • Power cap
  • Clocks table
  • Performance level
  • Fan curve

Code that handles this: https://github.com/ilya-zlobintsev/LACT/blob/master/lact-daemon/src/server/gpu_controller/mod.rs#L719 Did you manage to hit the issue when applying the settings in lact, or are you just informing about its existence?

ilya-zlobintsev avatar May 16 '24 15:05 ilya-zlobintsev

I opened this issue to keep track of the status of things; however, I can't even change the fan speed on my Sapphire 7900 XTX.

For example, I've tried setting the fans to full speed, both on the curve and with a static speed, and nothing seems to happen.

andrew-ld avatar May 16 '24 16:05 andrew-ld

Is this the case only when you set the fan speed using lact, or when manually writing to the sysfs (like the examples in the linked issue) as well?

ilya-zlobintsev avatar May 16 '24 16:05 ilya-zlobintsev

lact

andrew-ld avatar May 16 '24 18:05 andrew-ld

When I write anything to /sys/class/drm/card?/device/gpu_od/fan_ctrl/{acoustic_limit_rpm_threshold,acoustic_target_rpm_threshold,fan_minimum_pwm,fan_target_temperature}, everything set via pp_od_clk_voltage gets ignored by the GPU, no matter in which order they are set. So it is not possible to alter, for example, fan_minimum_pwm when also setting clock speeds. fan_curve seems to be the exception in my limited testing when set before altering pp_od_clk_voltage.

Doing things manually, it is possible to set clock speeds, voltage offset, and fan curve, but I am unable to do so in LACT without it getting ignored by the GPU since LACT seems to always restore the serialized values for those aforementioned settings even if they are default values.

https://github.com/ilya-zlobintsev/LACT/blob/0d675c5b3a09be4f5fdcbc441b618cea7158d79f/lact-daemon/src/server/gpu_controller/mod.rs#L817-L841

Though I am aware this is primarily a driver issue, it would be nice to have a way to not write to all fan_ctrl/* sysfs files when applying other settings/launching lactd.

Sapphire NITRO+ RX 7900 XTX Vapor-X
Kernel 6.10.0-0.rc3.20240612git2ef5971ff345.33

zenofile avatar Jun 13 '24 15:06 zenofile

Makes sense, we can at least avoid writing to the files if the value is unchanged.
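
A minimal sketch of that idea, not LACT's actual code: the helper name `write_if_changed` is hypothetical, and it assumes the file reads back exactly the value that was written, which the real fan_ctrl files may not do (their output may need parsing first). The demo runs against a scratch directory rather than real sysfs.

```shell
# Hypothetical helper: write VALUE to FILE only when it differs from what
# the file currently reports, so unchanged settings cause no write at all.
write_if_changed() {
    local file=$1 value=$2 current
    current=$(cat "$file" 2>/dev/null)
    if [ "$current" = "$value" ]; then
        echo "skip $file"
        return 0
    fi
    printf '%s\n' "$value" > "$file"
    echo "write $file"
}

# Demo against a scratch directory standing in for gpu_od/fan_ctrl.
demo_dir=$(mktemp -d)
printf '25\n' > "$demo_dir/fan_minimum_pwm"
first=$(write_if_changed "$demo_dir/fan_minimum_pwm" 25)    # value unchanged
second=$(write_if_changed "$demo_dir/fan_minimum_pwm" 30)   # value changed
echo "$first"
echo "$second"
```

Skipping the write entirely matters here because, per the report above, any write to the fan_ctrl files can cause the pp_od_clk_voltage settings to be ignored.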

ilya-zlobintsev avatar Jun 13 '24 16:06 ilya-zlobintsev

@zenofile i've added checks for this in https://github.com/ilya-zlobintsev/LACT/commit/ca3e54015a39f7cc0c840643def5e642ef8ef101, could you test if it helps?

ilya-zlobintsev avatar Jun 13 '24 16:06 ilya-zlobintsev

Thanks for looking into this. When the Automatic fan mode is enabled with default values, it seems to work as intended. However, when Curve is active, even with default values, it doesn't seem to work.

Thermals → Automatic, default values
OC → Basic → Clocks + Voltage offset altered → Apply

⇒ OC Values are applied and working, however fan_curve is still written to (reset?).

inotify event list (each inotify event report is from a single application of said values):
# inotifywait -r -m -e modify .
Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve

Thermals → Curve, default values
OC → Basic → Clocks + Voltage offset altered → Apply

⇒ OC values are ignored by the GPU, fan_curve is written to last.

inotify event list:
# inotifywait -r -m -e modify .
Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
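
Event lists like the ones above are easier to compare once consecutive events on the same file are collapsed into counts. The following is only a sketch: `summarize_events` is a hypothetical helper name, the sample is the first six lines of the second list, and one inotify MODIFY event does not necessarily correspond to exactly one write by the daemon.

```shell
# Collapse runs of consecutive MODIFY events on the same file into
# "count file" lines, preserving the order in which the runs occurred.
summarize_events() {
    awk '$2 == "MODIFY" { print $3 }' | uniq -c | awk '{ print $1, $2 }'
}

sample='./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve'

summary=$(printf '%s\n' "$sample" | summarize_events)
printf '%s\n' "$summary"
```

The same function could be fed directly from `inotifywait -r -m -e modify .`, although `uniq` only emits a run once the next different file shows up.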

But this is better than it was before; now when lactd is restarted at least clockspeed and voltage values are respected in unaltered automatic fan mode (default).

zenofile avatar Jun 14 '24 15:06 zenofile

I tried experimenting with the order a little: when any values are written into pp_od_clk_voltage after the fan values are committed, the OC settings get ignored by the GPU. The actual committing can be done in any order, though, so as long as the commits only happen at the end, after everything is written, it works fine. Maybe this was clear from the beginning, but I did not find any documentation mentioning this.

Also, resets can be issued on fan_curve, acoustic_limit_rpm_threshold and acoustic_target_rpm_threshold. But after any reset on fan_minimum_pwm or fan_target_temperature once pp_od_clk_voltage was committed, the OC settings get ignored again 🤷🏻.

For example, this works fine:

# Adjust to the card in question.
gpu=card1
device=/sys/class/drm/${gpu}/device
fan=/sys/class/drm/${gpu}/device/gpu_od/fan_ctrl

# Reset the fan settings first.
echo 'r' > "$fan/fan_target_temperature"
echo 'r' > "$fan/acoustic_target_rpm_threshold"
echo 'r' > "$fan/acoustic_limit_rpm_threshold"
echo 'r' > "$fan/fan_minimum_pwm"

sleep 0.25s

echo 'auto' > "$device/power_dpm_force_performance_level"

# Write the new fan values without committing them yet.
echo '25' > "$fan/fan_minimum_pwm"
echo '75' > "$fan/fan_target_temperature"

# Write the clock and voltage offset values, again without committing.
echo 's 1 2525' > "$device/pp_od_clk_voltage"
echo 'vo -100' > "$device/pp_od_clk_voltage"

# Commit only at the end, once everything is written.
echo 'c' > "$fan/fan_minimum_pwm"
echo 'c' > "$fan/fan_target_temperature"
echo 'c' > "$device/pp_od_clk_voltage"

zenofile avatar Jun 14 '24 17:06 zenofile

Interesting. Currently the values are committed right away; I'll see if I can defer the commit until everything is written.
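
A rough sketch of what deferring the commit could look like, under stated assumptions: `stage` and `commit_all` are hypothetical names, this is not the daemon's implementation, and the demo appends to files in a scratch directory (so they keep a record of every write) instead of touching real sysfs. The values are the ones from the manual script above.

```shell
# Files that have received a value and still need a 'c' (commit) write.
staged_files=""

# stage FILE VALUE: write the value now, remember the file for later.
# Paths are assumed to contain no whitespace, as is the case for sysfs.
stage() {
    printf '%s\n' "$2" >> "$1"
    case " $staged_files " in
        *" $1 "*) ;;                            # already queued for commit
        *) staged_files="$staged_files $1" ;;
    esac
}

# commit_all: write 'c' once to every staged file, after all values are in.
commit_all() {
    for f in $staged_files; do
        printf 'c\n' >> "$f"
    done
    staged_files=""
}

# Demo: the scratch directory stands in for /sys/class/drm/cardN/device.
dev=$(mktemp -d)
mkdir -p "$dev/gpu_od/fan_ctrl"
stage "$dev/gpu_od/fan_ctrl/fan_minimum_pwm" 25
stage "$dev/gpu_od/fan_ctrl/fan_target_temperature" 75
stage "$dev/pp_od_clk_voltage" 's 1 2525'
stage "$dev/pp_od_clk_voltage" 'vo -100'
commit_all
cat "$dev/pp_od_clk_voltage"
```

Keeping the list of staged files separate from the writes means each file is committed exactly once, even when it received several values.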

ilya-zlobintsev avatar Jun 15 '24 12:06 ilya-zlobintsev

@zenofile i've pushed the new logic where everything is committed at once to the deferred-commit branch, could you test if it works?

ilya-zlobintsev avatar Jun 15 '24 12:06 ilya-zlobintsev

Unfortunately the OD values get ignored.

Some data from launching the lact daemon; all relevant GPU settings were reset manually beforehand (though it makes no difference if they are not):

  • info.json:
{
  "initramfs_type": "Dracut",
  "system_info": {
    "amdgpu_overdrive_enabled": true,
    "commit": "8638d24",
    "kernel_version": "6.10.0-0.rc3.20240612git2ef5971ff345.36.local.fc40.x86_64",
    "profile": "release",
    "version": "0.5.5"
  }
}
  • /etc/lact/config.yaml:
daemon:
  log_level: debug
  admin_groups:
  - wheel
  - sudo
  disable_clocks_cleanup: false
apply_settings_timer: 5
gpus:
  xxx-0000:03:00.0:
    fan_control_enabled: false
    fan_control_settings:
      mode: curve
      static_speed: 0.5
      temperature_key: edge
      interval_ms: 500
      curve:
        40: 0.15
        50: 0.29999998
        60: 0.45
        70: 0.65
        80: 0.9
      spindown_delay_ms: 0
      change_threshold: 0
    pmfw_options:
      acoustic_limit: 3200
      acoustic_target: 1450
      minimum_pwm: 25
      target_temperature: 75
    performance_level: auto
    max_core_clock: 2525
    voltage_offset: -100
    power_states: {}
  • /usr/bin/lact daemon
DEBUG lact_daemon: current system uptime: 3162.4s
 INFO lact_daemon::socket: listening on "/var/run/lactd.sock"
DEBUG lact_daemon::server::handler: initialized GPU controller xxx-0000:03:00.0 for path "/sys/class/drm/card1/device"
DEBUG lact_daemon::server::handler: found intialized drm entry for device "/sys/bus/pci/devices/0000:03:00.0"
 INFO lact_daemon::server::handler: initialized 1 GPUs
DEBUG lact_daemon::server::gpu_controller: writing clocks commands: [
    "s 1 2525",
    "vo -100",
]
  • inotifywait -qrme modify .
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm

When altering fan and clock settings in the GUI and applying, the values are ignored as well and the inotify event list is quite extensive.

inotify events
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./ MODIFY power_dpm_force_performance_level
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_curve
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm
./ MODIFY pp_od_clk_voltage
./ MODIFY pp_od_clk_voltage
./gpu_od/fan_ctrl/ MODIFY fan_target_temperature
./gpu_od/fan_ctrl/ MODIFY fan_minimum_pwm

It would help to see what is actually written to sysfs by the daemon. Is there a logging setting I can enable? The debug level seems to print only the clock speed settings.

zenofile avatar Jun 15 '24 17:06 zenofile

I straced the writes and replayed them manually in that order. The culprit is the reset on fan_curve (marked with ** below): in this example it somehow causes issues. When it is left out, moved after the writes to fan_target_temperature and fan_minimum_pwm, or moved before the writes to pp_od_clk_voltage, it seems to work fine. What a mess.

write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/pp_od_clk_voltage>, "r\n", 2) = 2
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/power_dpm_force_performance_level>, "auto", 4) = 4
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/pp_od_clk_voltage>, "s 1 2525\n", 9) = 9
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/pp_od_clk_voltage>, "vo -100\n", 8) = 8
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/power_dpm_force_performance_level>, "auto", 4) = 4
** write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_curve>, "r\n", 2) = 2
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_target_temperature>, "76\n", 3) = 3
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_minimum_pwm>, "26\n", 3) = 3
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/pp_od_clk_voltage>, "c\n", 2) = 2
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_target_temperature>, "c\n", 2) = 2
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_minimum_pwm>, "c\n", 2) = 2
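
To make the write order in a trace like this easier to read, the lines can be reduced to "file <- value" pairs. This is only a sketch: `summarize_writes` is a hypothetical helper, the sample lines are copied from the trace above, and the exact strace invocation is an assumption since it is not given here (something like `strace -f -y -e trace=write` attached to the daemon would produce this format).

```shell
# Reduce strace write() lines to "file <- value", dropping the long PCI
# path, the optional "** " marker, and the trailing literal \n in the value.
summarize_writes() {
    sed -nE 's|^[* ]*write\([0-9]+<.*/([^/>]+)>, "([^"]*)", [0-9]+\) = [0-9]+$|\1 <- \2|p' \
        | sed 's/\\n$//'
}

sample=$(cat <<'EOF'
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/pp_od_clk_voltage>, "vo -100\n", 8) = 8
** write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_curve>, "r\n", 2) = 2
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/gpu_od/fan_ctrl/fan_target_temperature>, "76\n", 3) = 3
write(10</sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/pp_od_clk_voltage>, "c\n", 2) = 2
EOF
)

writes=$(printf '%s\n' "$sample" | summarize_writes)
printf '%s\n' "$writes"
```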

zenofile avatar Jun 15 '24 17:06 zenofile

I've pushed a commit to reset the fan curve after writing other pmfw values, please tell me if it helps. And thanks for the detailed debug - it's unfortunate that this is so fragile.

ilya-zlobintsev avatar Jun 15 '24 18:06 ilya-zlobintsev

It works, both when restarting the daemon and when altering and applying settings via the GUI without a daemon restart.

zenofile avatar Jun 15 '24 19:06 zenofile

Good to know, I will merge these changes then.

ilya-zlobintsev avatar Jun 15 '24 19:06 ilya-zlobintsev

@andrew-ld could you check if this also solves the problem for you?

ilya-zlobintsev avatar Jun 16 '24 08:06 ilya-zlobintsev

Closing as this has been implemented and released.

ilya-zlobintsev avatar Jul 18 '24 15:07 ilya-zlobintsev