RyzenAdj

Possible race condition when setting indirect space registers

Open Iceburgino opened this issue 3 months ago • 1 comments

As I understand it, both the /dev/mem and ryzen_smu backends work by addressing memory-mapped indirect space registers.

Each operation is performed in two steps: first setting the target address, then reading from or writing to the data register. I suspect this is susceptible to a race condition when the same indirect register pair is used from two places concurrently. Example:

https://github.com/FlyGoat/RyzenAdj/blob/f011f0e468a255b74d18df5c16d8e3fac669acff/lib/linux/osdep_linux_mem.c#L84

Reader/Writer A -> Sets value for a target address
Reader/Writer B -> Overwrites the address with its own
Reader/Writer A -> Reads/writes from/to the data register
Reader/Writer B -> Reads/writes from/to the data register

Reader/writer A gets/puts data from/to the register of reader/writer B.
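
To make the interleaving above concrete, here is a toy Python model of an index/data register pair. It is only a sketch: `IndirectPort`, the addresses, and the stored values are all made up for illustration and are not taken from RyzenAdj or any real hardware. It replays the four steps in order without threads, so the result is deterministic.

```python
class IndirectPort:
    """Toy index/data register pair (hypothetical, not real hardware)."""

    def __init__(self, backing):
        self.backing = dict(backing)  # target address -> value
        self.index = 0                # the shared "address" register

    def set_index(self, addr):
        self.index = addr

    def read_data(self):
        # The data register always reflects whatever the index points at.
        return self.backing[self.index]


def interleaved_reads(port, addr_a, addr_b):
    """Replay the interleaving described above, one step per line."""
    port.set_index(addr_a)       # A sets its target address
    port.set_index(addr_b)       # B overwrites it with its own
    value_a = port.read_data()   # A reads, but the index now points at B's target
    value_b = port.read_data()   # B reads its own target
    return value_a, value_b


def serialized_reads(port, addr_a, addr_b):
    """Same accesses, with each index+data pair kept together."""
    port.set_index(addr_a)
    value_a = port.read_data()
    port.set_index(addr_b)
    value_b = port.read_data()
    return value_a, value_b


port = IndirectPort({0x10: "A's register", 0x20: "B's register"})
print(interleaved_reads(port, 0x10, 0x20))
print(serialized_reads(port, 0x10, 0x20))
```

In the interleaved case A ends up with the contents of B's register; keeping each address/data pair together as one unit avoids that, which only works if every party touching the registers agrees on the same serialization.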

If this is a real issue, it probably affects not only the /dev/mem backend but ryzen_smu too; there, the race window sits between reading the rsp register and deciding that you are free to write. Locks within ryzen_smu wouldn't help, since they only serialize operations that go through that driver. If other drivers access the same registers, you're still screwed.

https://github.com/amkillam/ryzen_smu/blob/172c316f53ac8f066afd7cb9e1da517084273368/smu.c#L190C9-L190C25
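
The check-then-write window and the driver-local lock problem can be sketched the same way. This is again a toy model under loud assumptions: `ToyMailbox`, the command numbers, and the argument values are hypothetical and do not describe the real SMU protocol or the ryzen_smu code. The two `threading.Lock` objects stand in for locks that each driver keeps privately.

```python
import threading


class ToyMailbox:
    """Toy single-slot mailbox (hypothetical registers and semantics)."""

    def __init__(self):
        self.rsp = 1        # nonzero means "previous command finished"
        self.arg = 0        # shared argument register
        self.executed = []  # (cmd, arg) pairs the toy firmware ran

    def is_idle(self):
        return self.rsp != 0

    def submit(self, cmd):
        # Writing the command runs it with whatever argument is
        # currently in the shared argument register.
        self.rsp = 0
        self.executed.append((cmd, self.arg))
        self.rsp = 1


def racy_sequence(mailbox, lock_x, lock_y):
    # Each driver takes only its own lock, so both acquisitions succeed
    # and both drivers are inside their "critical section" at once.
    got_x = lock_x.acquire(blocking=False)
    got_y = lock_y.acquire(blocking=False)
    try:
        x_sees_idle = mailbox.is_idle()  # X: mailbox looks free
        y_sees_idle = mailbox.is_idle()  # Y: mailbox also looks free
        mailbox.arg = 15000              # X writes its argument
        mailbox.arg = 0                  # Y overwrites the argument
        mailbox.submit(0x14)             # X's command runs with Y's argument
        mailbox.submit(0x07)             # Y's command
    finally:
        lock_x.release()
        lock_y.release()
    return got_x and got_y and x_sees_idle and y_sees_idle


mailbox = ToyMailbox()
both_entered = racy_sequence(mailbox, threading.Lock(), threading.Lock())
print(both_entered, mailbox.executed)
```

Both drivers see an idle mailbox and both hold "a lock", yet X's command is executed with the argument Y wrote. A lock only helps if everyone who touches the registers takes the same one.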

I think I have a potential reproduction of this issue. The clash seems to happen between amdgpu and /dev/mem-backed ryzenadj when I disconnect the laptop from AC. Here is a log snippet of it happening:

Sep 07 21:16:37 nixos systemd[1]: Started AMD CPU Power Management.
Sep 07 21:16:37 nixos ryzenadj-apply[107468]: no compatible ryzen_smu kernel module found, fallback to /dev/mem
Sep 07 21:16:38 nixos kernel: amdxdna 0000:65:00.1: [drm] *ERROR* aie2_smu_exec: smu cmd 7 timed out
Sep 07 21:16:38 nixos kernel: amdxdna 0000:65:00.1: [drm] *ERROR* npu4_set_dpm: Set soft dpm level 0 failed, ret -110
Sep 07 21:16:39 nixos kernel: amdxdna 0000:65:00.1: [drm] *ERROR* aie2_smu_exec: smu cmd 4 timed out
Sep 07 21:16:39 nixos kernel: amdxdna 0000:65:00.1: [drm] *ERROR* aie2_smu_fini: Power off failed, ret -110
Sep 07 21:16:47 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Dumping IP State
Sep 07 21:17:01 nixos kernel: amdgpu 0000:64:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Sep 07 21:17:01 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Failed to disable gfxoff!
Sep 07 21:17:07 nixos kernel: ACPI Error: Aborting method \_SB.A018 due to previous error (AE_AML_LOOP_TIMEOUT) (20250404/psparse-529)
Sep 07 21:17:07 nixos org_kde_powerdevil[2927]: kf.notifications: Playing audio notification failed: IO error
Sep 07 21:17:12 nixos kernel: amdgpu 0000:64:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Sep 07 21:17:12 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Failed to disable gfxoff!
Sep 07 21:17:21 nixos kernel: amdgpu 0000:64:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Sep 07 21:17:21 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Failed to disable gfxoff!
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Failed to disable gfxoff!
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Dumping IP State Completed
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Sep 07 21:17:30 nixos kernel: ACPI Error: Aborting method \_SB.ALIB due to previous error (AE_AML_LOOP_TIMEOUT) (20250404/psparse-529)
Sep 07 21:17:30 nixos systemd-logind[2014]: Power key pressed short.
Sep 07 21:17:30 nixos dbus-daemon[2572]: [session uid=1000 pid=2572] Activating service name='org.kde.LogoutPrompt' requested by ':1.31' (uid=1000 pid=2927 comm="/nix/store/09ds7q6mg0h4rvgwjjdmga8nja3yrih4-powerd")
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=261123, emitted seq=261125
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Process information: process .plasmashell-wr pid 2847 thread plasmashel:cs0 pid 2895
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Starting gfx_0.0.0 ring reset
Sep 07 21:17:33 nixos kernel: amdgpu 0000:64:00.0: amdgpu: MES failed to respond to msg=RESET
Sep 07 21:17:33 nixos kernel: [drm:amdgpu_mes_reset_legacy_queue [amdgpu]] *ERROR* failed to reset legacy queue
Sep 07 21:17:33 nixos kernel: amdgpu 0000:64:00.0: amdgpu: reset via MES failed and try pipe reset -110
Sep 07 21:17:33 nixos kernel: amdgpu 0000:64:00.0: amdgpu: The CPFW hasn't support pipe reset yet.
Sep 07 21:17:33 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Ring gfx_0.0.0 reset failed
Sep 07 21:17:33 nixos kernel: amdgpu 0000:64:00.0: amdgpu: GPU reset begin!
Sep 07 21:17:37 nixos (udev-worker)[107335]: BAT0: Spawned process '/nix/store/dq5fdsfzs2p4ixvy204a8m8j5fkgf6zv-tlp-1.8.0/sbin/tlp auto' [107527] is taking longer than 59s to complete.
Sep 07 21:17:37 nixos systemd-udevd[97862]: BAT0: Worker [107335] processing SEQNUM=6294 is taking a long time
Sep 07 21:17:57 nixos kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u64:0:106018]
Sep 07 21:17:57 nixos kernel: Modules linked in: ccm qrtr xt_set ip_set xt_addrtype xfrm_user xfrm_algo overlay rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat af_packet cmac algif_hash algif_skcipher af_alg bnep xt_conntrack ip6t_rpfilter ipt_rpfilter xt_pkttype xt_LOG nf_log_syslog nft_compat nf_tables sch_fq_codel xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth tun nvidia_uvm(O) nvidia_drm(O) uvcvideo nvidia_modeset(O) videobuf2_vmalloc uvc videobuf2_memops nls_iso8859_1 videobuf2_v4l2 nls_cp437 videobuf2_common btusb vfat fat btrtl videodev btintel btbcm btmtk mc bluetooth onboard_usb_dev nvidia(O) amdgpu snd_acp_legacy_mach snd_acp_mach snd_soc_nau8821 snd_acp3x_rn mt7925e mt7925_common snd_acp70 snd_acp_i2s mt792x_lib snd_acp_pdm snd_acp_pcm snd_soc_dmic snd_sof_amd_acp70 mt76_connac_lib snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt mt76 snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp mac80211 snd_sof joydev
Sep 07 21:17:57 nixos kernel:  snd_ctl_led mousedev snd_sof_utils snd_pci_ps snd_hda_codec_realtek snd_soc_acpi_amd_match snd_amd_sdw_acpi soundwire_amd snd_hda_codec_generic soundwire_generic_allocation snd_hda_scodec_component soundwire_bus snd_hda_codec_hdmi snd_hda_intel snd_soc_sdca snd_intel_dspcfg spd5118 hid_multitouch lenovo_wmi_hotkey_utilities intel_rapl_msr wmi_bmof snd_intel_sdw_acpi snd_soc_core cfg80211 snd_hda_codec snd_compress ac97_bus snd_pcm_dmaengine snd_rpl_pci_acp6x snd_acp_pci amdxcp drm_panel_backlight_quirks snd_amd_acpi_mach drm_buddy snd_acp_legacy_common drm_exec drm_suballoc_helper r8169 snd_pci_acp6x drm_display_helper snd_pci_acp5x snd_hda_core snd_rn_pci_acp3x realtek mdio_devres of_mdio fixed_phy sp5100_tco fwnode_mdio watchdog snd_acp_config snd_hwdep edac_mce_amd cec libphy ucsi_acpi amdxdna snd_pcm edac_core typec_ucsi ideapad_laptop snd_soc_acpi i2c_piix4 i2c_algo_bit drm_ttm_helper sparse_keymap ttm rfkill amd_atl intel_rapl_common polyval_clmulni snd_timer video ghash_clmulni_intel rapl roles
Sep 07 21:17:57 nixos kernel:  amd_pmf tpm_crb gpu_sched k10temp i2c_smbus crc16 snd snd_pci_acp3x libarc4 mdio_bus battery soundcore rtc_cmos thermal amdtee i2c_hid_acpi typec evdev wmi i2c_hid tiny_power_button amd_sfh tpm_tis platform_profile mac_hid tpm_tis_core button msr serio_raw ac psmouse loop kvm_amd ccp kvm irqbypass br_netfilter bridge fuse stp llc configfs efi_pstore nfnetlink efivarfs dmi_sysfs ip_tables autofs4 crc32c_cryptoapi dm_crypt encrypted_keys trusted asn1_encoder tee tpm rng_core libaescfb ecdh_generic ecc hid_generic usbhid hid input_leds led_class atkbd nvme libps2 xhci_pci vivaldi_fmap thunderbolt nvme_core xhci_hcd sha512_ssse3 i8042 sha1_ssse3 aesni_intel nvme_keyring serio nvme_auth dm_mod dax btrfs blake2b_generic xor raid6_pq
Sep 07 21:17:57 nixos kernel: CPU: 0 UID: 0 PID: 106018 Comm: kworker/u64:0 Tainted: G     U     O        6.16.5 #1-NixOS PREEMPT(voluntary) 
Sep 07 21:17:57 nixos kernel: Tainted: [U]=USER, [O]=OOT_MODULE
Sep 07 21:17:57 nixos kernel: Hardware name: LENOVO 83F1/LNVNB161216, BIOS RYCN23WW 06/27/2025
Sep 07 21:17:57 nixos kernel: Workqueue: dm_vblank_control_workqueue amdgpu_dm_crtc_vblank_control_worker [amdgpu]
Sep 07 21:17:57 nixos kernel: RIP: 0010:amdgpu_device_rreg.part.0+0x38/0xe0 [amdgpu]
Sep 07 21:17:57 nixos kernel: Code: 00 55 89 f5 53 48 89 fb 4c 3b a7 08 09 00 00 73 1b 83 e2 02 75 09 f6 87 a8 4b 05 00 10 75 77 4c 03 a3 10 09 00 00 45 8b 24 24 <eb> 12 4c 89 e6 48 8b 87 50 09 00 00 ff d0 0f 1f 00 41 89 c4 66 90
Sep 07 21:17:57 nixos kernel: RSP: 0018:ffffcc7e2099fc48 EFLAGS: 00000286
Sep 07 21:17:57 nixos kernel: RAX: ffffffffc2298a50 RBX: ffff8a02c3580000 RCX: 0000000000000000
Sep 07 21:17:57 nixos kernel: RDX: 0000000000000000 RSI: 0000000000003697 RDI: ffff8a02c3580000
Sep 07 21:17:57 nixos kernel: RBP: 0000000000003697 R08: 0000000000002000 R09: 0000000000000980
Sep 07 21:17:57 nixos kernel: R10: ffffcc7e5f91d100 R11: fefefefefefefeff R12: 0000000000000940
Sep 07 21:17:57 nixos kernel: R13: 0000000000000001 R14: ffffcc7e2099fdc7 R15: 0000000000000000
Sep 07 21:17:57 nixos kernel: FS:  0000000000000000(0000) GS:ffff8a0a0bb4f000(0000) knlGS:0000000000000000
Sep 07 21:17:57 nixos kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 07 21:17:57 nixos kernel: CR2: 00007ff372607b09 CR3: 000000012be82000 CR4: 0000000000f50ef0
Sep 07 21:17:57 nixos kernel: PKRU: 55555554
Sep 07 21:17:57 nixos kernel: Call Trace:
Sep 07 21:17:57 nixos kernel:  <TASK>
Sep 07 21:17:57 nixos kernel:  dm_read_reg_func+0x57/0xe0 [amdgpu]
Sep 07 21:17:57 nixos kernel:  dmub_srv_update_inbox_status.part.0+0x12/0xd0 [amdgpu]
Sep 07 21:17:57 nixos kernel:  dmub_srv_wait_for_idle+0x2c/0xa0 [amdgpu]
Sep 07 21:17:57 nixos kernel:  dc_dmub_srv_wait_for_idle+0x50/0x150 [amdgpu]
Sep 07 21:17:57 nixos kernel:  dmub_psr_enable+0x8f/0x110 [amdgpu]
Sep 07 21:17:57 nixos kernel:  edp_set_psr_allow_active+0x27b/0x3b0 [amdgpu]
Sep 07 21:17:57 nixos kernel:  amdgpu_dm_psr_disable+0x51/0x70 [amdgpu]
Sep 07 21:17:57 nixos kernel:  amdgpu_dm_crtc_vblank_control_worker+0x277/0x280 [amdgpu]
Sep 07 21:17:57 nixos kernel:  process_one_work+0x18a/0x340
Sep 07 21:17:57 nixos kernel:  worker_thread+0x225/0x360
Sep 07 21:17:57 nixos kernel:  ? __pfx_worker_thread+0x10/0x10
Sep 07 21:17:57 nixos kernel:  kthread+0xf8/0x250
Sep 07 21:17:57 nixos kernel:  ? finish_task_switch.isra.0+0x99/0x2e0
Sep 07 21:17:57 nixos kernel:  ? __pfx_kthread+0x10/0x10
Sep 07 21:17:57 nixos kernel:  ? __pfx_kthread+0x10/0x10
Sep 07 21:17:57 nixos kernel:  ret_from_fork+0x164/0x190
Sep 07 21:17:57 nixos kernel:  ? __pfx_kthread+0x10/0x10
Sep 07 21:17:57 nixos kernel:  ret_from_fork_asm+0x1a/0x30
Sep 07 21:17:57 nixos kernel:  </TASK>

So my question is: is my intuition about this right, and is this the issue I'm seeing? Or is it something else?

Iceburgino, Sep 12 '25 08:09

Here is a link to what, if true, I consider the dumbest, most reckless, and most irresponsible thing AMD could have done: allowing concurrency issues in a mailbox implementation instead of providing a proper atomic API:

https://www.overclock.net/posts/28950585/

We are talking about an SMU - a unit that can set voltages, currents and temperature limits...

Iceburgino, Sep 16 '25 22:09