amdgpu crash with 3.5.11
Your system information
- Steam client version: 1702667398
- SteamOS version: 3.5.11
- Opted into Steam client beta?: Yes
- Opted into SteamOS beta?: Yes
- Have you checked for updates in Settings > System?: Yes
Please describe your issue in as much detail as possible:
I expected gamescope and the GPU driver not to crash.
What happened:
Dec 00 00:44:53 steamdeck fancontrol.py[577]: Warning: CPU temperature of 94.0 greater than max 90! Setting fan to max speed.
Dec 00 00:44:54 steamdeck fancontrol.py[577]: Warning: CPU temperature of 92.2 greater than max 90! Setting fan to max speed.
Dec 00 00:44:55 steamdeck fancontrol.py[577]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 00 00:45:21 steamdeck fancontrol.py[577]: Warning: CPU temperature of 90.6 greater than max 90! Setting fan to max speed.
Dec 00 00:45:45 steamdeck dbus-daemon[572]: [system] Activating via systemd: service name='org.freedesktop.home1' unit='dbus-org.freedesktop.home1.service' requested by ':1.172' (uid=0 pid=5783 comm="sudo -s")
Dec 00 00:45:45 steamdeck dbus-daemon[572]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.home1.service': Unit dbus-org.freedesktop.home1.service not found.
Dec 00 00:48:19 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=35171, emitted seq=35175
Dec 00 00:48:19 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Dec 00 00:48:19 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Dec 00 00:48:19 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Dec 00 00:48:19 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
Dec 00 00:48:19 steamdeck kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
Dec 00 00:48:19 steamdeck kernel: [drm] PSP is resuming...
Dec 00 00:48:19 steamdeck (udev-worker)[5882]: devcd1: Process 'cat /sys/devices/virtual/devcoredump/devcd1/data > /var/lib/steamos-log-submitter/pending/devcoredump/4785' failed with exit code 1.
Dec 00 00:48:19 steamdeck kernel: [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR
Dec 00 00:48:19 steamdeck fancontrol.py[577]: Traceback (most recent call last):
Dec 00 00:48:19 steamdeck fancontrol.py[577]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 542, in <module>
Dec 00 00:48:19 steamdeck fancontrol.py[577]: controller.loop_control()
Dec 00 00:48:19 steamdeck fancontrol.py[577]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 486, in loop_control
Dec 00 00:48:19 steamdeck fancontrol.py[577]: self.loop_read_sensors()
Dec 00 00:48:19 steamdeck fancontrol.py[577]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 452, in loop_read_sensors
Dec 00 00:48:19 steamdeck fancontrol.py[577]: self.power_sensor.get_avg_value()
Dec 00 00:48:19 steamdeck fancontrol.py[577]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 356, in get_avg_value
Dec 00 00:48:19 steamdeck fancontrol.py[577]: self.values.append(self.get_value())
Dec 00 00:48:19 steamdeck fancontrol.py[577]: ^^^^^^^^^^^^^^^^
Dec 00 00:48:19 steamdeck fancontrol.py[577]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 351, in get_value
Dec 00 00:48:19 steamdeck fancontrol.py[577]: self.value = int(f.read().strip()) / 1000000
Dec 00 00:48:19 steamdeck fancontrol.py[577]: ^^^^^^^^
Dec 00 00:48:19 steamdeck fancontrol.py[577]: PermissionError: [Errno 1] Operation not permitted
Dec 00 00:48:19 steamdeck systemd[1]: jupiter-fan-control.service: Main process exited, code=exited, status=1/FAILURE
Dec 00 00:48:19 steamdeck fancontrol.py[5887]: loaded critical temp from SSD hwmon: 79.85
Dec 00 00:48:19 steamdeck fancontrol.py[5887]: returning fan to EC control loop
Dec 00 00:48:19 steamdeck systemd[1]: jupiter-fan-control.service: Failed with result 'exit-code'.
Dec 00 00:48:19 steamdeck systemd[1]: jupiter-fan-control.service: Consumed 15.142s CPU time.
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
Dec 00 00:48:20 steamdeck kernel: [drm] DMUB hardware initialized: version=0x0300000A
Dec 00 00:48:20 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Dec 00 00:48:20 steamdeck kernel: [drm] kiq ring mec 2 pipe 1 q 0
Dec 00 00:48:20 steamdeck kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Dec 00 00:48:20 steamdeck kernel: [drm] JPEG decode initialized successfully.
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset(1) succeeded!
Dec 00 00:48:20 steamdeck systemd[1]: jupiter-fan-control.service: Scheduled restart job, restart counter is at 1.
Dec 00 00:48:20 steamdeck systemd[1]: Stopped Jupiter fan control.
Dec 00 00:48:20 steamdeck systemd[1]: jupiter-fan-control.service: Consumed 15.142s CPU time.
Dec 00 00:48:20 steamdeck systemd[1]: Started Jupiter fan control.
Dec 00 00:48:21 steamdeck fancontrol.py[5897]: loaded critical temp from SSD hwmon: 79.85
Dec 00 00:48:21 steamdeck fancontrol.py[5897]: jupiter-fan-control started successfully.
Dec 00 00:49:25 steamdeck dbus-daemon[572]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.174' (uid=1000 pid=5944 comm="timedatectl status")
Dec 00 00:49:25 steamdeck systemd[1]: Starting Time & Date Service...
Dec 00 00:49:25 steamdeck dbus-daemon[572]: [system] Successfully activated service 'org.freedesktop.timedate1'
Dec 00 00:49:25 steamdeck systemd[1]: Started Time & Date Service.
Dec 00 00:49:35 steamdeck systemd[1]: Created slice Slice /system/systemd-coredump.
Dec 00 00:49:35 steamdeck systemd[1]: Started Process Core Dump (PID 5980/UID 0).
Dec 00 00:49:35 steamdeck core_handler[5981]: Minidump generated at /var/lib/steamos-log-submitter/pending/minidump/.staging-1702676974-gamescope-xwm-3325-None.dmp
Dec 00 00:49:35 steamdeck kernel: input: Steam Deck as /devices/pci0000:00/0000:00:08.1/0000:04:00.4/usb3/3-3/3-3:1.2/0003:28DE:1205.0003/input/input35
Dec 00 00:49:35 steamdeck systemd-coredump[5982]: Process 3325 (gamescope-wl) of user 1000 dumped core.
Stack trace of thread 3360:
#0 0x00007f9d5589f26c n/a (libc.so.6 + 0x8926c)
#1 0x00007f9d5584fa08 raise (libc.so.6 + 0x39a08)
#2 0x00007f9d55838538 abort (libc.so.6 + 0x22538)
#3 0x00007f9d5583845c n/a (libc.so.6 + 0x2245c)
#4 0x00007f9d558483d6 __assert_fail (libc.so.6 + 0x323d6)
#5 0x0000561db0f8cd97 n/a (gamescope + 0x7fd97)
#6 0x0000561db0f960ca n/a (gamescope + 0x890ca)
#7 0x0000561db0f658a0 n/a (gamescope + 0x588a0)
#8 0x0000561db0f67b3f n/a (gamescope + 0x5ab3f)
#9 0x0000561db0f82fac n/a (gamescope + 0x75fac)
#10 0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#11 0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
#12 0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)
Stack trace of thread 3325:
#0 0x00007f9d55913c0f __poll (libc.so.6 + 0xfdc0f)
#1 0x0000561db0f8555f n/a (gamescope + 0x7855f)
#2 0x0000561db0f2f446 n/a (gamescope + 0x22446)
#3 0x00007f9d55839850 n/a (libc.so.6 + 0x23850)
#4 0x00007f9d5583990a __libc_start_main (libc.so.6 + 0x2390a)
#5 0x0000561db0f51555 n/a (gamescope + 0x44555)
Stack trace of thread 3326:
#0 0x00007f9d55921266 epoll_wait (libc.so.6 + 0x10b266)
#1 0x0000561db0f73bcf n/a (gamescope + 0x66bcf)
#2 0x0000561db0f77424 n/a (gamescope + 0x6a424)
#3 0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#4 0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
#5 0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)
Stack trace of thread 3328:
#0 0x00007f9d55913c0f __poll (libc.so.6 + 0xfdc0f)
#1 0x0000561db0f84987 n/a (gamescope + 0x77987)
#2 0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#3 0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
#4 0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)
Stack trace of thread 3330:
#0 0x00007f9d558e59e5 clock_nanosleep (libc.so.6 + 0xcf9e5)
#1 0x00007f9d558ea5e7 __nanosleep (libc.so.6 + 0xd45e7)
#2 0x00007f9d54100455 n/a (libvulkan_radeon.so + 0x100455)
#3 0x00007f9d5425c7cc n/a (libvulkan_radeon.so + 0x25c7cc)
#4 0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
#5 0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)
Stack trace of thread 3359:
#0 0x00007f9d55913c0f __poll (libc.so.6 + 0xfdc0f)
#1 0x0000561db0fa99b2 n/a (gamescope + 0x9c9b2)
#2 0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#3 0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
#4 0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)
Stack trace of thread 3362:
#0 0x00007f9d558e59e5 clock_nanosleep (libc.so.6 + 0xcf9e5)
#1 0x00007f9d558ea5e7 __nanosleep (libc.so.6 + 0xd45e7)
#2 0x0000561db0f85037 n/a (gamescope + 0x78037)
#3 0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#4 0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
#5 0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)
Stack trace of thread 3358:
#0 0x00007f9d55921266 epoll_wait (libc.so.6 + 0x10b266)
#1 0x00007f9d48148579 n/a (libspa-support.so + 0x13579)
#2 0x00007f9d4813bbe3 n/a (libspa-support.so + 0x6be3)
#3 0x00007f9d55eb026f n/a (libpipewire-0.3.so.0 + 0x4126f)
#4 0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
#5 0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)
Stack trace of thread 3329:
#0 0x00007f9d55899f0e n/a (libc.so.6 + 0x83f0e)
#1 0x00007f9d5589c7a0 pthread_cond_wait (libc.so.6 + 0x867a0)
#2 0x00007f9d5425c89e n/a (libvulkan_radeon.so + 0x25c89e)
#3 0x00007f9d54239e0c n/a (libvulkan_radeon.so + 0x239e0c)
#4 0x00007f9d5425c7cc n/a (libvulkan_radeon.so + 0x25c7cc)
#5 0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
#6 0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)
Stack trace of thread 3361:
#0 0x00007f9d5590f900 __open64 (libc.so.6 + 0xf9900)
#1 0x0000561db0f5dbe5 n/a (gamescope + 0x50be5)
#2 0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#3 0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
#4 0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)
ELF object binary architecture: AMD x86-64
Dec 00 00:49:35 steamdeck systemd[1]: [email protected]: Deactivated successfully.
I'll retrieve the dumps and provide them.
Steps for reproducing this issue:
- Play a game for a while on 3.5.11
- See the driver crash at some point with the image freezing on the screen
- The screen goes black after a while and the frozen image returns
- Gamescope recovers somewhat with the steam menu being visible under the frozen image
- Crashes again to a black screen
- The gamescope session restarts after a timeout
These are the gamescope minidumps:
- 1702672557-gamescope-xwm-942-None.dmp
- 1702673110-gamescope-xwm-3848-None.dmp
- 1702676974-gamescope-xwm-3325-None.dmp
- 1702586226-gamescope-xwm-1007-None.dmp
Might be related:
https://gitlab.freedesktop.org/drm/amd/-/issues/2220
Long-standing power management issue in AMD video drivers.
If this problem can be fixed in software, the amdgpu driver will most likely get the fix.
The logs show some sensors reporting temperatures well above what is considered normal. This happened while the Steam Deck was connected to its power brick.
I haven't reproduced these failures with the device disconnected from the power brick. To be fair, I haven't had much time since reporting the bug to play on the Steam Deck and try to reproduce the crashes. Is it temporary, or is it some kind of permanent damage? I can't say.
If the temperatures are this high when the ambient temperature is just 21-22 °C, I don't want to think about how it'll behave in the summer.
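To put numbers on the warnings, the fan control messages can be summarised from a saved copy of the journal. This is a minimal sketch: `summarize_temp_warnings` and the regular expression are illustrative names, and the pattern assumes the message format matches the `fancontrol.py` lines quoted in this issue.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches fancontrol.py lines such as:
#   Warning: CPU temperature of 94.0 greater than max 90! Setting fan to max speed.
TEMP_WARNING = re.compile(
    r"CPU temperature of (?P<temp>\d+(?:\.\d+)?) greater than max"
)


def summarize_temp_warnings(lines: Iterable[str]) -> Tuple[int, Optional[float]]:
    """Return (number of warnings, highest reported temperature).

    The peak is None when no warning line is found.
    """
    count = 0
    peak: Optional[float] = None
    for line in lines:
        match = TEMP_WARNING.search(line)
        if match:
            temp = float(match["temp"])
            count += 1
            peak = temp if peak is None else max(peak, temp)
    return count, peak
```

Running it over each session separately would show whether the peaks differ between docked and undocked play.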
Given the failures seen in the logs, I wonder how Valve does testing. Are there unit tests, integration tests and end-to-end tests for the software components that run on SteamOS? I ask because I know there was a tool used to test the GPU drivers a while ago. Perhaps similar testing can be done for other components (software or hardware), or is already being done.
Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1862800940
I saw the temperature warning, but to be honest I would not say it's the cause of your crash. The Deck APU should be rated for 105 °C max, so you are still well within the limit. If it were actually overheating, it would just hard shut down to protect itself.
I would blame the driver long before the actual hardware, especially if it happens in a single title and not across the board.
I'll post an update with the logs and the dumps if it happens again, regardless of the game.
@unclejack there is most likely another error earlier in the logs.
The errors you posted currently occur on 3.5 after a GPU reset. Gamescope and fan control need to handle GPU resets better, but those errors are probably unrelated to the root cause of the issue, which is the game or GPU driver submitting a command that hangs the GPU.
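On the fan control side, tolerating a sensor read that fails during a reset might look roughly like the sketch below. This is not the actual `fancontrol.py` code: `read_power_watts`, its parameters, and the retry policy are hypothetical. The division by 1,000,000 mirrors the line shown in the traceback, and the sketch assumes the file holds a value in microwatts.

```python
import time
from typing import Optional


def read_power_watts(path: str, retries: int = 5, delay_s: float = 0.5) -> Optional[float]:
    """Read an integer microwatt value from a sysfs-style file and return watts.

    Returns None if the file stays unreadable after all retries, so the
    caller can reuse its previous reading instead of exiting.
    """
    for attempt in range(retries):
        try:
            with open(path) as f:
                return int(f.read().strip()) / 1_000_000
        except (OSError, ValueError):
            # PermissionError (seen in the traceback) is a subclass of OSError.
            # ValueError covers an empty or malformed read.
            if attempt < retries - 1:
                time.sleep(delay_s)
    return None
```

Returning `None` rather than raising leaves the decision to the control loop, which could keep the last known value while the GPU comes back.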
Check for errors earlier in the log from the amdgpu driver. It will probably have more details.
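One way to locate those earlier errors is to scan an exported journal for the ring timeout lines and look at what precedes each one. This is a minimal sketch, assuming the journal was saved to a text file; `find_ring_timeouts`, `context_before`, and the regular expression are illustrative names based only on the message format quoted in this issue.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple

# Matches kernel lines such as:
#   [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=35171, emitted seq=35175
RING_TIMEOUT = re.compile(
    r"amdgpu_job_timedout.*ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


class RingTimeout(NamedTuple):
    line_no: int   # 1-based position in the input
    ring: str      # e.g. "sdma0"
    signaled: int  # last sequence number the GPU completed
    emitted: int   # last sequence number submitted to the ring


def find_ring_timeouts(lines: Iterable[str]) -> Iterator[RingTimeout]:
    """Yield one record per amdgpu ring timeout found in journal text."""
    for line_no, line in enumerate(lines, start=1):
        match = RING_TIMEOUT.search(line)
        if match:
            yield RingTimeout(
                line_no,
                match["ring"],
                int(match["signaled"]),
                int(match["emitted"]),
            )


def context_before(lines: List[str], line_no: int, count: int = 20) -> List[str]:
    """Return up to `count` lines preceding a 1-based line number."""
    start = max(0, line_no - 1 - count)
    return lines[start:line_no - 1]
```

The kernel messages of the previous boot could be exported with something like `journalctl -k -b -1 > kernel.log` (the file name is arbitrary), then loaded with `open("kernel.log").read().splitlines()` and passed to both functions.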
Do I recall correctly that you also had issues with your unit earlier in the year?
@lostgoat: That was a different unit which has gone through RMA.
The amdgpu driver crashed again yesterday. I'll grab the kernel logs using journalctl, along with the new dumps, and post them here. If there's something else I can do to collect logs or traces of the commands that run on the GPU, I'll give that a shot. I imagine this is a common amdgpu bug, not something specific to the RDNA2 iGPU in the Steam Deck's SoC.
I've found several other such crashes in the logs:
Dec 15 22:44:55 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Dec 15 22:45:09 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Dec 15 22:45:09 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=7573, emitted seq=7575
Dec 15 22:45:09 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Dec 15 22:45:09 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Dec 15 22:45:09 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Dec 15 22:45:09 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
Dec 23 22:31:25 steamdeck fancontrol.py[556]: Warning: CPU temperature of 94.0 greater than max 90! Setting fan to max speed.
Dec 23 22:31:26 steamdeck fancontrol.py[556]: Warning: CPU temperature of 92.8 greater than max 90! Setting fan to max speed.
Dec 23 22:31:27 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:33:15 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.2 greater than max 90! Setting fan to max speed.
Dec 23 22:48:50 steamdeck fancontrol.py[556]: Warning: CPU temperature of 95.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:51 steamdeck fancontrol.py[556]: Warning: CPU temperature of 95.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:51 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:48:52 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:52 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:48:53 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:54 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:55 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:56 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:57 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:58 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:59 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:00 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:01 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:02 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:03 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:04 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:05 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:06 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:07 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:08 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:09 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:10 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:11 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:12 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:13 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:14 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:15 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:16 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:17 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:31 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:32 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:33 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:34 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:35 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:36 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:37 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:38 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:38 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:39 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:40 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:41 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:10 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:11 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:12 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:13 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:14 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:15 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:16 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:17 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:18 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:19 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:20 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:21 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:22 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:23 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:24 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:25 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:26 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:55 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:50:56 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.4 greater than max 90! Setting fan to max speed.
Dec 23 22:55:25 steamdeck fancontrol.py[556]: Warning: CPU temperature of 92.6 greater than max 90! Setting fan to max speed.
Dec 23 22:55:26 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:58:58 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Dec 23 22:58:58 steamdeck kernel: perf: interrupt took too long (2546 > 2500), lowering kernel.perf_event_max_sample_rate to 78300
Dec 23 23:01:12 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.4 greater than max 90! Setting fan to max speed.
Dec 23 23:01:13 steamdeck fancontrol.py[556]: Warning: CPU temperature of 97.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:13 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:01:14 steamdeck fancontrol.py[556]: Warning: CPU temperature of 95.8 greater than max 90! Setting fan to max speed.
Dec 23 23:01:14 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:01:15 steamdeck fancontrol.py[556]: Warning: CPU temperature of 95.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:16 steamdeck fancontrol.py[556]: Warning: CPU temperature of 95.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:17 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.8 greater than max 90! Setting fan to max speed.
Dec 23 23:01:18 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:19 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:20 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:21 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:22 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:23 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:24 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:25 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:26 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:27 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:28 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:29 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:30 steamdeck fancontrol.py[556]: Warning: CPU temperature of 92.2 greater than max 90! Setting fan to max speed.
Dec 23 23:01:31 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:32 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:33 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:34 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:35 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:36 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:37 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:38 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:39 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:40 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:41 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:42 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:43 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:44 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:45 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:46 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:47 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:48 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:49 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:50 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:51 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:52 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:53 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:54 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:55 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:56 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:57 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:58 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:59 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:00 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:01 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:02 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:03 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:04 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:05 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:23 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:24 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:25 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:26 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:27 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:28 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:29 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:30 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:31 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:32 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:33 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:34 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:35 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:36 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:37 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:38 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:39 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:07:13 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=21419, emitted seq=21421
Dec 23 23:07:13 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Dec 23 23:07:13 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Dec 23 23:07:13 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Dec 23 23:07:13 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
Dec 23 23:07:13 steamdeck kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
Dec 23 23:07:13 steamdeck kernel: [drm] PSP is resuming...
Dec 23 23:07:13 steamdeck (udev-worker)[3769]: devcd1: Process 'cat /sys/devices/virtual/devcoredump/devcd1/data > /var/lib/steamos-log-submitter/pending/devcoredump/4556' failed with exit code 1.
Dec 23 23:07:13 steamdeck kernel: [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR
Dec 23 23:07:13 steamdeck fancontrol.py[556]: Traceback (most recent call last):
Dec 23 23:07:13 steamdeck fancontrol.py[556]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 542, in <module>
Dec 23 23:07:13 steamdeck fancontrol.py[556]: controller.loop_control()
Dec 23 23:07:13 steamdeck fancontrol.py[556]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 486, in loop_control
Dec 23 23:07:13 steamdeck fancontrol.py[556]: self.loop_read_sensors()
Dec 23 23:07:13 steamdeck fancontrol.py[556]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 452, in loop_read_sensors
Dec 23 23:07:13 steamdeck fancontrol.py[556]: self.power_sensor.get_avg_value()
Dec 23 23:07:13 steamdeck fancontrol.py[556]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 356, in get_avg_value
Dec 23 23:07:13 steamdeck fancontrol.py[556]: self.values.append(self.get_value())
Dec 23 23:07:13 steamdeck fancontrol.py[556]: ^^^^^^^^^^^^^^^^
Dec 23 23:07:13 steamdeck fancontrol.py[556]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 351, in get_value
Dec 23 23:07:13 steamdeck fancontrol.py[556]: self.value = int(f.read().strip()) / 1000000
Dec 23 23:07:13 steamdeck fancontrol.py[556]: ^^^^^^^^
Dec 23 23:07:13 steamdeck fancontrol.py[556]: PermissionError: [Errno 1] Operation not permitted
Dec 23 23:07:13 steamdeck systemd[1]: jupiter-fan-control.service: Main process exited, code=exited, status=1/FAILURE
Dec 23 23:07:13 steamdeck fancontrol.py[3774]: loaded critical temp from SSD hwmon: 79.85
Dec 23 23:07:13 steamdeck fancontrol.py[3774]: returning fan to EC control loop
Dec 23 23:07:13 steamdeck systemd[1]: jupiter-fan-control.service: Failed with result 'exit-code'.
Dec 23 23:07:13 steamdeck systemd[1]: jupiter-fan-control.service: Consumed 12.569s CPU time.
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
Dec 23 23:07:14 steamdeck kernel: [drm] DMUB hardware initialized: version=0x0300000A
Dec 23 23:07:14 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Dec 23 23:07:14 steamdeck kernel: [drm] kiq ring mec 2 pipe 1 q 0
Dec 23 23:07:14 steamdeck kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Dec 23 23:07:14 steamdeck kernel: [drm] JPEG decode initialized successfully.
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset(1) succeeded!
There are no other messages logged between the first and the last logged message for the last crash from yesterday. I'm not sure what's going on with the battery to cause those errors and why the fan control fails the way it does.
Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1868499713
If I had to guess, the fan service dies because it can no longer poll GPU temperatures once the driver has died. It should be irrelevant. To me the crash looks like the issue linked above, though. It might be worth including the patch provided by Mario in that thread; it should be easy to backport to 6.1 LTS.
These are the two potential patches: https://lore.kernel.org/amd-gfx/[email protected]/T/#u https://gitlab.freedesktop.org/drm/amd/uploads/b77399cdff3f6e7206dba43527804978/0001-drm-amdgpu-adjust-SDMA-timeout.patch
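On the fan control side, the traceback above shows an unhandled PermissionError raised while reading a sensor file in get_value, reached through power_sensor.get_avg_value. The following is a minimal sketch of how such a read could tolerate a sensor file that is temporarily unreadable during a GPU reset. It is not the actual fancontrol.py code: read_microunits, AveragedSensor and the window parameter are hypothetical names used only for illustration.

```python
from typing import List, Optional


def read_microunits(path: str) -> Optional[float]:
    """Read an integer sensor file and scale it from micro-units.

    Hypothetical helper. Returns None instead of raising when the file is
    unreadable (PermissionError and "No such device" are both OSError) or
    when its content is not an integer.
    """
    try:
        with open(path) as f:
            return int(f.read().strip()) / 1_000_000
    except (OSError, ValueError):
        return None


class AveragedSensor:
    """Hypothetical sensor wrapper that keeps a short moving average."""

    def __init__(self, path: str, window: int = 5) -> None:
        self.path = path
        self.window = window
        self.values: List[float] = []

    def get_avg_value(self) -> Optional[float]:
        value = read_microunits(self.path)
        if value is not None:
            # Keep only the most recent `window` readings.
            self.values = (self.values + [value])[-self.window:]
        if not self.values:
            return None  # nothing readable yet
        return sum(self.values) / len(self.values)
```

Whether a fan controller should keep running on stale readings is a design decision. The log above shows the service handing the fan back to the EC control loop when it exits, which is also a safe fallback.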
I see there are multiple patches which are either being prepared or are already committed. Several patches seem to be necessary. There's also a new binary and source kernel package from Valve. I'll give this a few days and maybe try those patches to check if they help.
@unclejack can you elaborate on where you found the updated kernel package and how we can check whether the patches are applied there? Since 3.5.x some games aren't playable anymore because my Steam Deck randomly hard locks, and I fear there might be no update/hotfix from Valve in the next weeks. From my understanding, these patches are only workarounds to mitigate the hard crashes until the underlying problem is found.
You can find the latest kernel source build packages here: https://gitlab.com/evlaV/jupiter-PKGBUILD/-/tree/master/linux-neptune-61?ref_type=heads
Including the patches in the sources with a .patch file name will pull them in before compilation.
More details here: https://wiki.archlinux.org/title/PKGBUILD
@hrvylein: That appears to be the kernel included in the latest 3.5.12 preview update. It probably doesn't make sense to try to build from sources to apply these patches if you don't encounter the crashes.
The 3.5.12 preview update also crashed after a while. This particular crash didn't close the game for some reason. I had to kill it by hand, thus taking down gamescope once again after the initial crash.
The odd thing this time was that the crash didn't occur as quickly as it did last time. It also didn't crash the day before. The logs were the same on the kernel side. There were no new details provided.
added later:
[13586.590300] wlan0: associated
[13587.145112] rtw_8822ce 0000:03:00.0: failed to get tx report from firmware
[13722.466653] wlan0: disconnect from AP <AP 2.4 GHz> for new auth to <AP 5GHz>
[13722.543243] wlan0: authenticate with <AP 5GHz>
[13723.007507] wlan0: send auth to <AP 5GHz> (try 1/3)
[13723.010844] wlan0: authenticated
[13723.012637] wlan0: associate with <AP 5GHz> (try 1/3)
[13723.017413] wlan0: RX ReassocResp from <AP 5GHz> (capab=0x111 status=0 aid=1)
[13723.017773] wlan0: associated
[13723.062862] wlan0: Limiting TX power to 17 (20 - 3) dBm as advertised by <AP 5GHz>
[13729.230559] wlan0: disconnect from AP <AP 5GHz> for new auth to <AP 2.4 GHz>
[13729.393292] wlan0: authenticate with <AP 2.4 GHz>
[13729.393335] wlan0: 80 MHz not supported, disabling VHT
[13729.837556] wlan0: send auth to <AP 2.4 GHz> (try 1/3)
[13729.840968] wlan0: authenticated
[13729.842668] wlan0: associate with <AP 2.4 GHz> (try 1/3)
[13729.847447] wlan0: RX ReassocResp from <AP 2.4 GHz> (capab=0x431 status=0 aid=3)
[13729.847795] wlan0: associated
[13741.520209] wlan0: disconnect from AP <AP 2.4 GHz> for new auth to <AP 5GHz>
[13741.636713] wlan0: authenticate with <AP 5GHz>
[13742.104284] wlan0: send auth to <AP 5GHz> (try 1/3)
[13742.107619] wlan0: authenticated
[13742.109417] wlan0: associate with <AP 5GHz> (try 1/3)
[13742.114130] wlan0: RX ReassocResp from <AP 5GHz> (capab=0x111 status=0 aid=1)
[13742.114454] wlan0: associated
[13742.212310] wlan0: Limiting TX power to 17 (20 - 3) dBm as advertised by <AP 5GHz>
[13748.303767] wlan0: disconnect from AP <AP 5GHz> for new auth to <AP 2.4 GHz>
[13748.410068] wlan0: authenticate with <AP 2.4 GHz>
[13748.410083] wlan0: 80 MHz not supported, disabling VHT
[13748.782948] rtw_8822ce 0000:03:00.0: failed to do dpk calibration
The "failed to do dpk calibration" and "failed to get tx report from firmware" errors don't look good. Please let me know if a new ticket should be opened for that.
The Steam Deck crashed again. It didn't recover this time. It just sat with the game on screen frozen.
@unclejack I haven't had the time to look into the whole kernel thing, and I have never built a Linux kernel before. Setting up the build chain is challenging for sure.
What I don't understand is that most games still run perfectly fine on my Steam Deck, while one crashes randomly using Proton GE 8.25 and one with a native Linux build crashes within the first 10 minutes. Other games (verified for Deck, with Proton settings untouched and nothing added to the startup command) run for hours and days. Do you see the same, or are your crashes evenly distributed with no difference between the games played? I really can't make out whether the device is faulty or whether it's SteamOS.
Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1871302156
If you don't have any wireless issues, they should be irrelevant. These also happen often on my unit, and Realtek didn't acknowledge them in another, unrelated wireless issue, so they might just be a red herring.
Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1872276023
If the crash looks the same as @unclejack's, then yes, it's a random issue that happens when the GPU is temporarily in the GFXOFF state. It should happen with lighter games. The patches above should help then.
In Desktop mode I was offered a Mesa 23.3.0 refresh update (2 packages and 1 Mesa extra update), but I think Game mode uses 23.1.x, if I'm right?
Yeah, that's just the Mesa Flatpak version; it's unrelated to the system version. You can check the system version from the system menu in Game mode.
I'm trying to provoke the crash with a native Linux game in Game mode. Where is the crash log stored when not using proton_log?
@hrvylein: Please keep in mind that I'm not a specialist when it comes to graphics programming, graphics APIs and GPU drivers. This is an explanation. Let's imagine we write a program which we test with some given data. One day another user gives it different input data. The program crashes because we didn't do a thorough job with the code which is supposed to handle all possible valid data.
A game which crashes right away or in 10 minutes in a very deterministic manner is a good thing. This means that there's a bug which can be reproduced and fixes can be tested easily. Some games which crash only after some time may leak memory (memory which they allocate, use, fail to free and they keep allocating more until they crash).
Then there are games which cause a GPU driver crash because the GPU resets. This might be a problem with handling some specific sequence of commands or some other issue in the GPU driver. This last type of issue is much harder to fix and more frustrating. It can be something one encounters every 20-30 minutes or it might be something one encounters once every 2-3 days when playing 1-2 hours every day. It is also very dependent on the game. Some games might run for longer than any human can play them without any issue.
One way to solve such bugs might be to do fuzzing for the graphics APIs. Commands sent to the GPU could be recorded and played back to reproduce the failures. An alternative would be to generate a large number of such valid sequences of commands in an attempt to crash the driver. This has been done already to find CPU bugs and undefined behavior in CPUs.
The different proton versions can carry game specific patches for wine itself, for vkd3d or for dxvk. Some games can also crash on startup due to missing codecs, DLLs or other problems. Once again, deterministic crashes are better than the random ones.
You can retrieve the logs via journalctl for steam and dmesg should give you the kernel's log. The kernel's log will provide details such as the ones I've posted above. This will also be present in journalctl. The kernel logs from journalctl will be mixed with those from many other services.
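Since the kernel lines in journalctl are mixed with output from many other services, a small filter can help pull out only the ring timeouts. This is a rough sketch that assumes the message format seen in the logs posted in this thread; RingTimeout, parse_timeout and find_timeouts are made-up names, and the "pending" difference is only my reading of the two sequence numbers.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches e.g. "*ERROR* ring sdma0 timeout, signaled seq=21419, emitted seq=21421"
TIMEOUT_RE = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


class RingTimeout(NamedTuple):
    ring: str
    signaled: int
    emitted: int

    @property
    def pending(self) -> int:
        # Assumption: emitted minus signaled is the number of jobs that
        # were submitted to the ring but had not completed at the timeout.
        return self.emitted - self.signaled


def parse_timeout(line: str) -> Optional[RingTimeout]:
    """Return the timeout described by one log line, or None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return RingTimeout(m["ring"], int(m["signaled"]), int(m["emitted"]))


def find_timeouts(lines: Iterable[str]) -> List[RingTimeout]:
    """Collect every ring timeout found in an iterable of log lines."""
    return [t for t in map(parse_timeout, lines) if t is not None]
```

The input could be the saved output of journalctl -k or dmesg, read line by line from a file. Grouping the results by ring makes it easy to see whether the timeouts are always on sdma0 or also on gfx_0.0.0.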
@RodoMa92: These logged errors might not be relevant for the wireless issues. I do seem to have issues with Steam thinking it's not connected to the Internet from time to time and one specific game complaining that it's disconnected from the game's servers.
What wireless network hardware do you have at home? I had the same symptoms at random while using an AP with an ath10k AC radio (Archer C7) on OpenWRT. Since I got too annoyed with Realtek/Valve not acting on it or caring enough, I swapped the AP for one with an MT6915 AX radio + OpenWRT for 50 bucks, and I haven't been able to reproduce it since, although my issues have also gone down in frequency since I left SteamOS for a more recent Linux kernel (it was unusable on 6.1 LTS).
Feel free to ping them here: https://bugzilla.kernel.org/show_bug.cgi?id=217782
@RodoMa92: I had a similar setup with C7 v2. What's the other hardware with the mediatek radio? I might also change it to check if that helps. Realtek wifi chips are really poor choices for devices with wifi. The ath11k found in the OLED Deck also has most of its control buried in its binary firmware blob as far as I know. The mediatek mt76 based devices are likely to be the best choice.
It's unfortunate that the wifi is soldered to the mainboard of the Steam Deck. There's absolutely no straightforward way to replace it. It's only possible to replace it with a soldering gun.
Regarding your ONT, perhaps you can put that in bridge mode to get rid of the ONT's wifi and use your router instead for routing duties.
By the way, you can install SteamOS on an external drive using the recovery image on another computer. That should help with testing.
As for the ticket itself and the GPU driver crashes, it seems fuzzing has been done already to some extent for various drivers. The developers and the people who hack on mesa are probably familiar with all the tools used for fuzzing GPU drivers. Perhaps traces could be collected from games running via vkd3d and dxvk to generate extremely long sequences of commands derived from the traces. The tools I've found may already be superseded by other tools: https://github.com/google/graphicsfuzz.
If anyone has recommendations for what traces to collect, please let me know.
@RodoMa92: I had a similar setup with C7 v2. What's the other hardware with the mediatek radio? I might also change it to check if that helps. Realtek wifi chips are really poor choices for devices with wifi. The ath11k found in the OLED Deck also has most of its control buried in its binary firmware blob as far as I know. The mediatek mt76 based devices are likely to be the best choice.
Yeah, Qualcomm is even worse in that regard; the "open source" driver part is basically just a shim over 6 MB of black box stuff. At least with Realtek it's only 250 KB. The other router is a Broadcom AC provided router. I wouldn't have believed that MediaTek, of all hardware manufacturers, would be the best choice as far as open source support goes. Feel free to test the C7 and give feedback to Realtek on your results; maybe they'll act on it. I already contacted them to say that you are the third user with stalls and an Archer C7, so I would guess that the issue is with their firmware.
It's unfortunate that the wifi is soldered to the mainboard of the Steam Deck. There's absolutely no straightforward way to replace it. It's only possible to replace it with a soldering gun.
Yep, otherwise I would already have thrown it into the garbage as of right now.
Regarding your ONT, perhaps you can put that in bridge mode to get rid of the ONT's wifi and use your router instead for routing duties.
I would avoid the double box if I could, but where I live the ISP has a duty to leave you the choice of router, so they have to provide the ONT-to-Ethernet box themselves for free. The problem is that I would also need a POTS-compatible router, which basically does not exist, and then I would still have two separate boxes again.
As for the ticket itself and the GPU driver crashes, it seems fuzzing has been done already to some extent for various drivers. The developers and the people who hack on mesa are probably familiar with all the tools used for fuzzing GPU drivers. Perhaps traces could be collected from games running via vkd3d and dxvk to generate extremely long sequences of commands derived from the traces. The tools I've found may already be superseded by other tools: https://github.com/google/graphicsfuzz.
If anyone has recommendations for what traces to collect, please let me know.
Fuzzing graphics hardware is hard because, given the number of API calls, you can't really do much checking on the actual parameters IIRC, so some crashes/errors are kind of expected if the call itself is malformed. However, it should still recover gracefully and not die completely like it does here.
I use a C7 V2, one correction I would make here is on what wifi chip you think it uses. It uses Atheros for both bands.
I don't use 5 GHz though, and mine has been on OpenWRT firmware for years now. You could try that.
For wireless issues, either create a new thread or if it's the LCD feel free to add them here: https://github.com/ValveSoftware/SteamOS/issues/1119
[ 84.072605] [drm] Failed to add display topology, DTM TA is not initialized.
[ 104.041788] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=5606, emitted seq=5610
[ 104.042370] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[ 104.042919] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[ 104.152539] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[ 104.162714] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 104.163235] [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
[ 104.163314] [drm] PSP is resuming...
[ 104.185505] [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR
[ 105.046797] amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
[ 105.047768] amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
[ 105.057815] [drm] DMUB hardware initialized: version=0x0300000A
[ 105.133882] [drm] Failed to add display topology, DTM TA is not initialized.
[ 105.145944] [drm] kiq ring mec 2 pipe 1 q 0
[ 105.148101] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 105.148400] [drm] JPEG decode initialized successfully.
[ 105.148405] amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 105.148408] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 105.148409] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 105.148410] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 105.148411] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 105.148412] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 105.148413] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 105.148414] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 105.148415] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 105.148416] amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[ 105.148417] amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 105.148418] amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[ 105.148419] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[ 105.148420] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[ 105.148421] amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 105.151616] amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
[ 105.151618] amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
[ 105.151629] amdgpu 0000:04:00.0: amdgpu: GPU reset(1) succeeded!
[ 141.038201] input: Steam Deck as /devices/pci0000:00/0000:00:08.1/0000:04:00.4/usb3/3-3/3-3:1.2/0003:28DE:1205.0003/input/input26
The driver crashed after starting the game from the main Steam Deck screen right after a cold boot. The Steam Deck was stored completely shut down in its case before being turned on. Gamescope crashed and the whole system recovered after a while.
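To put a number on the kernel-side part of the recovery, the dmesg timestamps can be subtracted. The sketch below assumes dmesg-style lines with a bracketed seconds timestamp and the "GPU reset begin!" / "GPU reset(N) succeeded!" messages shown above; reset_durations is a hypothetical name. It only measures the driver reset, not the time gamescope or the session needs to come back.

```python
import re
from typing import Iterable, List

# "[ 104.042919] message" -> timestamp in seconds since boot, then the message
STAMP_RE = re.compile(r"^\[\s*(?P<t>\d+\.\d+)\]\s+(?P<msg>.*)$")
# Requires the "(N)" counter, so "GPU reset succeeded, trying to resume"
# is not mistaken for the end of the reset.
DONE_RE = re.compile(r"GPU reset\(\d+\) succeeded!")


def reset_durations(lines: Iterable[str]) -> List[float]:
    """Return the duration in seconds of each completed GPU reset."""
    durations: List[float] = []
    begin = None
    for line in lines:
        m = STAMP_RE.match(line)
        if m is None:
            continue  # not a dmesg-style line
        t, msg = float(m["t"]), m["msg"]
        if "GPU reset begin!" in msg:
            begin = t
        elif begin is not None and DONE_RE.search(msg):
            durations.append(t - begin)
            begin = None
    return durations
```

For the log above, the reset begins at 104.042919 and is reported as succeeded at 105.151629, so the driver-side reset took roughly 1.1 seconds.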
[ 236.453625] cs35l41 spi-VLV1776:01: DSP1: Firmware version: 3
[ 236.453637] cs35l41 spi-VLV1776:01: DSP1: cs35l41-dsp1-spk-prot.wmfw: Fri 02 Apr 2021 21:03:50 W. Europe Daylight Time
[ 236.453656] cs35l41 spi-VLV1776:00: DSP1: Firmware version: 3
[ 236.453667] cs35l41 spi-VLV1776:00: DSP1: cs35l41-dsp1-spk-prot.wmfw: Fri 02 Apr 2021 21:03:50 W. Europe Daylight Time
[ 236.712319] cs35l41 spi-VLV1776:01: DSP1: Firmware: 400a4 vendor: 0x2 v0.33.0, 2 algorithms
[ 236.712532] cs35l41 spi-VLV1776:00: DSP1: Firmware: 400a4 vendor: 0x2 v0.33.0, 2 algorithms
[ 236.712928] cs35l41 spi-VLV1776:01: DSP1: 0: ID cd v29.53.0 XM@94 YM@e
[ 236.712934] cs35l41 spi-VLV1776:01: DSP1: 1: ID f20b v0.0.1 XM@170 YM@0
[ 236.712941] cs35l41 spi-VLV1776:01: DSP1: Protection: C:\Users\ocanavan\Desktop\cirrusTune_july2021.bin
[ 236.713179] cs35l41 spi-VLV1776:00: DSP1: 0: ID cd v29.53.0 XM@94 YM@e
[ 236.713186] cs35l41 spi-VLV1776:00: DSP1: 1: ID f20b v0.0.1 XM@170 YM@0
[ 236.713192] cs35l41 spi-VLV1776:00: DSP1: Protection: C:\Users\ocanavan\Desktop\cirrusTune_july2021.bin
[ 236.761467] cs35l41 spi-VLV1776:01: DSP1: Legacy support not available
[ 236.763073] cs35l41 spi-VLV1776:00: DSP1: Legacy support not available
[ 237.327711] cs35l41 spi-VLV1776:00: DSP1: Firmware version: 3
[ 237.327721] cs35l41 spi-VLV1776:00: DSP1: cs35l41-dsp1-spk-prot.wmfw: Fri 02 Apr 2021 21:03:50 W. Europe Daylight Time
[ 237.479320] cs35l41 spi-VLV1776:00: DSP1: Firmware: 400a4 vendor: 0x2 v0.33.0, 2 algorithms
[ 237.479675] cs35l41 spi-VLV1776:00: DSP1: 0: ID cd v29.53.0 XM@94 YM@e
[ 237.479682] cs35l41 spi-VLV1776:00: DSP1: 1: ID f20b v0.0.1 XM@170 YM@0
[ 237.479689] cs35l41 spi-VLV1776:00: DSP1: Protection: C:\Users\ocanavan\Desktop\cirrusTune_july2021.bin
[ 237.505821] cs35l41 spi-VLV1776:01: DSP1: Firmware version: 3
[ 237.505831] cs35l41 spi-VLV1776:01: DSP1: cs35l41-dsp1-spk-prot.wmfw: Fri 02 Apr 2021 21:03:50 W. Europe Daylight Time
[ 237.644561] cs35l41 spi-VLV1776:01: DSP1: Firmware: 400a4 vendor: 0x2 v0.33.0, 2 algorithms
[ 237.644810] cs35l41 spi-VLV1776:01: DSP1: 0: ID cd v29.53.0 XM@94 YM@e
[ 237.644815] cs35l41 spi-VLV1776:01: DSP1: 1: ID f20b v0.0.1 XM@170 YM@0
[ 237.644820] cs35l41 spi-VLV1776:01: DSP1: Protection: C:\Users\ocanavan\Desktop\cirrusTune_july2021.bin
[ 242.623372] input: Steam Deck as /devices/pci0000:00/0000:00:08.1/0000:04:00.4/usb3/3-3/3-3:1.2/0003:28DE:1205.0003/input/input27
[ 243.928525] input: Microsoft X-Box 360 pad 0 as /devices/virtual/input/input28
[ 244.493639] systemd-gpt-auto-generator[3331]: EFI loader partition unknown, exiting.
[ 244.493646] systemd-gpt-auto-generator[3331]: (The boot loader did not set EFI variable LoaderDevicePartUUID.)
[ 244.982187] systemd-gpt-auto-generator[3357]: EFI loader partition unknown, exiting.
[ 244.982197] systemd-gpt-auto-generator[3357]: (The boot loader did not set EFI variable LoaderDevicePartUUID.)
[ 246.224499] cs35l41 spi-VLV1776:01: DSP1: Legacy support not available
[ 246.225712] cs35l41 spi-VLV1776:00: DSP1: Legacy support not available
Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1873379305
This looks identical to the AMD freedesktop issue I've reported above. Have you applied the above-mentioned patches? Is it still the same issue then?
Had no crash for some days in a row and all of a sudden I have the crash again. Crash log looks familiar ...
Jan 02 21:35:19 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=4775456, emitted seq=4775458
Jan 02 21:35:19 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process
Jan 02 21:35:19 kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Jan 02 21:35:19 kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Jan 02 21:35:19 kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 02 21:35:19 kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
Jan 02 21:35:19 kernel: [drm] PSP is resuming...
Jan 02 21:35:19 (udev-worker)[17488]: devcd1: Process 'cat /sys/devices/virtual/devcoredump/devcd1/data > /var/lib/steamos-log-submitter/pending/devcoredump/4903' failed with exit code 1.
Jan 02 21:35:19 kernel: [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
Jan 02 21:35:20 kernel: [drm] DMUB hardware initialized: version=0x0300000A
Jan 02 21:35:20 kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Jan 02 21:35:20 kernel: [drm] kiq ring mec 2 pipe 1 q 0
Jan 02 21:35:20 kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Jan 02 21:35:20 kernel: [drm] JPEG decode initialized successfully.
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
Jan 02 21:35:20 kernel: [drm] Skip scheduling IBs!
Jan 02 21:35:20 kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset(2) succeeded!
I would really like to test the kernel patches.
I just got the same crash at random today while playing Terraria (low GPU usage, as expected). On my end, however, the GPU recovered fine, but gamescope died anyway. Log below:
[10229.343630] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=34098, emitted seq=34100
[10229.344393] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[10229.344895] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[10229.436228] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[10229.446588] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[10229.447087] [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
[10229.447236] [drm] PSP is resuming...
[10229.469394] [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR
[10230.155567] amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
[10230.155874] amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
[10230.165827] [drm] DMUB hardware initialized: version=0x0300000A
[10230.244640] [drm] Failed to add display topology, DTM TA is not initialized.
[10230.258284] [drm] kiq ring mec 2 pipe 1 q 0
[10230.260215] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[10230.260666] [drm] JPEG decode initialized successfully.
[10230.260672] amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[10230.260678] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[10230.260682] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[10230.260685] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[10230.260688] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[10230.260691] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[10230.260694] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[10230.260698] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[10230.260701] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[10230.260705] amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[10230.260708] amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[10230.260711] amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[10230.260715] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[10230.260718] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[10230.260722] amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[10230.264045] amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
[10230.264051] amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
[10230.264073] amdgpu 0000:04:00.0: amdgpu: GPU reset(1) succeeded!
[10230.264087] [drm] Skip scheduling IBs!
I'm running the latest kernel, 6.6.8 on Bazzite, so it's not SteamOS dependent.