AMD RDNA4: Ollama crashes when its service is enabled, stalling the boot process and triggering a GPU reset

Open afterglow1284 opened this issue 2 months ago • 47 comments

After boot, GDM freezes completely and allows no user interaction, including cursor movement. I managed to make it work once by switching to a TTY and back, but I cannot reproduce this anymore. I can confirm that this behavior isn't present when using the LTS version of the kernel.

This is the log for GDM.service:

dic 02 08:05:08 computer systemd[1]: Starting GNOME Display Manager...
dic 02 08:05:08 computer systemd[1]: Started GNOME Display Manager.
dic 02 08:06:54 computer gdm-password][7175]: gkr-pam: unable to locate daemon control file
dic 02 08:06:54 computer gdm-password][7175]: gkr-pam: stashed password to try later in open session
dic 02 08:06:54 computer gdm-password][7175]: pam_unix(gdm-password:session): session opened for user username(uid=1000) by username(uid=0)
dic 02 08:06:54 computer gdm-password][7175]: gkr-pam: unlocked login keyring
dic 02 08:06:55 computer gdm[962]: Gdm: Child process -5718 was already dead.
dic 02 12:28:19 computer systemd[1]: Stopping GNOME Display Manager...
dic 02 12:28:19 computer gdm[962]: Gdm: Failed to list cached users: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name is not activatable
dic 02 12:28:20 computer systemd[1]: gdm.service: Main process exited, code=exited, status=1/FAILURE
dic 02 12:28:20 computer systemd[1]: gdm.service: Failed with result 'exit-code'.
dic 02 12:28:20 computer systemd[1]: Stopped GNOME Display Manager.
dic 02 12:28:20 computer systemd[1]: gdm.service: Triggering OnFailure= dependencies.
dic 02 12:28:20 computer systemd[1]: gdm.service: Failed to enqueue OnFailure=plymouth-quit.service job, ignoring: Transaction for plymouth-quit.service/start is destructive (systemd-journal-flush.service has 'stop' job queued, but 'start' is included in transaction).

afterglow1284 avatar Dec 04 '25 08:12 afterglow1284

Can you provide more information about what kind of hardware this is?

ptr1337 avatar Dec 04 '25 08:12 ptr1337

sure, here is the output of inxi -Fxz:

System:
  Kernel: 6.12.60-2-cachyos-lts arch: x86_64 bits: 64 compiler: gcc v: 15.2.1
  Desktop: GNOME v: 49.2 Distro: CachyOS base: Arch Linux
Machine:
  Type: Desktop Mobo: Gigabyte model: B650 EAGLE AX v: x.x
    serial: <superuser required> Firmware: UEFI vendor: American Megatrends LLC.
    v: F35 date: 07/01/2025
CPU:
  Info: 8-core model: AMD Ryzen 7 9700X bits: 64 type: MT MCP arch: Zen 5
    rev: 0 cache: L1: 640 KiB L2: 8 MiB L3: 32 MiB
  Speed (MHz): avg: 3713 min/max: 600/5581 boost: enabled cores: 1: 3713
    2: 3713 3: 3713 4: 3713 5: 3713 6: 3713 7: 3713 8: 3713 9: 3713 10: 3713
    11: 3713 12: 3713 13: 3713 14: 3713 15: 3713 16: 3713 bogomips: 121364
  Flags-basic: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a
    ssse3 svm
Graphics:
  Device-1: Advanced Micro Devices [AMD/ATI] Navi 48 [Radeon RX 9070/9070
    XT/9070 GRE] vendor: Tul / PowerColor Reaper driver: amdgpu v: kernel
    arch: RDNA-4 bus-ID: 03:00.0
  Device-2: Advanced Micro Devices [AMD/ATI] Granite Ridge [Radeon Graphics]
    vendor: Gigabyte driver: amdgpu v: kernel arch: RDNA-2 bus-ID: 12:00.0
    temp: 33.0 C
  Display: wayland server: X.Org v: 24.1.9 with: Xwayland v: 24.1.9
    compositor: gnome-shell driver: X: loaded: amdgpu unloaded: modesetting
    dri: radeonsi gpu: amdgpu resolution: 2560x1440~60Hz
  API: EGL v: 1.5 drivers: radeonsi,swrast platforms:
    active: gbm,wayland,x11,surfaceless,device inactive: N/A
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 25.2.7-cachyos1.2
    glx-v: 1.4 direct-render: yes renderer: AMD Radeon RX 9070 XT (radeonsi
    gfx1201 LLVM 21.1.5 DRM 3.61 6.12.60-2-cachyos-lts)
  API: Vulkan v: 1.4.328 drivers: radv surfaces: N/A devices: 2
  Info: Tools: api: eglinfo, glxinfo, vulkaninfo gpu: lact
    x11: xdpyinfo, xprop, xrandr
Audio:
  Device-1: Advanced Micro Devices [AMD/ATI] Navi 48 HDMI/DP Audio
    driver: snd_hda_intel v: kernel bus-ID: 03:00.1
  Device-2: Advanced Micro Devices [AMD/ATI] Radeon High Definition Audio
    [Rembrandt/Strix] driver: snd_hda_intel v: kernel bus-ID: 12:00.1
  Device-3: Advanced Micro Devices [AMD] Family 17h/19h/1ah HD Audio
    vendor: Gigabyte driver: snd_hda_intel v: kernel bus-ID: 12:00.6
  API: ALSA v: k6.12.60-2-cachyos-lts status: kernel-api
  Server-1: sndiod v: N/A status: off
  Server-2: JACK v: 1.9.22 status: off
  Server-3: PipeWire v: 1.4.9 status: active
Network:
  Device-1: Realtek RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet
    vendor: Gigabyte driver: r8169 v: kernel port: e000 bus-ID: 08:00.0
  IF: enp8s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
  Device-2: Intel Wi-Fi 6E AX210/AX1675 2x2 [Typhoon Peak] driver: iwlwifi
    v: kernel bus-ID: 09:00.0
  IF: wlan0 state: down mac: <filter>
  IF-ID-1: tailscale0 state: unknown speed: -1 duplex: full mac: N/A
  IF-ID-2: tun0 state: unknown speed: 10000 Mbps duplex: full mac: N/A
Bluetooth:
  Device-1: Intel AX210 Bluetooth driver: btusb v: 0.8 type: USB bus-ID: 1-7:3
  Report: btmgmt ID: hci0 rfk-id: 0 state: up address: <filter> bt-v: 5.4
    lmp-v: 13
Drives:
  Local Storage: total: 3.67 TiB used: 1 TiB (27.4%)
  ID-1: /dev/nvme0n1 vendor: Western Digital model: WD BLACK SN850X 1000GB
    size: 931.51 GiB temp: 36.9 C
  ID-2: /dev/sda vendor: Samsung model: SSD 860 EVO 1TB size: 931.51 GiB
  ID-3: /dev/sdb vendor: Seagate model: ST1000DM003-1SB102 size: 931.51 GiB
  ID-4: /dev/sdc vendor: Seagate model: ST1000DM003-1ER162 size: 931.51 GiB
  ID-5: /dev/sdd vendor: SanDisk model: Ultra USB 3.0 size: 28.65 GiB
    type: USB
Partition:
  ID-1: / size: 839.87 GiB used: 278.02 GiB (33.1%) fs: btrfs
    dev: /dev/nvme0n1p2
  ID-2: /boot size: 2 GiB used: 255.3 MiB (12.5%) fs: vfat
    dev: /dev/nvme0n1p1
  ID-3: /home size: 839.87 GiB used: 278.02 GiB (33.1%) fs: btrfs
    dev: /dev/nvme0n1p2
  ID-4: /var/log size: 839.87 GiB used: 278.02 GiB (33.1%) fs: btrfs
    dev: /dev/nvme0n1p2
  ID-5: /var/tmp size: 839.87 GiB used: 278.02 GiB (33.1%) fs: btrfs
    dev: /dev/nvme0n1p2
Swap:
  ID-1: swap-1 type: zram size: 30.46 GiB used: 0 KiB (0.0%) dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 40.6 C mobo: 32.0 C
  Fan Speeds (rpm): N/A
  GPU: device: amdgpu temp: 33.0 C device: amdgpu temp: 36.0 C fan: 0
Info:
  Memory: total: 32 GiB note: est. available: 30.46 GiB used: 4.83 GiB (15.9%)
  Processes: 480 Uptime: 18m Init: systemd
  Packages: 1558 Compilers: clang: 21.1.6 gcc: 15.2.1 Shell: Zsh v: 5.9
    inxi: 3.3.40

afterglow1284 avatar Dec 04 '25 08:12 afterglow1284

I tried waiting for a while, and GDM seems to start working after about a minute. I also checked on an Intel Lunar Lake laptop and cannot replicate the issue there. The GDM log on that machine is:

dic 04 09:49:32 x9-15 systemd[1]: Starting GNOME Display Manager...
dic 04 09:49:32 x9-15 systemd[1]: Started GNOME Display Manager.
dic 04 09:49:50 x9-15 gdm-fingerprint][4180]: pam_unix(gdm-fingerprint:session): session opened for user username(uid=1000) by username(uid=0)
dic 04 09:49:51 x9-15 gdm-fingerprint][4180]: gkr-pam: couldn't unlock the login keyring.
dic 04 09:49:52 x9-15 gdm[809]: Gdm: Child process -3157 was already dead.

The only error is for the login keyring, but it is due to fprintd

afterglow1284 avatar Dec 04 '25 08:12 afterglow1284

dic 02 12:28:20 computer systemd[1]: Stopped GNOME Display Manager.
dic 02 12:28:20 computer systemd[1]: gdm.service: Triggering OnFailure= dependencies.
dic 02 12:28:20 computer systemd[1]: gdm.service: Failed to enqueue OnFailure=plymouth-quit.service job, ignoring: Transaction for plymouth-quit.service/start is destructive (systemd-journal-flush.service has 'stop' job queued, but 'start' is included in transaction).

Seems like it does not come up due to Plymouth.

ptr1337 avatar Dec 04 '25 08:12 ptr1337

Can you provide the full output of sudo cachyos-bugreport.sh when booting with the 6.18 kernel?

ptr1337 avatar Dec 04 '25 08:12 ptr1337

Here is the output of that command: https://paste.cachyos.org/p/7711106.log

afterglow1284 avatar Dec 04 '25 08:12 afterglow1284

Can you please try "linux-cachyos" instead of "linux-cachyos-bore"?

pkgrel should be -3

ptr1337 avatar Dec 04 '25 09:12 ptr1337

I have the same issue: it starts frozen, and after about a minute the screen turns completely black before showing a working GDM. The GDM log is still:

dic 04 10:05:08 titan systemd[1]: Starting GNOME Display Manager...
dic 04 10:05:08 titan systemd[1]: Started GNOME Display Manager.
dic 04 10:05:46 titan gdm[966]: Gdm: Child process -3299 was already dead.
dic 04 10:05:46 titan gdm[966]: Gdm: Child process -3299 was already dead.
dic 04 10:05:52 titan gdm-password][6624]: gkr-pam: unable to locate daemon control file
dic 04 10:05:52 titan gdm-password][6624]: gkr-pam: stashed password to try later in open session
dic 04 10:05:52 titan gdm-password][6624]: pam_unix(gdm-password:session): session opened for user daniele(uid=1000) by daniele(uid=0)
dic 04 10:05:52 titan gdm-password][6624]: gkr-pam: unlocked login keyring
dic 04 10:05:53 titan gdm[966]: Gdm: Child process -5114 was already dead.

and the only thing marked as an error was this line: gkr-pam: unable to locate daemon control file

afterglow1284 avatar Dec 04 '25 09:12 afterglow1284

Would you mind disabling ollama.service for one boot?

ptr1337 avatar Dec 04 '25 09:12 ptr1337

now it seems to be working fine, I will just check with other kernels to be sure

afterglow1284 avatar Dec 04 '25 09:12 afterglow1284

So, I think the following is happening:

  • Boot
  • Plymouth comes up
  • ollama.service starts and crashes the GPU
  • GDM waits for Plymouth to finish, but Plymouth has crashed
  • Waiting until it recovers
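
One way to test the sequence above is to keep the service out of a single boot and then put it back. The following is only a minimal sketch: it assumes the unit is named ollama.service as in this thread, and the helper names (run, skip_next_boot, restore_unit) and the DRY_RUN variable are hypothetical, introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; UNIT defaults to the unit name used in this thread.
UNIT="${UNIT:-ollama.service}"

# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove the boot-time start of the unit. systemctl disable does not
# stop a unit that is already running; it only affects the next boot.
skip_next_boot() { run sudo systemctl disable "$UNIT"; }

# Re-enable the unit once the test boot is done.
restore_unit() { run sudo systemctl enable "$UNIT"; }
```

Typical use would be to call skip_next_boot, reboot, check whether GDM comes up without the freeze, and then call restore_unit. Setting DRY_RUN=1 first shows the commands without changing anything.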

ptr1337 avatar Dec 04 '25 09:12 ptr1337

After checking both linux-cachyos and linux-cachyos-bore it seems like the issue is gone after disabling ollama

afterglow1284 avatar Dec 04 '25 09:12 afterglow1284

it just seems strange that it did not have any issue with the lts kernel

afterglow1284 avatar Dec 04 '25 09:12 afterglow1284

Could be just amdgpu regression in 6.18

ptr1337 avatar Dec 04 '25 09:12 ptr1337

Indeed, I have just tried doing the same on an Intel iGPU with the service enabled and I cannot replicate the issue.

afterglow1284 avatar Dec 04 '25 09:12 afterglow1284

Should I edit the report to make clear that the ollama service seems to be the cause of the issue?

afterglow1284 avatar Dec 04 '25 09:12 afterglow1284

Can you do a little bit more isolation? What else was updated at the same time? Did this start after you upgraded the kernel and linux-firmware at the same time?

What version of ROCm do you have Ollama set up with?

superm1 avatar Dec 04 '25 14:12 superm1

from the pacman logs these are the upgraded packages:

[2025-12-04T09:17:43+0100] [ALPM] upgraded cachyos-ananicy-rules (1:1.1.12-1 -> 1:1.1.15-1)
[2025-12-04T09:17:43+0100] [ALPM] upgraded zlib-ng (2.2.5-1 -> 2.3.2-2)
[2025-12-04T09:17:43+0100] [ALPM] upgraded zlib-ng-compat (2.2.5-1 -> 2.3.2-2)
[2025-12-04T09:17:43+0100] [ALPM] upgraded libpng (1.6.51-1.1 -> 1.6.52-1.1)
[2025-12-04T09:17:43+0100] [ALPM] upgraded ffmpegthumbnailer (2.2.3-4.5 -> 2.2.4-1.1)
[2025-12-04T09:17:43+0100] [ALPM] upgraded go (2:1.25.4-2 -> 2:1.25.5-2)
[2025-12-04T09:17:43+0100] [ALPM] upgraded lib32-zlib-ng (2.2.5-1 -> 2.3.2-2)
[2025-12-04T09:17:43+0100] [ALPM] upgraded lib32-zlib-ng-compat (2.2.5-1 -> 2.3.2-2)
[2025-12-04T09:17:43+0100] [ALPM] upgraded lib32-libpng (1.6.51-1 -> 1.6.52-1)
[2025-12-04T09:17:43+0100] [ALPM] upgraded libxkbcommon (1.13.0-1.1 -> 1.13.1-1.1)
[2025-12-04T09:17:43+0100] [ALPM] upgraded lib32-libxkbcommon (1.13.0-1 -> 1.13.1-1)
[2025-12-04T09:17:43+0100] [ALPM] upgraded lib32-sqlite (3.51.0-1 -> 3.51.1-1)
[2025-12-04T09:17:43+0100] [ALPM] upgraded libxkbcommon-x11 (1.13.0-1.1 -> 1.13.1-1.1)
[2025-12-04T09:17:43+0100] [ALPM] upgraded zix (0.6.2-1.1 -> 0.8.0-1.1)
[2025-12-04T09:17:43+0100] [ALPM] upgraded lilv (0.24.26-1 -> 0.26.2-1)
[2025-12-04T09:17:44+0100] [ALPM] upgraded linux-cachyos-bore (6.17.9-1 -> 6.18.0-1)
[2025-12-04T09:17:45+0100] [ALPM] upgraded linux-cachyos-bore-headers (6.17.9-1 -> 6.18.0-1)
[2025-12-04T09:17:45+0100] [ALPM] upgraded linux-cachyos-lts (6.12.59-2 -> 6.12.60-2)
[2025-12-04T09:17:46+0100] [ALPM] upgraded linux-cachyos-lts-headers (6.12.59-2 -> 6.12.60-2)
[2025-12-04T09:17:46+0100] [ALPM] upgraded qt6-base (6.10.1-1.1 -> 6.10.1-2)
[2025-12-04T09:17:46+0100] [ALPM] upgraded scx-scheds (1.0.18-2 -> 1.0.19-1)
[2025-12-04T09:17:46+0100] [ALPM] upgraded scx-manager (1.15.7-1 -> 1.15.8-1)
[2025-12-04T09:17:46+0100] [ALPM] transaction completed

I am using ollama-rocm from cachyos-extra-znver4 at version 0.13.0-2.1, and I have ROCm version 7.1.0-1 installed.
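
To pull only the kernel and firmware entries out of a pacman log like the one above, a small filter can help. This is a sketch, not an official tool: the function name kernel_upgrades is hypothetical, and it assumes the log lives at pacman's default location, /var/log/pacman.log.

```shell
# Print kernel and linux-firmware upgrade lines from a pacman log.
# Reads /var/log/pacman.log by default; pass another file as $1.
kernel_upgrades() {
    grep -E '\[ALPM\] upgraded linux[^ ]* \(' "${1:-/var/log/pacman.log}"
}
```

Run against the transaction listed above, this would keep the four linux-cachyos lines (bore, bore-headers, lts, lts-headers). No linux-firmware entry appears in that transaction.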

afterglow1284 avatar Dec 04 '25 14:12 afterglow1284

OK, can you roll back the kernel to 6.17.9-1 and ensure it doesn't repro? If that's the case I believe we need a bisect on the kernel tree.
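
If the older package is still in the local pacman cache, it can be located with a helper along these lines. This is a sketch under assumptions: the function name cached_pkg is hypothetical, /var/cache/pacman/pkg is pacman's default cache directory, and whether the 6.17.9-1 file is still present (and which architecture suffix it carries) depends on the system.

```shell
# List cached package files for an exact package name and version.
# $1 = package name, $2 = version, $3 = cache dir (optional).
cached_pkg() {
    name=$1
    version=$2
    dir=${3:-/var/cache/pacman/pkg}
    # The trailing glob covers the architecture suffix; .sig files are excluded.
    ls "$dir"/"$name"-"$version"-*.pkg.tar.zst 2>/dev/null
}
```

For example, cached_pkg linux-cachyos-bore 6.17.9-1 prints the matching file, which can then be installed with sudo pacman -U followed by that path. The matching headers package would need the same treatment.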

superm1 avatar Dec 04 '25 14:12 superm1

I have tried rolling back the kernel with the service enabled and everything works fine. I have also updated it again to verify, and the issue persists.

afterglow1284 avatar Dec 04 '25 14:12 afterglow1284

Would you mind quickly trying the vanilla Arch Linux kernel?

https://archlinux.org/packages/core-testing/x86_64/linux/

Here click the "Download" button and then go into your Downloads directory and do:

cd Downloads
sudo pacman -U linux-6.18.arch1-1-x86_64.pkg.tar.zst

and when booting select the "linux" kernel.

ptr1337 avatar Dec 04 '25 14:12 ptr1337

I can confirm the issue persists with the vanilla kernel from core-testing

afterglow1284 avatar Dec 04 '25 14:12 afterglow1284

Hmm, this would then require bisecting through the kernel tree to find the offending commit.

ptr1337 avatar Dec 05 '25 16:12 ptr1337

I just wanted to chime in and inform you that the issue is not distro specific but also happens on Debian SID with a self-built kernel 6.18.0 when running ollama in the background. SDDM and kwin did not expose any problems but starting games causes GPU hangs and resets. Reverting to 6.17 solves the issue. So yeah, something is off with amdgpu...

gladiac avatar Dec 06 '25 08:12 gladiac

Do you know how to bisect?

ptr1337 avatar Dec 06 '25 10:12 ptr1337

Yes, but mostly in theory. Given the number of commits, this will be quite a time-consuming effort. I will see what I can do in the coming days but I cannot make any promises.

gladiac avatar Dec 06 '25 11:12 gladiac

Thank you! Yes, major kernel releases always contain a huge number of commits :( You could reduce the effort by installing each rc kernel first, e.g. rc1, and seeing if the issue is also there. rc1 has the biggest commit diff relative to 6.17.
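
For anyone attempting the bisect, a starting point could look like the sketch below. It assumes a local clone of the mainline kernel tree with the v6.17 and v6.18 tags available. The function name start_amdgpu_bisect is hypothetical, and limiting the bisect to drivers/gpu/drm/amd is itself an assumption that the regression lives in the amdgpu driver code.

```shell
# Start a bisect between a known-good and a known-bad revision,
# limited to commits that touch the amdgpu driver directory.
# $1 = path to the kernel git tree, $2 = good revision, $3 = bad revision.
start_amdgpu_bisect() {
    repo=$1
    good=$2
    bad=$3
    # git bisect start takes the bad revision first, then the good one.
    git -C "$repo" bisect start "$bad" "$good" -- drivers/gpu/drm/amd
}
```

A call such as start_amdgpu_bisect ~/linux v6.17 v6.18 checks out a commit roughly halfway between the two. After building and booting that kernel, mark the result with git bisect good or git bisect bad and repeat until git names the first bad commit, then run git bisect reset. If an rc kernel such as v6.18-rc1 has already been tested, it can be passed as the good or bad revision to shrink the range.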

ptr1337 avatar Dec 06 '25 11:12 ptr1337

Can you please try with amdgpu.dcdebugmask=0x610 on the kernel command line with the broken kernel?

If this helps, it can narrow down the directory to bisect considerably.
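
Before testing, it is worth confirming that the parameter actually reached the running kernel. A minimal sketch follows: the function name has_kernel_param is hypothetical, and how the parameter gets added depends on the bootloader in use, which is not covered here.

```shell
# Succeed if the exact parameter (name=value) is on the kernel command line.
# Reads /proc/cmdline by default; pass another file as $2.
has_kernel_param() {
    # Split the command line into one parameter per line, then match exactly.
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

After rebooting, has_kernel_param amdgpu.dcdebugmask=0x610 && echo active prints "active" only if the booted kernel received exactly that value.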

superm1 avatar Dec 06 '25 13:12 superm1

Alright, I configured amdgpu.dcdebugmask=0x610 on the kernel command line. These are the errors I get in dmesg when I start a game (in this case Ghost of Tsushima) while ollama is running in the background:

[   66.818006] amdgpu 0000:0c:00.0: amdgpu: Dumping IP State
[   66.822493] amdgpu 0000:0c:00.0: amdgpu: Dumping IP State Completed
[   66.822544] amdgpu 0000:0c:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[   66.822546] amdgpu 0000:0c:00.0: amdgpu: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
[   66.822548] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.0 timeout, signaled seq=1, emitted seq=3
[   66.822552] amdgpu 0000:0c:00.0: amdgpu:  Process GhostOfTsushima pid 4762 thread vkd3d_queue pid 4863
[   66.822556] amdgpu 0000:0c:00.0: amdgpu: Starting comp_1.1.0 ring reset
[   66.822570] amdgpu 0000:0c:00.0: amdgpu: reset compute queue (1:1:0)
[   66.822608] amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:173 vmid:0 pasid:0)
[   66.822614] amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
[   66.822617] amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B5A
[   66.822620] amdgpu 0000:0c:00.0: amdgpu:      Faulty UTCL2 client ID: CPC (0x5)
[   66.822622] amdgpu 0000:0c:00.0: amdgpu:      MORE_FAULTS: 0x0
[   66.822625] amdgpu 0000:0c:00.0: amdgpu:      WALKER_ERROR: 0x5
[   66.822627] amdgpu 0000:0c:00.0: amdgpu:      PERMISSION_FAULTS: 0x5
[   66.822629] amdgpu 0000:0c:00.0: amdgpu:      MAPPING_ERROR: 0x1
[   66.822631] amdgpu 0000:0c:00.0: amdgpu:      RW: 0x1
[   66.822691] amdgpu 0000:0c:00.0: amdgpu: Ring comp_1.1.0 reset succeeded
[   66.822694] amdgpu 0000:0c:00.0: [drm] device wedged, but recovered through reset
[   77.055417] amdgpu 0000:0c:00.0: amdgpu: Dumping IP State
[   77.057190] amdgpu 0000:0c:00.0: amdgpu: Dumping IP State Completed
[   77.057197] amdgpu 0000:0c:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[   77.057199] amdgpu 0000:0c:00.0: amdgpu: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
[   77.057201] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.1 timeout, signaled seq=1, emitted seq=5
[   77.057204] amdgpu 0000:0c:00.0: amdgpu:  Process GhostOfTsushima pid 4762 thread vkd3d_queue pid 4863
[   77.057206] amdgpu 0000:0c:00.0: amdgpu: Starting comp_1.0.1 ring reset
[   77.057216] amdgpu 0000:0c:00.0: amdgpu: reset compute queue (1:0:1)
[   77.057352] amdgpu 0000:0c:00.0: amdgpu: Ring comp_1.0.1 reset succeeded
[   77.057355] amdgpu 0000:0c:00.0: [drm] device wedged, but recovered through reset
[   77.058124] amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:173 vmid:0 pasid:0)
[   77.058129] amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
[   77.058133] amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B5A
[   77.058136] amdgpu 0000:0c:00.0: amdgpu:      Faulty UTCL2 client ID: CPC (0x5)
[   77.058138] amdgpu 0000:0c:00.0: amdgpu:      MORE_FAULTS: 0x0
[   77.058140] amdgpu 0000:0c:00.0: amdgpu:      WALKER_ERROR: 0x5
[   77.058143] amdgpu 0000:0c:00.0: amdgpu:      PERMISSION_FAULTS: 0x5
[   77.058145] amdgpu 0000:0c:00.0: amdgpu:      MAPPING_ERROR: 0x1
[   77.058147] amdgpu 0000:0c:00.0: amdgpu:      RW: 0x1
[   87.293602] amdgpu 0000:0c:00.0: amdgpu: Dumping IP State
[   87.298529] amdgpu 0000:0c:00.0: amdgpu: Dumping IP State Completed
[   87.298538] amdgpu 0000:0c:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[   87.298541] amdgpu 0000:0c:00.0: amdgpu: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
[   87.298543] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.1 timeout, signaled seq=3, emitted seq=5
[   87.298547] amdgpu 0000:0c:00.0: amdgpu:  Process GhostOfTsushima pid 4762 thread vkd3d_queue pid 4863
[   87.298551] amdgpu 0000:0c:00.0: amdgpu: Starting comp_1.0.1 ring reset
[   87.298564] amdgpu 0000:0c:00.0: amdgpu: reset compute queue (1:0:1)
[   87.298667] amdgpu 0000:0c:00.0: amdgpu: Ring comp_1.0.1 reset succeeded
[   87.298670] amdgpu 0000:0c:00.0: [drm] device wedged, but recovered through reset

The devcoredump data is HERE.
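
A log like the one above can be summarized per ring with a short filter. This is a sketch only: the function name ring_timeouts is hypothetical, and it relies on the "ring <name> timeout" wording seen in these amdgpu messages.

```shell
# Count amdgpu ring timeouts per ring from kernel log text on stdin.
ring_timeouts() {
    grep -oE 'ring [^ ]+ timeout' | sort | uniq -c
}
```

It can be fed from sudo dmesg | ring_timeouts or journalctl -k -b | ring_timeouts. For the excerpt above it would report one timeout on comp_1.1.0 and two on comp_1.0.1.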

gladiac avatar Dec 06 '25 15:12 gladiac

Try this patch:

https://lore.kernel.org/amd-gfx/[email protected]/

superm1 avatar Dec 06 '25 16:12 superm1