Programs crashing and system freezes with AMD graphics

Open yagarea opened this issue 5 months ago • 19 comments

I moved my install from Intel i7 + Intel Iris to AMD Ryzen 5 + AMD Radeon by swapping SSDs. I experience these issues:


Symptoms

  • Everything works fine, except that the CPU load randomly spikes to 30+ and the system freezes for a few minutes.
  • Firefox, Brave and all Electron apps crash randomly. I strongly suspect it has something to do with HW acceleration.
  • When booting I see AMDGPU securedisplay: generic failure in the log, and this is in journalctl:
Dec 18 02:26:00 archlinux kernel: amdgpu 0000:04:00.0: amdgpu: Secure display: Generic Failure.
Dec 18 02:26:00 archlinux kernel: amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
  • Sway randomly crashes
  • Discord crashes with this message:
[109734:1216/144012.453719:ERROR:shared_image_factory.cc(575)] Could not find SharedImageBackingFactory with params: usage: Gles2|Raster|DisplayRead|Scanout, format: BGRA_8888, share_between_threads: 0, gmb_type: shared_memory
[109734:1216/144018.594445:ERROR:shared_image_factory.cc(575)] Could not find SharedImageBackingFactory with params: usage: Gles2|Raster|DisplayRead|Scanout, format: BGRA_8888, share_between_threads: 0, gmb_type: shared_memory
[109734:1216/144025.341187:ERROR:shared_image_factory.cc(575)] Could not find SharedImageBackingFactory with params: usage: Gles2|Raster|DisplayRead|Scanout, format: BGRA_8888, share_between_threads: 0, gmb_type: shared_memory
[109688:1216/144035.692207:ERROR:gpu_process_host.cc(991)] GPU process exited unexpectedly: exit_code=9
(electron) 'gpu-process-crashed event' is deprecated and will be removed. Please use 'child-process-gone event' instead.
notificationScreen.webContentsSend: win is invalid undefined.
child-process-gone! child: GPU (undefined) exitCode: 9
blackbox: 2023-12-16T13:40:35.738Z 59 before-quit
blackbox: 2023-12-16T13:40:35.763Z 60 window.close win8
blackbox: 2023-12-16T13:40:35.772Z 61 ❌ child-process-gone { type: 'GPU', reason: 'killed', exitCode: 9, serviceName: 'GPU' }
blackbox: 2023-12-16T13:40:36.319Z 62 webContents.destroyed web8
  • This is the Brave browser log:
[158080:158080:1218/173945.959830:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 1 times!
[158080:158080:1218/173945.966721:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 2 times!
[158080:158080:1218/173950.934755:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 3 times!
Warning: terminator_CreateInstance: Failed to CreateInstance in ICD 0.  Skipping ICD.
Warning: terminator_CreateInstance: Found no drivers!
Warning: vkCreateInstance failed with VK_ERROR_INCOMPATIBLE_DRIVER
    at CheckVkSuccessImpl (../../third_party/dawn/src/dawn/native/vulkan/VulkanError.cpp:101)
    at CreateVkInstance (../../third_party/dawn/src/dawn/native/vulkan/BackendVk.cpp:493)
    at Initialize (../../third_party/dawn/src/dawn/native/vulkan/BackendVk.cpp:379)
    at Create (../../third_party/dawn/src/dawn/native/vulkan/BackendVk.cpp:301)
    at operator() (../../third_party/dawn/src/dawn/native/vulkan/BackendVk.cpp:556)
  • After one severe freeze I found this in journalctl:
Dec 23 22:19:05 arch-thinkpad kernel: Out of memory: Killed process 2068 (electron) total-vm:1190296860kB, anon-rss:175928kB, file-rss:256kB, shmem-rss:600kB, UID:1000 pgtables:1752kB oom_score_adj:300
Dec 23 22:21:22 arch-thinkpad kernel: Out of memory: Killed process 10923 (brave) total-vm:1186204220kB, anon-rss:51484kB, file-rss:512kB, shmem-rss:3648kB, UID:1000 pgtables:928kB oom_score_adj:300
Dec 23 22:21:33 arch-thinkpad kernel: Out of memory: Killed process 8797 (brave) total-vm:1186202212kB, anon-rss:51528kB, file-rss:256kB, shmem-rss:1084kB, UID:1000 pgtables:1008kB oom_score_adj:300
Dec 23 22:21:34 arch-thinkpad kernel: Out of memory: Killed process 8753 (brave) total-vm:1186222336kB, anon-rss:50016kB, file-rss:384kB, shmem-rss:1252kB, UID:1000 pgtables:1016kB oom_score_adj:300
  • Firefox very often crashes with this log:
[Parent 2974, IPC I/O Parent] WARNING: process 3081 exited on signal 11: file /build/firefox/src/firefox-121.0/ipc/chromium/src/base/process_util_posix.cc:265
[Parent 2974, IPC I/O Parent] WARNING: process 3917 exited on signal 11: file /build/firefox/src/firefox-121.0/ipc/chromium/src/base/process_util_posix.cc:265
[Parent 2974, IPC I/O Parent] WARNING: process 3958 exited on signal 11: file /build/firefox/src/firefox-121.0/ipc/chromium/src/base/process_util_posix.cc:265
[Parent 2974, IPC I/O Parent] WARNING: process 3633 exited on signal 11: file /build/firefox/src/firefox-121.0/ipc/chromium/src/base/process_util_posix.cc:265
ExceptionHandler::GenerateDump cloned child 7802
ExceptionHandler::SendContinueSignalToChild sent continue signal to child
ExceptionHandler::WaitForContinueSignal waiting for continue signal...
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Failed to open curl lib from binary, use libcurl.so instead
ExceptionHandler::GenerateDump cloned child 10410
ExceptionHandler::WaitForContinueSignal waiting for continue signal...
ExceptionHandler::SendContinueSignalToChild sent continue signal to child
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Failed to open curl lib from binary, use libcurl.so instead

Chromium straight up crashes immediately on launch:

[32424:32424:0109/161137.203058:ERROR:policy_logger.cc(156)] :components/enterprise/browser/controller/chrome_browser_cloud_management_controller.cc(161) Cloud management controller initialization aborted as CBCM is not enabled. Please use the `--enable-chrome-browser-cloud-management` command line flag to enable it if you are not using the official Google Chrome build.
[32500:1:0109/161137.867689:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
[32424:32424:0109/161137.868183:ERROR:gpu_process_host.cc(992)] GPU process exited unexpectedly: exit_code=139
[32500:1:0109/161138.480418:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
[32573:1:0109/161138.480549:ERROR:command_buffer_proxy_impl.cc(319)] GPU state invalid after WaitForGetOffsetInRange.
[32583:1:0109/161138.481906:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
[32424:32452:0109/161138.483240:ERROR:shared_image_interface_proxy.cc(136)] Buffer handle is null. Not creating a mailbox from it.
[32424:32452:0109/161138.483260:ERROR:one_copy_raster_buffer_provider.cc(365)] Creation of MappableSharedImage failed.
[32424:32452:0109/161138.483313:ERROR:shared_image_interface_proxy.cc(136)] Buffer handle is null. Not creating a mailbox from it.
[32424:32424:0109/161138.484013:ERROR:gpu_process_host.cc(992)] GPU process exited unexpectedly: exit_code=139
[32543:1:0109/161139.231084:ERROR:command_buffer_proxy_impl.cc(319)] GPU state invalid after WaitForGetOffsetInRange.
[32573:1:0109/161139.231223:ERROR:command_buffer_proxy_impl.cc(319)] GPU state invalid after WaitForGetOffsetInRange.
[32424:32452:0109/161139.232795:ERROR:shared_image_interface_proxy.cc(136)] Buffer handle is null. Not creating a mailbox from it.
[32424:32452:0109/161139.232810:ERROR:one_copy_raster_buffer_provider.cc(365)] Creation of MappableSharedImage failed.
[32424:32424:0109/161139.232827:ERROR:gpu_process_host.cc(992)] GPU process exited unexpectedly: exit_code=139
[32714:32714:0109/161139.240747:ERROR:angle_platform_impl.cc(44)] Display.cpp:1052 (initialize): ANGLE Display::initialize error 0: Internal Vulkan error (-3): Initialization of an object could not be completed for implementation-specific reasons, in ../../third_party/angle/src/libANGLE/renderer/vulkan/RendererVk.cpp, initialize:1711.
ERR: Display.cpp:1052 (initialize): ANGLE Display::initialize error 0: Internal Vulkan error (-3): Initialization of an object could not be completed for implementation-specific reasons, in ../../third_party/angle/src/libANGLE/renderer/vulkan/RendererVk.cpp, initialize:1711.
[32714:32714:0109/161139.240832:ERROR:gl_display.cc(515)] EGL Driver message (Critical) eglInitialize: Internal Vulkan error (-3): Initialization of an object could not be completed for implementation-specific reasons, in ../../third_party/angle/src/libANGLE/renderer/vulkan/RendererVk.cpp, initialize:1711.
[32714:32714:0109/161139.240859:ERROR:gl_display.cc(786)] eglInitialize SwANGLE failed with error EGL_NOT_INITIALIZED
[32714:32714:0109/161139.240894:ERROR:gl_display.cc(820)] Initialization of all EGL display types failed.
[32714:32714:0109/161139.240917:ERROR:gl_ozone_egl.cc(26)] GLDisplayEGL::Initialize failed.
[32714:32714:0109/161139.241169:ERROR:angle_platform_impl.cc(44)] Display.cpp:1052 (initialize): ANGLE Display::initialize error 0: Internal Vulkan error (-3): Initialization of an object could not be completed for implementation-specific reasons, in ../../third_party/angle/src/libANGLE/renderer/vulkan/RendererVk.cpp, initialize:1711.
ERR: Display.cpp:1052 (initialize): ANGLE Display::initialize error 0: Internal Vulkan error (-3): Initialization of an object could not be completed for implementation-specific reasons, in ../../third_party/angle/src/libANGLE/renderer/vulkan/RendererVk.cpp, initialize:1711.
[32714:32714:0109/161139.241204:ERROR:gl_display.cc(515)] EGL Driver message (Critical) eglInitialize: Internal Vulkan error (-3): Initialization of an object could not be completed for implementation-specific reasons, in ../../third_party/angle/src/libANGLE/renderer/vulkan/RendererVk.cpp, initialize:1711.
[32714:32714:0109/161139.241228:ERROR:gl_display.cc(786)] eglInitialize SwANGLE failed with error EGL_NOT_INITIALIZED
[32714:32714:0109/161139.241250:ERROR:gl_display.cc(820)] Initialization of all EGL display types failed.
[32714:32714:0109/161139.241277:ERROR:gl_ozone_egl.cc(26)] GLDisplayEGL::Initialize failed.
[32714:32714:0109/161139.242947:ERROR:viz_main_impl.cc(196)] Exiting GPU process due to errors during initialization
[32424:32424:0109/161139.654097:ERROR:gpu_process_host.cc(992)] GPU process exited unexpectedly: exit_code=139
[32424:32424:0109/161140.509961:FATAL:gpu_data_manager_impl_private.cc(448)] GPU process isn't usable. Goodbye.
fish: Job 1, 'chromium 2> chromium-log' terminated by signal SIGTRAP (Trace or breakpoint trap)
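
For reading the exit codes in the logs above, here is a minimal sketch. It assumes the shell-style convention that a status above 128 means "terminated by signal (status - 128)". Whether Chromium's exit_code field follows exactly this convention is an assumption, and decode_exit_status is a name made up for illustration.

```python
import signal


def decode_exit_status(status):
    """Return the signal name for a shell-style status (128 + N), else None."""
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        # Not a signal number known on this platform.
        return None


# exit_code=139 from the Chromium log: 139 - 128 = 11
print(decode_exit_status(139))
```

On Linux this prints SIGSEGV, which lines up with the "exited on signal 11" warnings in the Firefox log.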

More information

Specs:

  • I use an HP laptop with Arch Linux + Wayland + sway version 1.8.1
  • CPU: AMD Ryzen 5 4500U with Radeon Graphics (6) @ 2.375GHz
  • GPU: AMD ATI Radeon RX Vega 6
  • I have 8 GB of RAM and no swap

This is the list of installed packages containing amd:

local/amd-ucode 20231211.f2e52a1c-1
    Microcode update image for AMD CPUs
local/amf-headers 1.4.32-1
    Header files for AMD Advanced Media Framework
local/libteam 1.32-1
    Library for controlling team network device
local/nvtop 3.0.2-1
    GPUs process monitoring for AMD, Intel and NVIDIA
local/rocm-dbgapi 5.7.1-1
    Support library necessary for a debugger of AMD's GPUs

This is the list of installed packages containing radeon:

local/hsakmt-roct 5.7.1-1
    Radeon Open Compute Thunk Interface
local/lib32-vulkan-radeon 1:23.3.1-1
    Radeon's Vulkan mesa driver (32-bit)
local/vulkan-radeon 1:23.3.1-1
    Radeon's Vulkan mesa driver

This is the output of lspci -nn:

00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex [1022:1630]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU [1022:1631]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus [1022:1635]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 51)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 0 [1022:1448]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 1 [1022:1449]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 2 [1022:144a]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 3 [1022:144b]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 4 [1022:144c]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 5 [1022:144d]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 6 [1022:144e]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 7 [1022:144f]
01:00.0 Network controller [0280]: Realtek Semiconductor Co., Ltd. RTL8822CE 802.11ac PCIe Wireless Network Adapter [10ec:c822]
02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader [10ec:522a] (rev 01)
03:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a80a]
04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir [Radeon RX Vega 6 (Ryzen 4000/5000 Mobile Series)] [1002:1636] (rev c3)
04:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller [1002:1637]
04:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
04:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
04:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
04:00.5 Multimedia controller [0480]: Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor [1022:15e2] (rev 01)
04:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller [1022:15e3]
04:00.7 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub [1022:15e4]

Output of lspci -k | grep -A 3 -E "(VGA|3D)":

04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Renoir [Radeon RX Vega 6 (Ryzen 4000/5000 Mobile Series)] (rev c3)
	DeviceName: AMD Radeon(TM) Graphics
	Subsystem: Hewlett-Packard Company Renoir [Radeon RX Vega 6 (Ryzen 4000/5000 Mobile Series)]
	Kernel driver in use: amdgpu

I have found this in dmesg log:

[    2.906817] [drm] amdgpu kernel modesetting enabled.
[    2.907464] amdgpu: Virtual CRAT table created for CPU
[    2.907479] amdgpu: Topology: Add CPU node
[    2.907685] amdgpu 0000:04:00.0: enabling device (0006 -> 0007)
[    2.907739] [drm] initializing kernel modesetting (RENOIR 0x1002:0x1636 0x103C:0x876E 0xC3).
[    2.934103] [drm] register mmio base: 0xD0400000
[    2.934114] [drm] register mmio size: 524288
[    2.937640] [drm] add ip block number 0 <soc15_common>
[    2.937644] [drm] add ip block number 1 <gmc_v9_0>
[    2.937646] [drm] add ip block number 2 <vega10_ih>
[    2.937647] [drm] add ip block number 3 <psp>
[    2.937649] [drm] add ip block number 4 <smu>
[    2.937651] [drm] add ip block number 5 <dm>
[    2.937652] [drm] add ip block number 6 <gfx_v9_0>
[    2.937654] [drm] add ip block number 7 <sdma_v4_0>
[    2.937655] [drm] add ip block number 8 <vcn_v2_0>
[    2.937656] [drm] add ip block number 9 <jpeg_v2_0>
[    2.937681] amdgpu 0000:04:00.0: amdgpu: Fetched VBIOS from VFCT
[    2.937684] amdgpu: ATOM BIOS: 113-RENOIR-026
[    2.939454] [drm] VCN decode is enabled in VM mode
[    2.939456] [drm] VCN encode is enabled in VM mode
[    2.940642] [drm] JPEG decode is enabled in VM mode
[    2.954319] Console: switching to colour dummy device 80x25
[    2.988195] amdgpu 0000:04:00.0: vgaarb: deactivate vga console
[    2.988210] amdgpu 0000:04:00.0: amdgpu: Trusted Memory Zone (TMZ) feature enabled
[    2.988218] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[    2.988323] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[    2.988337] amdgpu 0000:04:00.0: amdgpu: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
[    2.988342] amdgpu 0000:04:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
[    2.988345] amdgpu 0000:04:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[    2.988357] [drm] Detected VRAM RAM=512M, BAR=512M
[    2.988360] [drm] RAM width 128bits DDR4
[    2.988686] [drm] amdgpu: 512M of VRAM memory ready
[    2.988692] [drm] amdgpu: 3661M of GTT memory ready.
[    2.988726] [drm] GART: num cpu pages 262144, num gpu pages 262144
[    2.988934] [drm] PCIE GART of 1024M enabled.
[    2.988936] [drm] PTB located at 0x000000F41FC00000
[    2.989313] [drm] Loading DMUB firmware via PSP: version=0x01010028
[    2.989963] [drm] Found VCN firmware Version ENC: 1.21 DEC: 6 VEP: 0 Revision: 0
[    2.989970] amdgpu 0000:04:00.0: amdgpu: Will use PSP to load VCN firmware
[    3.649478] [drm] reserve 0x400000 from 0xf41f800000 for PSP TMR
[    3.744619] amdgpu 0000:04:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    3.754153] amdgpu 0000:04:00.0: amdgpu: RAP: optional rap ta ucode is not available
[    3.759165] [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
[    3.759289] [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
[    3.759293] amdgpu 0000:04:00.0: amdgpu: Secure display: Generic Failure.
[    3.759305] amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
[    3.759720] amdgpu 0000:04:00.0: amdgpu: SMU is initialized successfully!
[    3.760994] [drm] Display Core v3.2.247 initialized on DCN 2.1
[    3.760998] [drm] DP-HDMI FRL PCON supported
[    3.761775] [drm] DMUB hardware initialized: version=0x01010028
[    4.104152] [drm] Alt mode has timed out after 202 ms
[    4.106330] [drm] kiq ring mec 2 pipe 1 q 0
[    4.109571] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[    4.109599] [drm] JPEG decode initialized successfully.
[    4.136633] amdgpu: HMM registered 512MB device memory
[    4.138946] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    4.138968] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[    4.139127] amdgpu: Virtual CRAT table created for GPU
[    4.139236] amdgpu: Topology: Add dGPU node [0x1636:0x1002]
[    4.139239] kfd kfd: amdgpu: added device 1002:1636
[    4.139253] amdgpu 0000:04:00.0: amdgpu: SE 1, SH per SE 1, CU per SH 8, active_cu_number 6
[    4.139421] amdgpu 0000:04:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[    4.139424] amdgpu 0000:04:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
[    4.139425] amdgpu 0000:04:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
[    4.139427] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
[    4.139428] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
[    4.139430] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
[    4.139431] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
[    4.139432] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
[    4.139434] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
[    4.139435] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
[    4.139437] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
[    4.139438] amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 13 on hub 0
[    4.139439] amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
[    4.139441] amdgpu 0000:04:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 8
[    4.139443] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 8
[    4.139444] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 8
[    4.139445] amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 8
[    4.141078] [drm] Initialized amdgpu 3.54.0 20150101 for 0000:04:00.0 on minor 1
[    4.147509] fbcon: amdgpudrmfb (fb0) is primary device
[    4.147823] [drm] DSC precompute is not needed.
[    4.854418] Console: switching to colour frame buffer device 240x67
[    4.874494] amdgpu 0000:04:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[   11.871434] Key type trusted registered
[   11.886779] Key type encrypted registered

...

[ 3001.667474] ThreadPoolForeg[7212]: segfault at 3fe173700008 ip 00005632b4aca97b sp 00007f667e7efdc0 error 4 in electron[5632b332f000+7732000] likely on CPU 0 (core 0, socket 0)
[ 3001.667500] Code: 06 48 03 08 5d e9 25 0c 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 18 49 89 cd 49 81 e5 00 00 fc ff <41> f6 45 08 40 0f 85 a1 00 00 00 49 89 cf 49 89 f6 80 7f 3a 00 75
[ 4538.486544] electron[8432]: segfault at 80000000008 ip 00005578c8450e12 sp 00007fff90612410 error 4 in electron[5578c26d9000+7aaf000] likely on CPU 4 (core 5, socket 0)
[ 4538.486557] Code: f0 0f b1 0f 75 f2 eb 06 67 e8 7a 7d dc fe 48 8b bb b0 01 00 00 48 85 ff 0f 85 8e 01 00 00 48 8b bb a8 01 00 00 48 85 ff 74 24 <8b> 47 08 a8 02 75 1d 8b 07 0f 1f 44 00 00 83 f8 01 74 0b 8d 48 ff
[ 4595.501146] Isolated Web Co[10781]: segfault at 7702dd668c80 ip 00007f02e972e336 sp 00007ffc2f669840 error 4 in libxul.so[7f02e8f3a000+5f91000] likely on CPU 4 (core 5, socket 0)
[ 4595.501158] Code: 24 20 48 8b b4 24 a0 00 00 00 48 85 f6 0f 84 2c 01 00 00 8b 84 24 a8 00 00 00 48 3d ff ff ff 7f 0f 83 4e 0b 00 00 48 8b 4d 00 <8b> 11 0f ba e2 1e 0f 83 67 01 00 00 8b 79 08 48 29 f9 4c 89 4c 24
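
To see at a glance which binaries the segfault lines above point at, they can be tallied per object. This is a rough sketch: the regular expression is written against the three segfault lines in this excerpt only, so it is an assumption about the general dmesg format, and count_segfaults is a hypothetical helper.

```python
import re
from collections import Counter

# "[ time] comm[pid]: segfault at ADDR ... in OBJECT[base+size] ..."
SEGFAULT_RE = re.compile(
    r"^\[\s*[\d.]+\]\s+(?P<comm>.+?)\[(?P<pid>\d+)\]: "
    r"segfault at (?P<addr>[0-9a-f]+) .* in (?P<obj>\S+?)\["
)


def count_segfaults(dmesg_text):
    """Count segfault lines per faulting object (binary or shared library)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = SEGFAULT_RE.match(line)
        if match:
            counts[match.group("obj")] += 1
    return counts
```

Fed with saved dmesg output, it returns a Counter keyed by object name; the "Code:" lines are ignored because they contain no "segfault at" marker.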

This is the output of eglinfo -B:

GBM platform:
eglinfo: eglInitialize failed

Wayland platform:
EGL API version: 1.5
EGL vendor string: Mesa Project
EGL version string: 1.5
EGL client APIs: OpenGL OpenGL_ES
OpenGL core profile vendor: AMD
OpenGL core profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL core profile version: 4.6 (Core Profile) Mesa 23.3.2-arch1.2
OpenGL core profile shading language version: 4.60
OpenGL compatibility profile vendor: AMD
OpenGL compatibility profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL compatibility profile version: 4.6 (Compatibility Profile) Mesa 23.3.2-arch1.2
OpenGL compatibility profile shading language version: 4.60
OpenGL ES profile vendor: AMD
OpenGL ES profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL ES profile version: OpenGL ES 3.2 Mesa 23.3.2-arch1.2
OpenGL ES profile shading language version: OpenGL ES GLSL ES 3.20

X11 platform:
EGL API version: 1.5
EGL vendor string: Mesa Project
EGL version string: 1.5
EGL client APIs: OpenGL OpenGL_ES
OpenGL core profile vendor: AMD
OpenGL core profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL core profile version: 4.6 (Core Profile) Mesa 23.3.2-arch1.2
OpenGL core profile shading language version: 4.60
OpenGL compatibility profile vendor: AMD
OpenGL compatibility profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL compatibility profile version: 4.6 (Compatibility Profile) Mesa 23.3.2-arch1.2
OpenGL compatibility profile shading language version: 4.60
OpenGL ES profile vendor: AMD
OpenGL ES profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL ES profile version: OpenGL ES 3.2 Mesa 23.3.2-arch1.2
OpenGL ES profile shading language version: OpenGL ES GLSL ES 3.20

Surfaceless platform:
EGL API version: 1.5
EGL vendor string: Mesa Project
EGL version string: 1.5
EGL client APIs: OpenGL OpenGL_ES
OpenGL core profile vendor: AMD
OpenGL core profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL core profile version: 4.6 (Core Profile) Mesa 23.3.2-arch1.2
OpenGL core profile shading language version: 4.60
OpenGL compatibility profile vendor: AMD
OpenGL compatibility profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL compatibility profile version: 4.6 (Compatibility Profile) Mesa 23.3.2-arch1.2
OpenGL compatibility profile shading language version: 4.60
OpenGL ES profile vendor: AMD
OpenGL ES profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL ES profile version: OpenGL ES 3.2 Mesa 23.3.2-arch1.2
OpenGL ES profile shading language version: OpenGL ES GLSL ES 3.20

Device platform:
Device #0:

Platform Device platform:
EGL API version: 1.5
EGL vendor string: Mesa Project
EGL version string: 1.5
EGL client APIs: OpenGL OpenGL_ES
OpenGL core profile vendor: AMD
OpenGL core profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL core profile version: 4.6 (Core Profile) Mesa 23.3.2-arch1.2
OpenGL core profile shading language version: 4.60
OpenGL compatibility profile vendor: AMD
OpenGL compatibility profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL compatibility profile version: 4.6 (Compatibility Profile) Mesa 23.3.2-arch1.2
OpenGL compatibility profile shading language version: 4.60
OpenGL ES profile vendor: AMD
OpenGL ES profile renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 16.0.6, DRM 3.54, 6.6.9-arch1-1)
OpenGL ES profile version: OpenGL ES 3.2 Mesa 23.3.2-arch1.2
OpenGL ES profile shading language version: OpenGL ES GLSL ES 3.20

Device #1:

Platform Device platform:
EGL API version: 1.5
EGL vendor string: Mesa Project
EGL version string: 1.5
EGL client APIs: OpenGL OpenGL_ES
OpenGL core profile vendor: Mesa
OpenGL core profile renderer: llvmpipe (LLVM 16.0.6, 256 bits)
OpenGL core profile version: 4.5 (Core Profile) Mesa 23.3.2-arch1.2
OpenGL core profile shading language version: 4.50
OpenGL compatibility profile vendor: Mesa
OpenGL compatibility profile renderer: llvmpipe (LLVM 16.0.6, 256 bits)
OpenGL compatibility profile version: 4.5 (Compatibility Profile) Mesa 23.3.2-arch1.2
OpenGL compatibility profile shading language version: 4.50
OpenGL ES profile vendor: Mesa
OpenGL ES profile renderer: llvmpipe (LLVM 16.0.6, 256 bits)
OpenGL ES profile version: OpenGL ES 3.2 Mesa 23.3.2-arch1.2
OpenGL ES profile shading language version: OpenGL ES GLSL ES 3.20

These are my kernel parameters:

Kernel command line: initrd=\amd-ucode.img initrd=\initramfs-linux.img root="LABEL=arch_os" rw i915.enable_psr=0 rd.luks.name=02e...eee=root CONFIG_DRM_AMDGPU_CIK=Y

What I tried:

  • I installed the AMD microcode and set it up in the loader options in systemd-boot
  • I updated the kernel to the newest one
  • I installed xf86-video-amdgpu
  • I updated all firmware using fwupdmgr
  • I uninstalled the X.org drivers (xf86-video-amdgpu)
  • I uninstalled all amdvlk drivers and installed the radeon drivers
  • I installed desktop portals
  • I added CONFIG_DRM_AMDGPU_CIK=Y to the kernel parameters. This results in the following message in dmesg: Unknown kernel command line parameters "CONFIG_DRM_AMDGPU_CIK=Y", will be passed to user space.
  • I tried adding radeon.dpm=0 and radeon.dpm=1, with no effect
  • I captured a sway debug log and found nothing interesting
  • I deleted ~/.cache/mesa directory

Nothing seems to resolve this issue.
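
A note on the kernel parameters tried above, as a hedged sketch. Names starting with CONFIG_ are build-time Kconfig options rather than boot parameters, which is consistent with the "Unknown kernel command line parameters" message. A module.param=value token only reaches the named module. The check below flags both cases for a given command line. The GPU_MODULES set and the function name are assumptions chosen for illustration.

```python
# Kernel GPU driver modules considered here (an illustrative, incomplete set).
GPU_MODULES = {"amdgpu", "radeon", "i915", "nouveau"}


def flag_cmdline_tokens(cmdline, driver_in_use):
    """Return {token: reason} for tokens that cannot affect driver_in_use."""
    flagged = {}
    for token in cmdline.split():
        key = token.split("=", 1)[0]
        if key.startswith("CONFIG_"):
            flagged[token] = "build-time Kconfig option, not a boot parameter"
            continue
        module, dot, _param = key.partition(".")
        if dot and module in GPU_MODULES and module != driver_in_use:
            flagged[token] = (
                f"parameter of module {module}, but {driver_in_use} is in use"
            )
    return flagged


# Tokens taken from the command line and the list of attempts in this post.
cmdline = "rw i915.enable_psr=0 radeon.dpm=0 CONFIG_DRM_AMDGPU_CIK=Y"
for token, reason in flag_cmdline_tokens(cmdline, "amdgpu").items():
    print(f"{token}: {reason}")
```

On a running system the actual command line can be read from /proc/cmdline, and lspci -k reports the driver in use (amdgpu in the output above).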


Question

What should I do to get rid of these freezes and crashes? Do you have any tips on how to diagnose this kind of issue?
Thank you for your help.

yagarea avatar Jan 16 '24 08:01 yagarea

  • discord/brave/electron/... all crash due to gpu driver issues. sadly i can't tell you why or how to fix those
  • the journalctl oom-killer crashes you see are due to running out of memory (that's why everything freezes until the kernel oom-kills something)

none of these are related to sway. though if they don't reproduce on another wayland compositor that would be interesting, but i bet it would be the same
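
To put numbers on the oom-killer lines, here is a small sketch that pulls the process name and memory figures out of journal text. The regular expression is written against the lines quoted in this issue and is an assumption about the general format; parse_oom_kills is a hypothetical helper.

```python
import re

# Matches the kernel OOM-killer line format seen in the journal excerpt.
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kills(journal_text):
    """Return one dict per OOM kill found in the given journal text."""
    kills = []
    for line in journal_text.splitlines():
        match = OOM_RE.search(line)
        if match:
            kills.append({
                "pid": int(match.group("pid")),
                "name": match.group("name"),
                "total_vm_kb": int(match.group("total_vm")),
                "anon_rss_kb": int(match.group("anon_rss")),
            })
    return kills


# First OOM line from the issue, used as sample input.
sample = (
    "Dec 23 22:19:05 arch-thinkpad kernel: Out of memory: Killed process 2068 "
    "(electron) total-vm:1190296860kB, anon-rss:175928kB, file-rss:256kB, "
    "shmem-rss:600kB, UID:1000 pgtables:1752kB oom_score_adj:300"
)
for kill in parse_oom_kills(sample):
    print(kill["name"], kill["pid"], f"{kill['anon_rss_kb'] / 1024:.0f} MiB resident")
```

Comparing anon-rss across the killed processes shows how much memory each kill could actually free, which is a different figure from the very large total-vm values.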

arch linux

do you have the mesa package? that is the one with the regular opengl drivers (check for /usr/lib/dri/radeonsi_dri.so). from the chromium logs it looks like it only tries vulkan, which it normally doesn't do

nekopsykose avatar Jan 17 '24 16:01 nekopsykose

glxinfo and eglinfo should list radeonsi (the driver) in the output somewhere, and glxgears should work to show you some gears (from mesa-demos/mesa-utils)

nekopsykose avatar Jan 17 '24 16:01 nekopsykose

ah right, you posted eglinfo already. yeah the driver is present there..

nekopsykose avatar Jan 17 '24 16:01 nekopsykose

Confirming the same issue, tested on Asus PN50 with a Ryzen 4700U and integrated graphics. Computer completely freezes.

supermarin avatar Jan 17 '24 18:01 supermarin

~~Does reverting 7e69a7076fc8a4eb788e0229b1c99dd0b7b04bb7 fix it for you? It does for me.~~

Edit: Don't know how I missed #7897

FakeMichau avatar Jan 18 '24 02:01 FakeMichau

ah, i forgot that chromium is still most likely depending on that. that explains all of those cases at least

someone would have to tell the chromium developers about it

nekopsykose avatar Jan 18 '24 06:01 nekopsykose

For me it isn't really chromium-based applications (Discord on wayland works with HW acceleration) but more like an issue with xwayland - can't launch Cyberpunk, window would flash for a second and then disappear. The error I got is: vulkan: No DRI3 support detected - required for presentation Note: you can probably enable DRI3 in your Xorg config

FakeMichau avatar Jan 18 '24 22:01 FakeMichau

Does reverting 7e69a70 fix it for you? It does for me.

I've observed this on 1.8.1, so that commit wasn't there in the first place. Seeing your answer above, I'm wondering if we're talking about two different issues. In fact, our issue might be different than OP's.

In my case, the whole system freezes on AMD. I've just switched to the AMD system and observed it for the first time in sway. Will put in some more testing to see if other DEs freeze on this machine and kernel, and do a memtest since I could be running into a hardware issue as well.

supermarin avatar Jan 19 '24 01:01 supermarin

The crash of electron apps you're observing is present on Intel as well, and I found it's correlated with high-DPI screens where you use scaling, in combination with Ozone. Running those apps on Xwayland doesn't crash them but makes them blurry.

supermarin avatar Jan 19 '24 01:01 supermarin

I was having freezes akin to this, I switched back to linux 6.6.10 and they went away

Kommynct avatar Jan 21 '24 22:01 Kommynct

Which kernel did you observe them on? I haven't seen one in a couple of days now, but can't remember if I was on a fresher kernel before. Right now on 6.1.72 from nixos/nixpkgs#a68bc4feaf4bbf4b626226ff8f0f8110588d4ebc

supermarin avatar Jan 22 '24 14:01 supermarin

the 6.7 series

Kommynct avatar Jan 22 '24 16:01 Kommynct

Since commit https://github.com/swaywm/sway/commit/7e69a7076fc8a4eb788e0229b1c99dd0b7b04bb7, all Vulkan applications running through XWayland crash for me, saying that DRI3 is not available.

Edit: -Dlegacy-wl-drm is the way to go. I missed that while scrolling down!

Scrumplex avatar Jan 23 '24 13:01 Scrumplex

Yeah, please try -Dlegacy-wl-drm if you have issues.

emersion avatar Jan 23 '24 13:01 emersion

-Dlegacy-wl-drm

How can I try it?

yagarea avatar Jan 23 '24 15:01 yagarea

Assuming that your version has commit https://github.com/swaywm/sway/commit/08a06a7b6bbb324e9fc6e49e96379340404135b4, you just need to add that flag to your sway call. If you are using some kind of display manager, you might need to read into how to do that there.

Scrumplex avatar Jan 23 '24 15:01 Scrumplex

this is occurring with the current archlinux release of sway, which is pre-that commit for me, although I think it's a kernel issue since downgrading to 6.6.10 fixes it for me

Kommynct avatar Jan 24 '24 17:01 Kommynct

I discovered that creating swap stops the freezes. Apps still crash but system will not freeze. They just die quicker...

I still call it progress
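
For checking whether swap is actually configured, here is a minimal sketch that reads SwapTotal from /proc/meminfo-style text. The function name swap_total_kb is made up here and the sample values are illustrative; the field name is the one /proc/meminfo uses on Linux.

```python
def swap_total_kb(meminfo_text):
    """Return SwapTotal in kB from /proc/meminfo-style text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Line looks like: "SwapTotal:       8388604 kB"
            return int(line.split()[1])
    return None


# Illustrative input; on a real system pass open("/proc/meminfo").read().
sample = "MemTotal:        8000000 kB\nSwapTotal:             0 kB\n"
print("swap configured" if swap_total_kb(sample) else "no swap")
```

With the sample text above this reports no swap, since SwapTotal is 0.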

yagarea avatar Jan 24 '24 18:01 yagarea

~@emersion perhaps we should split the "No DRI3 support detected" problem into a separate issue? It doesn't sound like it's the same as @yagarea's and I'd like to still have it tracked so that we don't have to use -Dlegacy-wl-drm to run Vulkan apps on Xwayland.~

Fixed in xwayland git so won't bother tracking.

ChrisLane avatar Jan 31 '24 09:01 ChrisLane

Update: I've replaced ram modules and crashes went away on that machine. memtest86 with the old memory wasn't completing, so I'll blame it on ram.

supermarin avatar Mar 02 '24 00:03 supermarin

Update: I've replaced ram modules and crashes went away on that machine. memtest86 with the old memory wasn't completing, so I'll blame it on ram.

Well, it turned out that these issues were caused by faulty RAM in my case as well. I cannot prove it, but moving to another Intel-based laptop made these problems disappear.

If someone can prove whether this was caused entirely by defective RAM, feel free to close this issue.

yagarea avatar Mar 02 '24 10:03 yagarea

@emersion I guess we can close it and reopen if someone else comes back with the same issue. Both @yagarea and I are back on Intel for DE machine, so it's going to be hard to drill this down further.

supermarin avatar Mar 02 '24 15:03 supermarin