HIP
System freezes after error: 'hipErrorOutOfMemory'(2) at square.cpp:76
System information
❯ inxi -GSC -xx
System: Host: ernie Kernel: 5.7.9 x86_64 bits: 64 compiler: gcc v: 10.1.0 Desktop: N/A wm: kwin_x11 dm: SDDM
Distro: Gentoo Base System release 2.7
CPU: Topology: Quad Core model: AMD Ryzen 5 2400G with Radeon Vega Graphics bits: 64 type: MT MCP arch: Zen
L2 cache: 2048 KiB
flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 57490
Speed: 1706 MHz min/max: 1600/3600 MHz Core speeds (MHz): 1: 1706 2: 2587 3: 3209 4: 1675 5: 1708 6: 3318 7: 2136
8: 1592
Graphics: Device-1: Advanced Micro Devices [AMD/ATI] Baffin [Radeon RX 550 640SP / RX 560/560X] vendor: ASUSTeK
driver: amdgpu v: kernel bus ID: 01:00.0 chip ID: 1002:67ff
Device-2: AMD Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] vendor: ASUSTeK driver: amdgpu v: kernel
bus ID: 0a:00.0 chip ID: 1002:15dd
Display: server: X.Org 1.20.8 driver: amdgpu compositor: kwin_x11 resolution: 2560x1080~60Hz
OpenGL: renderer: AMD RAVEN (DRM 3.37.0 5.7.9 LLVM 10.0.0) v: 4.6 Mesa 20.1.3 direct render: Yes
Versions:
dev-libs/rocclr-3.5.0-r1
dev-libs/rocm-comgr-3.5.0
dev-libs/rocm-device-libs-3.5.1
dev-libs/rocm-opencl-runtime-3.5.0
dev-libs/rocr-runtime-3.5.0
dev-libs/roct-thunk-interface-3.6.0
dev-util/rocm-cmake-3.5.0
dev-util/rocminfo-3.5.0
sys-devel/llvm-roc-3.6.0
Problem
- I log in to the node via SSH (because of the graphical system freeze, see below).
- I build the square example:
❯ make HIP_PATH=/usr HIPCC_VERBOSE=1
/usr/bin/hipify-perl square.cu > square.cpp
/usr/bin/hipcc square.cpp -o square.out
LoadLib(libhsa-ext-image64.so.1) failed: libhsa-ext-image64.so.1: cannot open shared object file: No such file or directory
rocminfo: /tmp/portage/dev-libs/rocr-runtime-3.5.0/work/ROCR-Runtime-rocm-3.5.0/src/core/runtime/amd_memory_region.cpp:72: static void amd::MemoryRegion::FreeKfdMemory(void*, size_t): Assertion `status == HSAKMT_STATUS_SUCCESS' failed.
Warning: The specified HIP target: gfx902 is unknown. Correct compilation is not guaranteed.
hipcc-cmd: /usr/lib/llvm/roc/bin/clang++ -D__HIP_ROCclr__ -std=c++11 -isystem /usr/lib/llvm/roc/lib/clang/11.0.0/include/.. -D__HIP_ROCclr__ -D__HIP_ARCH_GFX902__=1 --cuda-gpu-arch=gfx902 -D__HIP_ARCH_GFX803__=1 --cuda-gpu-arch=gfx803 -O3 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false --hip-device-lib-path=/usr/lib -fhip-new-launch-api -L/usr/lib64 -O3 -lgcc_s -lgcc -lpthread -lm -x hip square.cpp -o square.out -Wl,--enable-new-dtags -Wl,--rpath=/usr/lib64:/usr/lib -lhip_hcc
- I execute the example:
❯ ./square.out
LoadLib(libhsa-ext-image64.so.1) failed: libhsa-ext-image64.so.1: cannot open shared object file: No such file or directory
LoadLib(libhsa-amd-aqlprofile64.so) failed: libhsa-amd-aqlprofile64.so: cannot open shared object file: No such file or directory
LoadLib(libhsa-amd-aqlprofile64.so) failed: libhsa-amd-aqlprofile64.so: cannot open shared object file: No such file or directory
info: running on device AMD Ryzen 5 2400G with Radeon Vega Graphics
info: allocate host mem ( 7.63 MB)
info: allocate device mem ( 7.63 MB)
error: 'hipErrorOutOfMemory'(2) at square.cpp:76
Afterwards my graphical system freezes and I need to REISUB.
This is reproducible every time I run ./square.out.
Regression
I never got HIP to work on this system. Still working on it. :)
Logs
Excerpts from the system journal of my last boot:
Jul 22 07:54:38 ernie kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Jul 22 07:54:39 ernie kernel: [drm] UVD and UVD ENC initialized successfully.
Jul 22 07:54:39 ernie kernel: [drm] VCE initialized successfully.
Jul 22 07:54:39 ernie kernel: amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
Jul 22 07:54:39 ernie kernel: Alloc host visible vram on small bar is not allowed
Jul 22 07:54:39 ernie systemd[1]: Started Process Core Dump (PID 1756750/UID 0).
Jul 22 07:54:39 ernie kernel: Evicting PASID 0x8026 queues
Jul 22 07:54:39 ernie kernel: Evicting PASID 0x8026 queues
Jul 22 07:54:39 ernie systemd-coredump[1756754]: Process 1756741 (rocminfo) of user 1000 dumped core.
Stack trace of thread 1756741:
#0 0x00007f71b44f2f91 raise (libc.so.6 + 0x38f91)
#1 0x00007f71b44dc537 abort (libc.so.6 + 0x22537)
#2 0x00007f71b44dc40f __assert_fail_base.cold (libc.so.6 + 0x2240f)
#3 0x00007f71b44eb3e2 __assert_fail (libc.so.6 + 0x313e2)
#4 0x00007f71b496c7d9 _ZN3amd12MemoryRegion13FreeKfdMemoryEPvm (libhsa-runtime64.so.1 + 0x4b7d9)
#5 0x00007f71b496d60d _ZNK3amd12MemoryRegion4FreeEPvm (libhsa-runtime64.so.1 + 0x4c60d)
#6 0x00007f71b49aff19 _ZN4core7Runtime10FreeMemoryEPv (libhsa-runtime64.so.1 + 0x8ef19)
#7 0x00007f71b49af568 _ZZN4core7Runtime13RegisterAgentEPNS_5AgentEENKUlPvE0_clES3_ (libhsa-runtime64.so.1 + 0x8e568)
#8 0x00007f71b49b7546 _ZSt13__invoke_implIvRZN4core7Runtime13RegisterAgentEPNS0_5AgentEEUlPvE0_JS4_EET_St14__invoke_otherOT0_DpOT1_ (libhsa-runtime64.so.1 + 0x96546)
#9 0x00007f71b49b725c _ZSt10__invoke_rIvRZN4core7Runtime13RegisterAgentEPNS0_5AgentEEUlPvE0_JS4_EENSt9enable_ifIXsrSt6__and_IJSt7is_voidIT_ESt14__is_invocableIT0_JDpT1_EEEE5valueESA_E4typeEOSD_DpOSE_ (libhsa-runtime64.so.1 + 0x9625c)
#10 0x00007f71b49b6d65 _ZNSt17_Function_handlerIFvPvEZN4core7Runtime13RegisterAgentEPNS2_5AgentEEUlS0_E0_E9_M_invokeERKSt9_Any_dataOS0_ (libhsa-runtime64.so.1 + 0x95d65)
#11 0x00007f71b4940087 _ZNKSt8functionIFvPvEEclES0_ (libhsa-runtime64.so.1 + 0x1f087)
#12 0x00007f71b4955454 _ZNK3amd8GpuAgent13ReleaseShaderEPvm (libhsa-runtime64.so.1 + 0x34454)
#13 0x00007f71b49547cb _ZN3amd8GpuAgentD2Ev (libhsa-runtime64.so.1 + 0x337cb)
#14 0x00007f71b4954960 _ZN3amd8GpuAgentD0Ev (libhsa-runtime64.so.1 + 0x33960)
#15 0x00007f71b49bd764 _ZNK12DeleteObjectclIN4core5AgentEEEvPKT_ (libhsa-runtime64.so.1 + 0x9c764)
#16 0x00007f71b49ba6a4 _ZSt8for_eachIN9__gnu_cxx17__normal_iteratorIPPN4core5AgentESt6vectorIS4_SaIS4_EEEE12DeleteObjectET0_T_SC_SB_ (libhsa-runtime64.so.1 + 0x996a4)
#17 0x00007f71b49b4c83 _ZN4core7Runtime6UnloadEv (libhsa-runtime64.so.1 + 0x93c83)
#18 0x00007f71b49af3a3 _ZN4core7Runtime7ReleaseEv (libhsa-runtime64.so.1 + 0x8e3a3)
#19 0x00007f71b4987452 _ZN3HSA13hsa_shut_downEv (libhsa-runtime64.so.1 + 0x66452)
#20 0x00007f71b49d1f92 hsa_shut_down (libhsa-runtime64.so.1 + 0xb0f92)
#21 0x00005620d9aaa931 main (rocminfo + 0x8931)
#22 0x00007f71b44ddcaa __libc_start_main (libc.so.6 + 0x23caa)
#23 0x00005620d9aa40ba _start (rocminfo + 0x20ba)
Stack trace of thread 1756749:
#0 0x00007f71b45af957 ioctl (libc.so.6 + 0xf5957)
#1 0x00007f71b447f800 kmtIoctl (libhsakmt.so.1 + 0xb800)
#2 0x00007f71b447991d hsaKmtWaitOnMultipleEvents (libhsakmt.so.1 + 0x591d)
#3 0x00007f71b49cba66 _ZN4core6Signal7WaitAnyEjPK12hsa_signal_sPK22hsa_signal_condition_tPKlm16hsa_wait_state_tPl (libhsa-runtime64.so.1 + 0xaaa66)
#4 0x00007f71b49972fa _ZN3AMD23hsa_amd_signal_wait_anyEjP12hsa_signal_sP22hsa_signal_condition_tPlm16hsa_wait_state_tS4_ (libhsa-runtime64.so.1 + 0x762fa)
#5 0x00007f71b49b3286 _ZN4core7Runtime15AsyncEventsLoopEPv (libhsa-runtime64.so.1 + 0x92286)
#6 0x00007f71b4936597 _ZN2os16ThreadTrampolineEPv (libhsa-runtime64.so.1 + 0x15597)
#7 0x00007f71b4688fea start_thread (libpthread.so.0 + 0x7fea)
#8 0x00007f71b45b8edf __clone (libc.so.6 + 0xfeedf)
Jul 22 08:00:23 ernie kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Jul 22 08:00:23 ernie kernel: [drm] UVD and UVD ENC initialized successfully.
Jul 22 08:00:23 ernie kernel: [drm] VCE initialized successfully.
Jul 22 08:00:23 ernie kernel: amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
Jul 22 08:00:23 ernie kernel: Alloc host visible vram on small bar is not allowed
Jul 22 08:00:23 ernie systemd[1]: Started Process Core Dump (PID 1764207/UID 0).
Jul 22 08:00:24 ernie kernel: Evicting PASID 0x8026 queues
Jul 22 08:00:24 ernie kernel: Evicting PASID 0x8026 queues
Jul 22 08:00:24 ernie systemd-coredump[1764209]: Process 1764173 (rocminfo) of user 1000 dumped core.
Stack trace of thread 1764173:
#0 0x00007f6d248aef91 raise (libc.so.6 + 0x38f91)
#1 0x00007f6d24898537 abort (libc.so.6 + 0x22537)
#2 0x00007f6d2489840f __assert_fail_base.cold (libc.so.6 + 0x2240f)
#3 0x00007f6d248a73e2 __assert_fail (libc.so.6 + 0x313e2)
#4 0x00007f6d24d287d9 _ZN3amd12MemoryRegion13FreeKfdMemoryEPvm (libhsa-runtime64.so.1 + 0x4b7d9)
#5 0x00007f6d24d2960d _ZNK3amd12MemoryRegion4FreeEPvm (libhsa-runtime64.so.1 + 0x4c60d)
#6 0x00007f6d24d6bf19 _ZN4core7Runtime10FreeMemoryEPv (libhsa-runtime64.so.1 + 0x8ef19)
#7 0x00007f6d24d6b568 _ZZN4core7Runtime13RegisterAgentEPNS_5AgentEENKUlPvE0_clES3_ (libhsa-runtime64.so.1 + 0x8e568)
#8 0x00007f6d24d73546 _ZSt13__invoke_implIvRZN4core7Runtime13RegisterAgentEPNS0_5AgentEEUlPvE0_JS4_EET_St14__invoke_otherOT0_DpOT1_ (libhsa-runtime64.so.1 + 0x96546)
#9 0x00007f6d24d7325c _ZSt10__invoke_rIvRZN4core7Runtime13RegisterAgentEPNS0_5AgentEEUlPvE0_JS4_EENSt9enable_ifIXsrSt6__and_IJSt7is_voidIT_ESt14__is_invocableIT0_JDpT1_EEEE5valueESA_E4typeEOSD_DpOSE_ (libhsa-runtime64.so.1 + 0x9625c)
#10 0x00007f6d24d72d65 _ZNSt17_Function_handlerIFvPvEZN4core7Runtime13RegisterAgentEPNS2_5AgentEEUlS0_E0_E9_M_invokeERKSt9_Any_dataOS0_ (libhsa-runtime64.so.1 + 0x95d65)
#11 0x00007f6d24cfc087 _ZNKSt8functionIFvPvEEclES0_ (libhsa-runtime64.so.1 + 0x1f087)
#12 0x00007f6d24d11454 _ZNK3amd8GpuAgent13ReleaseShaderEPvm (libhsa-runtime64.so.1 + 0x34454)
#13 0x00007f6d24d107cb _ZN3amd8GpuAgentD2Ev (libhsa-runtime64.so.1 + 0x337cb)
#14 0x00007f6d24d10960 _ZN3amd8GpuAgentD0Ev (libhsa-runtime64.so.1 + 0x33960)
#15 0x00007f6d24d79764 _ZNK12DeleteObjectclIN4core5AgentEEEvPKT_ (libhsa-runtime64.so.1 + 0x9c764)
#16 0x00007f6d24d766a4 _ZSt8for_eachIN9__gnu_cxx17__normal_iteratorIPPN4core5AgentESt6vectorIS4_SaIS4_EEEE12DeleteObjectET0_T_SC_SB_ (libhsa-runtime64.so.1 + 0x996a4)
#17 0x00007f6d24d70c83 _ZN4core7Runtime6UnloadEv (libhsa-runtime64.so.1 + 0x93c83)
#18 0x00007f6d24d6b3a3 _ZN4core7Runtime7ReleaseEv (libhsa-runtime64.so.1 + 0x8e3a3)
#19 0x00007f6d24d43452 _ZN3HSA13hsa_shut_downEv (libhsa-runtime64.so.1 + 0x66452)
#20 0x00007f6d24d8df92 hsa_shut_down (libhsa-runtime64.so.1 + 0xb0f92)
#21 0x00005595cf192931 main (rocminfo + 0x8931)
#22 0x00007f6d24899caa __libc_start_main (libc.so.6 + 0x23caa)
#23 0x00005595cf18c0ba _start (rocminfo + 0x20ba)
Stack trace of thread 1764206:
#0 0x00007f6d2496b957 ioctl (libc.so.6 + 0xf5957)
#1 0x00007f6d2483b800 kmtIoctl (libhsakmt.so.1 + 0xb800)
#2 0x00007f6d2483591d hsaKmtWaitOnMultipleEvents (libhsakmt.so.1 + 0x591d)
#3 0x00007f6d24d87a66 _ZN4core6Signal7WaitAnyEjPK12hsa_signal_sPK22hsa_signal_condition_tPKlm16hsa_wait_state_tPl (libhsa-runtime64.so.1 + 0xaaa66)
#4 0x00007f6d24d532fa _ZN3AMD23hsa_amd_signal_wait_anyEjP12hsa_signal_sP22hsa_signal_condition_tPlm16hsa_wait_state_tS4_ (libhsa-runtime64.so.1 + 0x762fa)
#5 0x00007f6d24d6f286 _ZN4core7Runtime15AsyncEventsLoopEPv (libhsa-runtime64.so.1 + 0x92286)
#6 0x00007f6d24cf2597 _ZN2os16ThreadTrampolineEPv (libhsa-runtime64.so.1 + 0x15597)
#7 0x00007f6d24a44fea start_thread (libpthread.so.0 + 0x7fea)
#8 0x00007f6d24974edf __clone (libc.so.6 + 0xfeedf)
Jul 22 08:00:24 ernie systemd[1]: [email protected]: Succeeded.
Jul 22 08:01:00 ernie kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Jul 22 08:01:00 ernie kernel: [drm] UVD and UVD ENC initialized successfully.
Jul 22 08:01:00 ernie kernel: [drm] VCE initialized successfully.
Jul 22 08:01:00 ernie kernel: amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
Jul 22 08:01:00 ernie kernel: Alloc host visible vram on small bar is not allowed
Jul 22 08:01:00 ernie kernel: Evicting PASID 0x8026 queues
Jul 22 08:01:00 ernie kernel: Evicting PASID 0x8026 queues
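The `Alloc host visible vram on small bar is not allowed` lines above suggest that the CPU-visible memory aperture of the card at `0000:01:00.0` is smaller than its VRAM. One way to check this is to read the device's `resource` file in sysfs, where each line holds the start address, end address and flags of one PCI region in hexadecimal. The following is a minimal sketch, not something from the original report: the helper name `bar_sizes` is made up for illustration, and which region index corresponds to the VRAM aperture is not determined here.

```python
def bar_sizes(resource_text):
    """Return [(region_index, size_in_bytes)] for the populated regions
    listed in the text of a sysfs PCI `resource` file."""
    sizes = []
    for index, line in enumerate(resource_text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that is not "start end flags"
        start, end, _flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # region not populated
        sizes.append((index, end - start + 1))
    return sizes


def format_mib(size_in_bytes):
    """Format a byte count as MiB with one decimal place."""
    return f"{size_in_bytes / (1024 * 1024):.1f} MiB"
```

On the affected machine this could be called as `bar_sizes(open("/sys/bus/pci/devices/0000:01:00.0/resource").read())`, assuming the PCI address from the log is still the one in use. A region of only 256 MiB on a card with more VRAM than that would be consistent with the "small bar" message.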
Afterwards the system was running for a while without me interacting with it. When I came back, I couldn't access my X11 session anymore (system not reacting to keyboard input, like NumLock, switching to VT not possible), so I had to REISUB:
Jul 22 08:52:38 ernie kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Jul 22 08:52:38 ernie kernel: [drm] UVD and UVD ENC initialized successfully.
Jul 22 08:52:38 ernie kernel: [drm] VCE initialized successfully.
Jul 22 08:52:38 ernie kernel: amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
Jul 22 08:52:49 ernie kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:56:crtc-0] flip_done timed out
Jul 22 08:52:59 ernie kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:49:plane-3] flip_done timed out
Jul 22 08:53:40 ernie kernel: sysrq: Keyboard mode set to system default
Jul 22 08:53:41 ernie kernel: sysrq: Terminate All Tasks
Jul 22 08:53:41 ernie kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Jul 22 08:53:41 ernie kernel: bpfilter: Loaded bpfilter_umh pid 1811427
Jul 22 08:53:41 ernie kernel: [drm] UVD and UVD ENC initialized successfully.
Jul 22 08:53:41 ernie kernel: [drm] VCE initialized successfully.
Jul 22 08:53:41 ernie kernel: amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
Jul 22 08:53:42 ernie kernel: sysrq: Kill All Tasks
Jul 22 08:53:42 ernie kernel: ------------[ cut here ]------------
Jul 22 08:53:42 ernie kernel: WARNING: CPU: 6 PID: 1430 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:6787 amdgpu_dm_atomic_commit_tail+0x20bd/0x2230 [amdgpu]
Jul 22 08:53:42 ernie kernel: Modules linked in: squashfs loop snd_seq_dummy snd_hrtimer snd_seq fuse nft_masq nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nf_nat_tftp nft_objref nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject >
Jul 22 08:53:42 ernie kernel: snd_hwdep gspca_vc032x uvcvideo gspca_main kvm asus_wmi ecdh_generic cmac videobuf2_vmalloc md4 videobuf2_memops amd_iommu_v2 battery gpu_sched videobuf2_v4l2 ecc crc16 videobuf2_common irqbypass sparse_keymap ttm snd_pcm pcspkr rfkill videodev wmi_bmof sp5100_tco k10temp i2c_piix4 drm_kms_helper joydev mc snd_timer mousedev input>
Jul 22 08:53:42 ernie kernel: pkcs8_key_parser
Jul 22 08:53:42 ernie kernel: CPU: 6 PID: 1430 Comm: X:sh5 Tainted: G T 5.7.9 #2
Jul 22 08:53:42 ernie kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B350-F GAMING, BIOS 5406 11/13/2019
Jul 22 08:53:42 ernie kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x20bd/0x2230 [amdgpu]
Jul 22 08:53:42 ernie kernel: Code: ff ff 41 8b 4c 24 60 48 c7 c2 60 26 2b c1 bf 02 00 00 00 48 c7 c6 80 81 32 c1 e8 5e 2f 4a ff 49 8b 4f 08 e9 bd e0 ff ff 0f 0b <0f> 0b e9 b0 ef ff ff 0f 0b e9 c9 ef ff ff 48 8b 85 68 fd ff ff 48
Jul 22 08:53:42 ernie kernel: RSP: 0018:ffffad3302d83870 EFLAGS: 00010002
Jul 22 08:53:42 ernie kernel: RAX: 0000000000000286 RBX: 0000000000000003 RCX: 0000000000000000
Jul 22 08:53:42 ernie kernel: RDX: 0000000000000002 RSI: 0000000000000202 RDI: 0000000000000000
Jul 22 08:53:42 ernie kernel: RBP: ffffad3302d83b60 R08: 0000000000000005 R09: 0000000000000000
Jul 22 08:53:42 ernie kernel: R10: ffffad3302d837d8 R11: ffffad3302d837dc R12: 0000000000000286
Jul 22 08:53:42 ernie kernel: R13: ffff9308d9249800 R14: ffff930651c83800 R15: ffff9308e4953080
Jul 22 08:53:42 ernie kernel: FS: 00007f919effd700(0000) GS:ffff9308f0780000(0000) knlGS:0000000000000000
Jul 22 08:53:42 ernie kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 22 08:53:42 ernie kernel: CR2: 0000556e92a96828 CR3: 0000000073a0a000 CR4: 00000000003406e0
Jul 22 08:53:42 ernie kernel: Call Trace:
Jul 22 08:53:42 ernie kernel: commit_tail+0x94/0x130 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel: drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel: drm_client_modeset_commit_atomic+0x1c9/0x200 [drm]
Jul 22 08:53:42 ernie kernel: drm_client_modeset_commit_locked+0x54/0x150 [drm]
Jul 22 08:53:42 ernie kernel: drm_client_modeset_commit+0x24/0x40 [drm]
Jul 22 08:53:42 ernie kernel: drm_fb_helper_set_par+0xa5/0xd0 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel: drm_fb_helper_hotplug_event.part.0+0xa3/0xc0 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel: amdgpu_driver_lastclose_kms+0xa/0x10 [amdgpu]
Jul 22 08:53:42 ernie kernel: drm_release+0xd2/0x100 [drm]
Jul 22 08:53:42 ernie kernel: __fput+0xe5/0x250
Jul 22 08:53:42 ernie kernel: task_work_run+0x5f/0x80
Jul 22 08:53:42 ernie kernel: do_exit+0x363/0xb40
Jul 22 08:53:42 ernie kernel: do_group_exit+0x36/0xa0
Jul 22 08:53:42 ernie kernel: get_signal+0x148/0x920
Jul 22 08:53:42 ernie kernel: ? __handle_mm_fault+0xe54/0x18f0
Jul 22 08:53:42 ernie kernel: do_signal+0x3d/0x720
Jul 22 08:53:42 ernie kernel: ? preempt_count_add+0x49/0xa0
Jul 22 08:53:42 ernie kernel: prepare_exit_to_usermode+0xf2/0x170
Jul 22 08:53:42 ernie kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 22 08:53:42 ernie kernel: RIP: 0033:0x7f91b6011ad5
Jul 22 08:53:42 ernie kernel: Code: Bad RIP value.
Jul 22 08:53:42 ernie kernel: RSP: 002b:00007f919effcae0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Jul 22 08:53:42 ernie kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f91b6011ad5
Jul 22 08:53:42 ernie kernel: RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000055f789922a24
Jul 22 08:53:42 ernie kernel: RBP: 000055f7899229f8 R08: 0000000000000000 R09: 0000000000000000
Jul 22 08:53:42 ernie kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f919effcb10
Jul 22 08:53:42 ernie kernel: R13: 000055f7899229d0 R14: 0000000000000001 R15: 000055f789922a24
Jul 22 08:53:42 ernie kernel: ---[ end trace 04201852eb3a754f ]---
Jul 22 08:53:42 ernie kernel: ------------[ cut here ]------------
Jul 22 08:53:42 ernie kernel: WARNING: CPU: 6 PID: 1430 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:6389 amdgpu_dm_atomic_commit_tail+0x20c4/0x2230 [amdgpu]
Jul 22 08:53:42 ernie kernel: Modules linked in: squashfs loop snd_seq_dummy snd_hrtimer snd_seq fuse nft_masq nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nf_nat_tftp nft_objref nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject >
Jul 22 08:53:42 ernie kernel: snd_hwdep gspca_vc032x uvcvideo gspca_main kvm asus_wmi ecdh_generic cmac videobuf2_vmalloc md4 videobuf2_memops amd_iommu_v2 battery gpu_sched videobuf2_v4l2 ecc crc16 videobuf2_common irqbypass sparse_keymap ttm snd_pcm pcspkr rfkill videodev wmi_bmof sp5100_tco k10temp i2c_piix4 drm_kms_helper joydev mc snd_timer mousedev input>
Jul 22 08:53:42 ernie kernel: pkcs8_key_parser
Jul 22 08:53:42 ernie kernel: CPU: 6 PID: 1430 Comm: X:sh5 Tainted: G W T 5.7.9 #2
Jul 22 08:53:42 ernie kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B350-F GAMING, BIOS 5406 11/13/2019
Jul 22 08:53:42 ernie kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x20c4/0x2230 [amdgpu]
Jul 22 08:53:42 ernie kernel: Code: 48 c7 c2 60 26 2b c1 bf 02 00 00 00 48 c7 c6 80 81 32 c1 e8 5e 2f 4a ff 49 8b 4f 08 e9 bd e0 ff ff 0f 0b 0f 0b e9 b0 ef ff ff <0f> 0b e9 c9 ef ff ff 48 8b 85 68 fd ff ff 48 8d 8d e0 fd ff ff 48
Jul 22 08:53:42 ernie kernel: RSP: 0018:ffffad3302d83870 EFLAGS: 00010082
Jul 22 08:53:42 ernie kernel: RAX: 0000000000000001 RBX: 0000000000000003 RCX: 0000000000000000
Jul 22 08:53:42 ernie kernel: RDX: 0000000000000002 RSI: 0000000000000202 RDI: 0000000000000000
Jul 22 08:53:42 ernie kernel: RBP: ffffad3302d83b60 R08: 0000000000000005 R09: 0000000000000000
Jul 22 08:53:42 ernie kernel: R10: ffffad3302d837d8 R11: ffffad3302d837dc R12: 0000000000000286
Jul 22 08:53:42 ernie kernel: R13: ffff9308d9249800 R14: ffff930651c83800 R15: ffff9308e4953080
Jul 22 08:53:42 ernie kernel: FS: 00007f919effd700(0000) GS:ffff9308f0780000(0000) knlGS:0000000000000000
Jul 22 08:53:42 ernie kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 22 08:53:42 ernie kernel: CR2: 00007f91b6011aab CR3: 0000000073a0a000 CR4: 00000000003406e0
Jul 22 08:53:42 ernie kernel: Call Trace:
Jul 22 08:53:42 ernie kernel: commit_tail+0x94/0x130 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel: drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel: drm_client_modeset_commit_atomic+0x1c9/0x200 [drm]
Jul 22 08:53:42 ernie kernel: drm_client_modeset_commit_locked+0x54/0x150 [drm]
Jul 22 08:53:42 ernie kernel: drm_client_modeset_commit+0x24/0x40 [drm]
Jul 22 08:53:42 ernie kernel: drm_fb_helper_set_par+0xa5/0xd0 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel: drm_fb_helper_hotplug_event.part.0+0xa3/0xc0 [drm_kms_helper]
Jul 22 08:53:42 ernie kernel: amdgpu_driver_lastclose_kms+0xa/0x10 [amdgpu]
Jul 22 08:53:42 ernie kernel: drm_release+0xd2/0x100 [drm]
Jul 22 08:53:42 ernie kernel: __fput+0xe5/0x250
Jul 22 08:53:42 ernie kernel: task_work_run+0x5f/0x80
Jul 22 08:53:42 ernie kernel: do_exit+0x363/0xb40
Jul 22 08:53:42 ernie kernel: do_group_exit+0x36/0xa0
Jul 22 08:53:42 ernie kernel: get_signal+0x148/0x920
Jul 22 08:53:42 ernie kernel: ? __handle_mm_fault+0xe54/0x18f0
Jul 22 08:53:42 ernie kernel: do_signal+0x3d/0x720
Jul 22 08:53:42 ernie kernel: ? preempt_count_add+0x49/0xa0
Jul 22 08:53:42 ernie kernel: prepare_exit_to_usermode+0xf2/0x170
Jul 22 08:53:42 ernie kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 22 08:53:42 ernie kernel: RIP: 0033:0x7f91b6011ad5
Jul 22 08:53:42 ernie kernel: Code: Bad RIP value.
Jul 22 08:53:42 ernie kernel: RSP: 002b:00007f919effcae0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Jul 22 08:53:42 ernie kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f91b6011ad5
Jul 22 08:53:42 ernie kernel: RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000055f789922a24
Jul 22 08:53:42 ernie kernel: RBP: 000055f7899229f8 R08: 0000000000000000 R09: 0000000000000000
Jul 22 08:53:42 ernie kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f919effcb10
Jul 22 08:53:42 ernie kernel: R13: 000055f7899229d0 R14: 0000000000000001 R15: 000055f789922a24
Jul 22 08:53:42 ernie kernel: ---[ end trace 04201852eb3a7550 ]---
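To compare boots or reproduction attempts without reading the whole journal, the recurring messages in the excerpts above can be counted. This is a rough sketch under my own assumptions: the category names are invented for illustration, and the patterns are copied from the log lines quoted in this report, so they may not match other kernel or ROCm versions.

```python
import re
from collections import Counter

# Hypothetical category names; the patterns come from the messages quoted above.
PATTERNS = {
    "small_bar_alloc_refused": re.compile(
        r"Alloc host visible vram on small bar is not allowed"),
    "queue_eviction": re.compile(r"Evicting PASID 0x[0-9a-fA-F]+ queues"),
    "flip_done_timeout": re.compile(r"flip_done timed out"),
    "rocminfo_coredump": re.compile(r"\(rocminfo\) of user \d+ dumped core"),
    "amdgpu_dm_warning": re.compile(r"WARNING: .*amdgpu_dm\.c:\d+"),
}


def summarize(lines):
    """Count how many journal lines match each known message pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

For example, the output of `journalctl -b -1 --no-pager` could be saved to a file and passed in with `summarize(open("journal.txt"))` (the file name is only a placeholder).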
Other information
I also see exceptions and segfaults in Clover and ROCm's OpenCL implementation when executing clinfo and rocminfo:
- https://gitlab.freedesktop.org/mesa/mesa/-/issues/3255
- https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/issues/32
- https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/97
I also see the system hanging in a very similar manner to this one when trying to use OpenCL from the JVM (running the Neanderthal examples), but since that is a lot more high level, I do not have a useful MWE for that. When trying this, I also regularly encountered OpenCL "out of memory" errors.
I encountered a similar error on ROCm 4.5.2. The first time, the system froze, apparently as a result of running out of RAM (32 GB)! Since then, whenever I try to run it, I just get hipErrorOutOfMemory.
Maybe I need to try downgrading?
hipconfig:
HIP version : 4.4.21432-f9dccde4
== hipconfig
HIP_PATH : /opt/rocm-4.5.2/hip
ROCM_PATH : /opt/rocm-4.5.2
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME : rocclr
CPP_CONFIG : -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-4.5.2/hip/include -I/opt/rocm-4.5.2/llvm/bin/../lib/clang/13.0.0 -I/opt/rocm-4.5.2/hsa/include
== hip-clang
HSA_PATH : /opt/rocm-4.5.2/hsa
HIP_CLANG_PATH : /opt/rocm-4.5.2/llvm/bin
AMD clang version 13.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-4.5.2 21432 9bbd96fd1936641cd47defd8022edafd063019d5)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-4.5.2/llvm/bin
AMD LLVM version 13.0.0git
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: znver2
Registered Targets:
amdgcn - AMD GCN GPUs
r600 - AMD GPUs HD2XXX-HD6XXX
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -std=c++11 -isystem "/opt/rocm-4.5.2/llvm/lib/clang/13.0.0/include/.." -isystem /opt/rocm-4.5.2/hsa/include -isystem "/opt/rocm-4.5.2/hip/include" -O3
hip-clang-ldflags : --driver-mode=g++ -L"/opt/rocm-4.5.2/hip/lib" -O3 -lgcc_s -lgcc -lpthread -lm -lrt
=== Environment Variables
PATH=/home/user1/.vscode-server/bin/fe719cd3e5825bf14e14182fddeb88ee8daf044f/bin:/home/user1/.vscode-server/bin/fe719cd3e5825bf14e14182fddeb88ee8daf044f/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
== Linux Kernel
Hostname : roxane
Linux roxane 5.10.0-1052-oem #54-Ubuntu SMP Tue Nov 23 09:06:13 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
@devurandom, sorry for the lack of response. Please try the latest ROCm 6.0.2 (HIP 6.0.32831) to see whether your issue still exists. If it is resolved, please close the ticket. Thanks.
Sorry, this has been too long and I no longer have access to that system.