
kernel oops for gfx1201 on Fedora Rawhide

Open trixirt opened this issue 6 months ago • 0 comments

Run this Fedora container on a Fedora Rawhide host: https://github.com/trixirt/rocm-distro-containers/blob/main/fedora/rawhide/rocblas/check/Dockerfile

with the args: docker run --device /dev/kfd --device /dev/dri -it --rm --cpus=1
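As a reproduction aid, here is a minimal Python sketch that checks the two device nodes exist on the host and assembles the same `docker run` arguments. The image tag `rocblas-check` is a hypothetical placeholder (the report does not name the built image), and the helper names are illustrative only.

```python
import os
import shlex

# Device nodes passed through to the container in the report.
DEVICES = ("/dev/kfd", "/dev/dri")


def missing_devices(devices=DEVICES):
    """Return the device nodes that do not exist on this host."""
    return [dev for dev in devices if not os.path.exists(dev)]


def build_docker_cmd(image, devices=DEVICES, cpus=1):
    """Assemble the docker run argv with the flags quoted in the report."""
    cmd = ["docker", "run"]
    for dev in devices:
        cmd += ["--device", dev]
    cmd += ["-it", "--rm", f"--cpus={cpus}", image]
    return cmd


if __name__ == "__main__":
    IMAGE = "rocblas-check"  # hypothetical tag, not taken from the report
    print("missing device nodes:", missing_devices())
    print(shlex.join(build_docker_cmd(IMAGE)))
```

If either node is reported missing, the amdgpu/KFD driver is probably not loaded on the host, and the container would fail before reaching the fault described here.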

This produces a backtrace:

...
[Detaching after vfork from child process 477]
Memory access fault by GPU node-1 (Agent handle: 0x55ad5e2a1be0) on address 0x7f7c0123e000. Reason: Page not present or supervisor privilege.
GPU core dump created: gpucore.16

Thread 15 "rocblas-test" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f7c00fff6c0 (LWP 32)]
__pthread_kill_implementation (threadid=, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
Missing rpms, try: dnf --enablerepo='debug' install blas-debuginfo-3.12.0-8.fc42.x86_64
(gdb) bt
#0 __pthread_kill_implementation (threadid=, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007f7c084d6cf3 in __pthread_kill_internal (threadid=, signo=6) at pthread_kill.c:89
#2 0x00007f7c0847cabe in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007f7c084646d0 in __GI_abort () at abort.c:73
#4 0x00007f7c080d0196 in rocr::core::Runtime::VMFaultHandler (val=, arg=) at /usr/src/debug/rocm-runtime-6.4.1-1.fc43.x86_64/runtime/hsa-runtime/core/runtime/runtime.cpp:1940
#5 0x00007f7c080ce7dc in operator() (__closure=, index=1, value=, wait_any=true) at /usr/src/debug/rocm-runtime-6.4.1-1.fc43.x86_64/runtime/hsa-runtime/core/runtime/runtime.cpp:1551
#6 rocr::core::Runtime::AsyncEventsLoop (_eventsInfo=0x55ad5e29f180) at /usr/src/debug/rocm-runtime-6.4.1-1.fc43.x86_64/runtime/hsa-runtime/core/runtime/runtime.cpp:1633
#7 0x00007f7c08058981 in rocr::os::ThreadTrampoline (arg=) at /usr/src/debug/rocm-runtime-6.4.1-1.fc43.x86_64/runtime/hsa-runtime/core/util/lnx/os_linux.cpp:86
#8 0x00007f7c084d4cc4 in start_thread (arg=) at pthread_create.c:448
#9 0x00007f7c08557494 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

And an oops on the host, from dmesg:

[ 702.395728] gmc_v12_0_process_interrupt: 94 callbacks suppressed
[ 702.395732] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:173 vmid:8 pasid:32771)
[ 702.395737] amdgpu 0000:03:00.0: amdgpu: in process rocblas-test pid 3814 thread rocblas-test pid 3814)
[ 702.395738] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x00007f7c0123e000 from client 10
[ 702.395740] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x0084115B
[ 702.395741] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[ 702.395742] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
[ 702.395743] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x5
[ 702.395743] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[ 702.395744] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 702.395745] amdgpu 0000:03:00.0: amdgpu: RW: 0x1
[ 702.395752] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:173 vmid:8 pasid:32771)
[ 702.395754] amdgpu 0000:03:00.0: amdgpu: in process rocblas-test pid 3814 thread rocblas-test pid 3814)
[ 702.395755] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x00007f7c0123e000 from client 10
[ 702.395762] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:173 vmid:8 pasid:32771)
[ 702.395763] amdgpu 0000:03:00.0: amdgpu: in process rocblas-test pid 3814 thread rocblas-test pid 3814)
[ 702.395764] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x00007f7c0123e000 from client 10
[ 702.395772] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:173 vmid:8 pasid:32771)
[ 702.395773] amdgpu 0000:03:00.0: amdgpu: in process rocblas-test pid 3814 thread rocblas-test pid 3814)
[ 702.395774] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x00007f7c0123e000 from client 10
[ 702.395781] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:173 vmid:8 pasid:32771)
[ 702.395782] amdgpu 0000:03:00.0: amdgpu: in process rocblas-test pid 3814 thread rocblas-test pid 3814)
[ 702.395783] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x00007f7c0123e000 from client 10
[ 702.395790] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:173 vmid:8 pasid:32771)
[ 702.395791] amdgpu 0000:03:00.0: amdgpu: in process rocblas-test pid 3814 thread rocblas-test pid 3814)
[ 702.395792] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x00007f7c0123e000 from client 10
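For anyone triaging this, the decoded fields in the dmesg output can be reproduced from the raw GCVM_L2_PROTECTION_FAULT_STATUS value. The sketch below is a rough illustration: the bit positions in `FIELDS` are an assumption (they are not stated in this report), chosen because they reproduce exactly the values the driver printed for 0x0084115B.

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS as (shift, width).
# This layout is an assumption, not taken from the report; it is consistent
# with the decoded values shown in the dmesg output.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # UTCL2 client ID
    "RW": (18, 1),
}


def decode_fault_status(status):
    """Extract each bitfield from the raw status register value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    for name, value in decode_fault_status(0x0084115B).items():
        print(f"{name}: {value:#x}")
```

With the logged value this yields MORE_FAULTS 0x1, WALKER_ERROR 0x5, PERMISSION_FAULTS 0x5, MAPPING_ERROR 0x1, CID 0x8 and RW 0x1, matching the lines the kernel printed. The faulting address 0x00007f7c0123e000 is also the same address reported by the ROCr runtime in user space.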

The running kernel:
$ uname -a
Linux fedora 6.16.0-0.rc0.250605gec7714e494790.13.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jun 6 09:52:12 UTC 2025 x86_64 GNU/Linux

rocminfo for the card ISA Info:


Agent 2


Name: gfx1201
Uuid: GPU-7b2a57bc7a036a5f
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 8192(0x2000) KB
L3: 65536(0x10000) KB
Chip ID: 29807(0x746f)
ASIC Revision: 1(0x1)
Cacheline Size: 256(0x100)
Max Clock Freq. (MHz): 2420
BDFID: 768
Internal Node ID: 1
Compute Unit: 64
SIMDs per CU: 2
Shader Engines: 4
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 1012
SDMA engine uCode:: 838
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16695296(0xfec000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1201
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx12-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
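To cross-check that this rocminfo agent is the same device that faulted in dmesg, the BDFID can be unpacked into a PCI address. The sketch below assumes the conventional packing (bus in bits 15:8, device in bits 7:3, function in bits 2:0); that layout and the helper names are assumptions for illustration, not something stated in the report. It also converts the Pool 1 size from KB to GiB.

```python
def decode_bdfid(bdfid):
    """Unpack a 16-bit BDF value into 'bus:device.function' (assumed layout)."""
    bus = (bdfid >> 8) & 0xFF
    device = (bdfid >> 3) & 0x1F
    function = bdfid & 0x7
    return f"{bus:02x}:{device:02x}.{function:x}"


def kib_to_gib(kib):
    """Convert a size in KiB (rocminfo prints 'KB') to GiB."""
    return kib / (1024 * 1024)


if __name__ == "__main__":
    print(decode_bdfid(768))               # 03:00.0
    print(f"{kib_to_gib(16695296):.2f}")   # 15.92
```

BDFID 768 decodes to 03:00.0, which lines up with the `amdgpu 0000:03:00.0` device in the dmesg output (assuming PCI domain 0000), and the coarse-grained pool comes to roughly 15.92 GiB.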

trixirt, Jun 11 '25 22:06