ROCK-Kernel-Driver
(ppc64el) no-retry page fault: VM_L2_PROTECTION_FAULT
-- on every rocminfo call
amdgpu: update_gpuvm_pte() failed
amdgpu: SG Table of BO is UNEXPECTEDLY NULL
amdgpu: Failed to map bo to gpuvm
amdgpu 0000:03:00.0: amdgpu: Failed to map peer:0000:03:00.0 mem_domain:
-- occurs in hipblas
amdgpu: init_user_pages: Failed to get user pages: -1
amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process main pid 813201 thread main pid 813201)
amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x000079c2a1fff000 from IH client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00801031
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0
amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process main pid 813201 thread main pid 813201)
amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x000079c2a1ffa000 from IH client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0
amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process main pid 813201 thread main pid 813201)
amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x000079c2a1ff5000 from IH client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0
amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
amdgpu: sq_intr: error, se 1, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 0, err_type 2
amdgpu: sq_intr: error, se 1, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 0, err_type 2
amdgpu: sq_intr: error, se 1, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 0, err_type 2
amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
amdgpu: Resetting wave fronts (cpsch) on dev 00000000fa7830ec
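For reference, the per-field lines in the fault dumps above can be reproduced from the raw VM_L2_PROTECTION_FAULT_STATUS value. The sketch below is only an illustration: the bit layout in `FIELDS` is an assumption about gfx9-class hardware such as gfx906, not something stated in the log, and `decode_fault_status` is a hypothetical helper name. For 0x00801031 it does yield the same values the driver printed (MORE_FAULTS 0x1, PERMISSION_FAULTS 0x3, client ID 0x8, vmid 8).

```python
# Assumed field layout (name, low bit, width) for VM_L2_PROTECTION_FAULT_STATUS
# on gfx9-class GPUs. This layout is an assumption; it is cross-checked only
# against the values printed in the dmesg output above.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),   # UTCL2 client ID, e.g. 0x8 is reported as TCP in the log
    ("RW", 18, 1),
    ("VMID", 20, 4),
]


def decode_fault_status(status: int) -> dict:
    """Split a raw status word into the named bit fields listed in FIELDS."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


if __name__ == "__main__":
    # Raw value taken from the first fault in the log above.
    for name, value in decode_fault_status(0x00801031).items():
        print(f"{name}: {value:#x}")
```

The two later faults report a status of 0x00000000, so every field decodes to zero there; the client ID of 0 is what the driver prints as CB.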
However, I'm using the amdgpu driver that ships with the 6.3.4 kernel and hipBLAS from ROCm 5.3.2. Does this mean I would have to build the kernel from this repository, and how likely is it that doing so would help?
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: POWER9
Uuid: CPU-XX
Marketing Name: POWER9
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3800
BDFID: 0
Internal Node ID: 0
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 64391040(0x3d68780) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 64391040(0x3d68780) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 64391040(0x3d68780) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx906
Uuid: GPU-bc4261817337ecd7
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 26273(0x66a1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1725
BDFID: 768
Internal Node ID: 1
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
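To tie the gfx906 agent above to the device named in the kernel log, the BDFID can be unpacked into a PCI bus:device.function string. This is a minimal sketch that assumes rocminfo reports BDFID in the conventional PCI packing (bus in bits 15..8, device in bits 7..3, function in bits 2..0); `decode_bdfid` is a hypothetical helper, not part of any ROCm tool.

```python
def decode_bdfid(bdfid: int) -> str:
    """Unpack a 16-bit BDF value, assuming the conventional PCI packing."""
    bus = (bdfid >> 8) & 0xFF       # bits 15..8
    device = (bdfid >> 3) & 0x1F    # bits 7..3
    function = bdfid & 0x7          # bits 2..0
    return f"{bus:02x}:{device:02x}.{function:x}"


if __name__ == "__main__":
    # BDFID reported for Agent 2 (gfx906) in the rocminfo output above.
    print(decode_bdfid(768))
```

Under that assumption, 768 (0x300) decodes to 03:00.0, which matches the 0000:03:00.0 device in the amdgpu messages.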
@tucnak Apologies for the lack of response. Can you please check whether your issue still exists with the latest ROCm 6.2? If not, please close the ticket. Thanks!