[Issue]: OpenVX - AMD Custom Kernels: GPU Failures (HIP & OCL)

Open kiritigowda opened this issue 6 months ago • 2 comments

Problem Description

The following AMD custom kernels fail on GPU:

HIP

Convolve_S16_U8_3x9.gdf
Convolve_S16_U8_5x5.gdf
Convolve_S16_U8_7x7.gdf
Convolve_S16_U8_ANY_ANY.gdf
Convolve_U8_U8_3x9.gdf
Convolve_U8_U8_5x5.gdf
Convolve_U8_U8_7x7.gdf
Convolve_U8_U8_odd.gdf

OCL

Convolve_S16_U8_9x9.gdf
Convolve_U8_U8_9x9.gdf
Dilate_U1_U1_3x3.gdf
Dilate_U8_U1_3x3.gdf
Erode_U1_U1_3x3.gdf
Erode_U8_U1_3x3.gdf
WarpPerspective_U8_U8_Bilinear.gdf
WarpPerspective_U8_U8_Nearest.gdf

Operating System

ALL

CPU

ANY

GPU

AMD Instinct MI300

Other

No response

ROCm Version

ROCm 6.0.0

ROCm Component

MIVisionX

Steps to Reproduce

Run each of the GDFs below with runvx to reproduce the errors:

HIP

tests/amd_openvx_gdfs/cpu/hidden/GPU_FAIL_Convolve_S16_U8_3x9.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAIL_Convolve_S16_U8_5x5.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAIL_Convolve_S16_U8_7x7.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAIL_Convolve_S16_U8_odd.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAIL_Convolve_U8_U8_3x9.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAIL_Convolve_U8_U8_5x5.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAIL_Convolve_U8_U8_7x7.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAIL_Convolve_U8_U8_odd.gdf

OCL

tests/amd_openvx_gdfs/cpu/hidden/GPU_FAILURE_OCL_Convolve_S16_U8_9x9.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAILURE_OCL_Convolve_U8_U8_9x9.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAILURE_OCL_Dilate_U1_U1_3x3.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAILURE_OCL_Dilate_U8_U1_3x3.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAILURE_OCL_Erode_U1_U1_3x3.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAILURE_OCL_Erode_U8_U1_3x3.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAILURE_OCL_WarpPerspective_U8_U8_Bilinear.gdf
tests/amd_openvx_gdfs/cpu/hidden/GPU_FAILURE_OCL_WarpPerspective_U8_U8_Nearest.gdf
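
To run the whole list in one pass, a small driver script can help. The following is a minimal sketch, assuming `runvx` is on the `PATH`, that it accepts the GDF path as its last argument, and that the script is started from the repository root. The helper names (`build_command`, `run_all`) and the `extra_args` parameter are hypothetical; the issue does not state which runvx options were used, so none are filled in here.

```python
import subprocess
from pathlib import Path

# Directory and file names are taken from the lists above.
GDF_DIR = Path("tests/amd_openvx_gdfs/cpu/hidden")

HIP_GDFS = [
    "GPU_FAIL_Convolve_S16_U8_3x9.gdf",
    "GPU_FAIL_Convolve_S16_U8_5x5.gdf",
    "GPU_FAIL_Convolve_S16_U8_7x7.gdf",
    "GPU_FAIL_Convolve_S16_U8_odd.gdf",
    "GPU_FAIL_Convolve_U8_U8_3x9.gdf",
    "GPU_FAIL_Convolve_U8_U8_5x5.gdf",
    "GPU_FAIL_Convolve_U8_U8_7x7.gdf",
    "GPU_FAIL_Convolve_U8_U8_odd.gdf",
]

OCL_GDFS = [
    "GPU_FAILURE_OCL_Convolve_S16_U8_9x9.gdf",
    "GPU_FAILURE_OCL_Convolve_U8_U8_9x9.gdf",
    "GPU_FAILURE_OCL_Dilate_U1_U1_3x3.gdf",
    "GPU_FAILURE_OCL_Dilate_U8_U1_3x3.gdf",
    "GPU_FAILURE_OCL_Erode_U1_U1_3x3.gdf",
    "GPU_FAILURE_OCL_Erode_U8_U1_3x3.gdf",
    "GPU_FAILURE_OCL_WarpPerspective_U8_U8_Bilinear.gdf",
    "GPU_FAILURE_OCL_WarpPerspective_U8_U8_Nearest.gdf",
]


def build_command(gdf, runvx="runvx", extra_args=()):
    """Return the argv list for one runvx invocation (assumed CLI shape)."""
    return [runvx, *extra_args, str(GDF_DIR / gdf)]


def run_all(gdfs, runvx="runvx", extra_args=(), timeout=120):
    """Run each GDF in its own process and map its name to the exit code.

    On POSIX a negative code means the process was killed by a signal
    (for example an abort); None means the run hit the timeout.
    """
    results = {}
    for gdf in gdfs:
        try:
            proc = subprocess.run(
                build_command(gdf, runvx, extra_args),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            results[gdf] = proc.returncode
        except subprocess.TimeoutExpired:
            results[gdf] = None
    return results
```

Calling `run_all(HIP_GDFS)` or `run_all(OCL_GDFS)` returns one exit code per file. Running each GDF in a separate process keeps a crash in one kernel from hiding the results of the others.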

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

kiritigowda • Jun 18 '25 22:06

The kernels that fail on HIP also fail on OCL.

Failures: Running GDF - 26:GPU_FAIL_Convolve_S16_U8_3x9.gdf
GPU_FAIL_Convolve_S16_U8_3x9.gdf
runvx 1.0.0
OK: using AMD OpenVX 1.3.0
include GPU_FAIL_Convolve_S16_U8_3x9.gdf
data input_1 = uniform-image:1920,1080,U008,125
data output_1 = image:1920,1080,S016
data input_matrix = convolution:3,9
node org.khronos.openvx.custom_convolution input_1 input_matrix output_1
OK: OpenVX using GPU device - 0: gfx1030 [OpenCL 2.0 ] [CL_DEVICE_SVM_CAPABILITIES 0 0]
# ago graph dump BEGIN [internal]
data input_1 = image-uniform:U008,1920,1080,125
data output_1 = image:S016,1920,1080
data input_matrix = convolution:3,9
node com.amd.openvx.Convolve_S16_U8 output_1 input_1 input_matrix attr:AFFINITY:GPU,1
# ago graph dump END [internal]
csv,HEADER ,STATUS, COUNT,cur-ms,avg-ms,min-ms,clenqueue-ms,clwait-ms,clwrite-ms,clread-ms
Memory access fault by GPU node-1 (Agent handle: 0x3f271dd0) on address 0x77ca5e5ff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)


Failures: Running GDF - 27:GPU_FAIL_Convolve_S16_U8_5x5.gdf
GPU_FAIL_Convolve_S16_U8_5x5.gdf
runvx 1.0.0
OK: using AMD OpenVX 1.3.0
include GPU_FAIL_Convolve_S16_U8_5x5.gdf
data input_1 = uniform-image:1920,1080,U008,125
data output_1 = image:1920,1080,S016
data input_matrix = convolution:5,5:INIT,{-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;16;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1}
node org.khronos.openvx.custom_convolution input_1 input_matrix output_1
OK: OpenVX using GPU device - 0: gfx1030 [OpenCL 2.0 ] [CL_DEVICE_SVM_CAPABILITIES 0 0]
# ago graph dump BEGIN [internal]
data input_1 = image-uniform:U008,1920,1080,125
data output_1 = image:S016,1920,1080
data input_matrix = convolution:5,5
node com.amd.openvx.Convolve_S16_U8 output_1 input_1 input_matrix attr:AFFINITY:GPU,1
# ago graph dump END [internal]
csv,HEADER ,STATUS, COUNT,cur-ms,avg-ms,min-ms,clenqueue-ms,clwait-ms,clwrite-ms,clread-ms
Memory access fault by GPU node-1 (Agent handle: 0xf31e1d0) on address 0x7960ff5ff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)


Failures: Running GDF - 28:GPU_FAIL_Convolve_S16_U8_7x7.gdf
GPU_FAIL_Convolve_S16_U8_7x7.gdf
runvx 1.0.0
OK: using AMD OpenVX 1.3.0
include GPU_FAIL_Convolve_S16_U8_7x7.gdf
data input_1 = uniform-image:1920,1080,U008,125
data output_1 = image:1920,1080,S016
data input_matrix = convolution:7,7:INIT,{-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;16;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1}
node org.khronos.openvx.custom_convolution input_1 input_matrix output_1
OK: OpenVX using GPU device - 0: gfx1030 [OpenCL 2.0 ] [CL_DEVICE_SVM_CAPABILITIES 0 0]
# ago graph dump BEGIN [internal]
data input_1 = image-uniform:U008,1920,1080,125
data output_1 = image:S016,1920,1080
data input_matrix = convolution:7,7
node com.amd.openvx.Convolve_S16_U8 output_1 input_1 input_matrix attr:AFFINITY:GPU,1
# ago graph dump END [internal]
csv,HEADER ,STATUS, COUNT,cur-ms,avg-ms,min-ms,clenqueue-ms,clwait-ms,clwrite-ms,clread-ms
Memory access fault by GPU node-1 (Agent handle: 0x1ede0140) on address 0x7312d49ff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)


Failures: Running GDF - 29:GPU_FAIL_Convolve_S16_U8_odd.gdf
GPU_FAIL_Convolve_S16_U8_odd.gdf
runvx 1.0.0
OK: using AMD OpenVX 1.3.0
include GPU_FAIL_Convolve_S16_U8_odd.gdf
data input_1 = uniform-image:1920,1080,U008,125
data output_1 = image:1920,1080,S016
data input_matrix = convolution:9,7
node org.khronos.openvx.custom_convolution input_1 input_matrix output_1
OK: OpenVX using GPU device - 0: gfx1030 [OpenCL 2.0 ] [CL_DEVICE_SVM_CAPABILITIES 0 0]
# ago graph dump BEGIN [internal]
data input_1 = image-uniform:U008,1920,1080,125
data output_1 = image:S016,1920,1080
data input_matrix = convolution:9,7
node com.amd.openvx.Convolve_S16_U8 output_1 input_1 input_matrix attr:AFFINITY:GPU,1
# ago graph dump END [internal]
csv,HEADER ,STATUS, COUNT,cur-ms,avg-ms,min-ms,clenqueue-ms,clwait-ms,clwrite-ms,clread-ms
Memory access fault by GPU node-1 (Agent handle: 0x26510dd0) on address 0x736e7efff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)


Failures: Running GDF - 30:GPU_FAIL_Convolve_U8_U8_3x9.gdf
GPU_FAIL_Convolve_U8_U8_3x9.gdf
runvx 1.0.0
OK: using AMD OpenVX 1.3.0
include GPU_FAIL_Convolve_U8_U8_3x9.gdf
data input_1 = uniform-image:1920,1080,U008,125
data output_1 = image:1920,1080,U008
data input_matrix = convolution:3,9
node org.khronos.openvx.custom_convolution input_1 input_matrix output_1
OK: OpenVX using GPU device - 0: gfx1030 [OpenCL 2.0 ] [CL_DEVICE_SVM_CAPABILITIES 0 0]
# ago graph dump BEGIN [internal]
data input_1 = image-uniform:U008,1920,1080,125
data output_1 = image:U008,1920,1080
data input_matrix = convolution:3,9
node com.amd.openvx.Convolve_U8_U8 output_1 input_1 input_matrix attr:AFFINITY:GPU,1
# ago graph dump END [internal]
csv,HEADER ,STATUS, COUNT,cur-ms,avg-ms,min-ms,clenqueue-ms,clwait-ms,clwrite-ms,clread-ms
Memory access fault by GPU node-1 (Agent handle: 0x203d8dc0) on address 0x73fe901ff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)


Failures: Running GDF - 31:GPU_FAIL_Convolve_U8_U8_5x5.gdf
GPU_FAIL_Convolve_U8_U8_5x5.gdf
runvx 1.0.0
OK: using AMD OpenVX 1.3.0
include GPU_FAIL_Convolve_U8_U8_5x5.gdf
data input_1 = uniform-image:1920,1080,U008,125
data output_1 = image:1920,1080,U008
data input_matrix = convolution:5,5:INIT,{-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;16;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1}
node org.khronos.openvx.custom_convolution input_1 input_matrix output_1
OK: OpenVX using GPU device - 0: gfx1030 [OpenCL 2.0 ] [CL_DEVICE_SVM_CAPABILITIES 0 0]
# ago graph dump BEGIN [internal]
data input_1 = image-uniform:U008,1920,1080,125
data output_1 = image:U008,1920,1080
data input_matrix = convolution:5,5
node com.amd.openvx.Convolve_U8_U8 output_1 input_1 input_matrix attr:AFFINITY:GPU,1
# ago graph dump END [internal]
csv,HEADER ,STATUS, COUNT,cur-ms,avg-ms,min-ms,clenqueue-ms,clwait-ms,clwrite-ms,clread-ms
Memory access fault by GPU node-1 (Agent handle: 0x2a4761c0) on address 0x72351c9ff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)


Failures: Running GDF - 32:GPU_FAIL_Convolve_U8_U8_7x7.gdf
GPU_FAIL_Convolve_U8_U8_7x7.gdf
runvx 1.0.0
OK: using AMD OpenVX 1.3.0
include GPU_FAIL_Convolve_U8_U8_7x7.gdf
data input_1 = uniform-image:1920,1080,U008,125
data output_1 = image:1920,1080,U008
data input_matrix = convolution:7,7:INIT,{-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;16;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1;-1}
node org.khronos.openvx.custom_convolution input_1 input_matrix output_1
OK: OpenVX using GPU device - 0: gfx1030 [OpenCL 2.0 ] [CL_DEVICE_SVM_CAPABILITIES 0 0]
# ago graph dump BEGIN [internal]
data input_1 = image-uniform:U008,1920,1080,125
data output_1 = image:U008,1920,1080
data input_matrix = convolution:7,7
node com.amd.openvx.Convolve_U8_U8 output_1 input_1 input_matrix attr:AFFINITY:GPU,1
# ago graph dump END [internal]
csv,HEADER ,STATUS, COUNT,cur-ms,avg-ms,min-ms,clenqueue-ms,clwait-ms,clwrite-ms,clread-ms
Memory access fault by GPU node-1 (Agent handle: 0x2426d130) on address 0x75bcef7ff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)


Failures: Running GDF - 33:GPU_FAIL_Convolve_U8_U8_odd.gdf
GPU_FAIL_Convolve_U8_U8_odd.gdf
runvx 1.0.0
OK: using AMD OpenVX 1.3.0
include GPU_FAIL_Convolve_U8_U8_odd.gdf
data input_1 = uniform-image:1920,1080,U008,125
data output_1 = image:1920,1080,U008
data input_matrix = convolution:9,7
node org.khronos.openvx.custom_convolution input_1 input_matrix output_1
OK: OpenVX using GPU device - 0: gfx1030 [OpenCL 2.0 ] [CL_DEVICE_SVM_CAPABILITIES 0 0]
# ago graph dump BEGIN [internal]
data input_1 = image-uniform:U008,1920,1080,125
data output_1 = image:U008,1920,1080
data input_matrix = convolution:9,7
node com.amd.openvx.Convolve_U8_U8 output_1 input_1 input_matrix attr:AFFINITY:GPU,1
# ago graph dump END [internal]
csv,HEADER ,STATUS, COUNT,cur-ms,avg-ms,min-ms,clenqueue-ms,clwait-ms,clwrite-ms,clread-ms
Memory access fault by GPU node-1 (Agent handle: 0xdfd1dc0) on address 0x7ddca19ff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
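
For triage, the logs above can be reduced to one line per failing GDF. The following is a minimal sketch, assuming the log keeps the `Running GDF - N:name.gdf` and `Memory access fault by GPU node-...` line shapes shown above; the regular expressions and the `summarize` helper are hypothetical and not part of runvx. All eight reported addresses end in `000`, so they are 4 KiB-aligned, which may only mean that the fault is reported at page granularity.

```python
import re

# Line shapes copied from the logs above; both patterns are assumptions
# about how stable that formatting is across runs.
RUN_RE = re.compile(r"Running GDF - (\d+):(\S+\.gdf)")
FAULT_RE = re.compile(
    r"Memory access fault by GPU node-(\d+) .*? on address "
    r"(0x[0-9a-fA-F]+)\. Reason: (.+)"
)


def summarize(log_text):
    """Return a list of (gdf_name, fault_address, reason) tuples."""
    results = []
    current = None
    for line in log_text.splitlines():
        run = RUN_RE.search(line)
        if run:
            current = run.group(2)  # remember which GDF is running
            continue
        fault = FAULT_RE.search(line)
        if fault and current is not None:
            address = int(fault.group(2), 16)
            results.append((current, address, fault.group(3).strip()))
            current = None  # at most one fault per run
    return results


SAMPLE = """Failures: Running GDF - 26:GPU_FAIL_Convolve_S16_U8_3x9.gdf
runvx 1.0.0
Memory access fault by GPU node-1 (Agent handle: 0x3f271dd0) on address 0x77ca5e5ff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
"""

for name, address, reason in summarize(SAMPLE):
    print(f"{name}: fault at {address:#x} ({reason})")
```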

kiritigowda • Jun 18 '25 22:06

@AryanSalmanpour: Can you take a look at this issue? It is happening due to out-of-bounds memory access in the HIP kernels.
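
To illustrate how a convolution window can read outside the image, here is a minimal sketch in plain Python. It is not the MIVisionX HIP kernel and does not follow OpenVX border or scaling semantics. It assumes `convolution:3,9` means 3 columns by 9 rows, uses the correlation form (no kernel flip), and the helper names `read_extent` and `convolve_clamped` are hypothetical. The first function shows which input coordinates an unguarded window would touch; the second clamps every coordinate so that no read leaves the buffer.

```python
def read_extent(width, height, kcols, krows):
    """Return (x_min, x_max, y_min, y_max) of the input coordinates read
    when a kcols x krows window is centred on every output pixel."""
    rx, ry = kcols // 2, krows // 2
    return -rx, width - 1 + rx, -ry, height - 1 + ry


def clamp(value, low, high):
    """Limit value to the closed range [low, high]."""
    return max(low, min(value, high))


def convolve_clamped(image, kernel):
    """Reference filter that replicates edge pixels instead of reading
    past the image (correlation form, for illustration only)."""
    height, width = len(image), len(image[0])
    krows, kcols = len(kernel), len(kernel[0])
    ry, rx = krows // 2, kcols // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0
            for j in range(krows):
                for i in range(kcols):
                    # Clamp so border pixels never index outside the image.
                    sy = clamp(y + j - ry, 0, height - 1)
                    sx = clamp(x + i - rx, 0, width - 1)
                    acc += image[sy][sx] * kernel[j][i]
            out[y][x] = acc
    return out


# A 3-column by 9-row window on a 1920x1080 image would touch one column
# and four rows beyond each edge if the reads were left unguarded.
print(read_extent(1920, 1080, 3, 9))
```

Whether the actual fix should clamp, skip the border region, or pad the buffer depends on the kernel's intended border behaviour, which this thread does not specify.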

rrawther • Jun 19 '25 17:06