v_mac in gfx10 architectures not supported
gfx10 does not support the below instructions present in v4r1 and other kernels:
In file included from gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.cpp:1:
In file included from ./common_header.hpp:22:
./amd_inline_asm.hpp:24:21: error: instruction not supported on this GPU
asm volatile("\n \
^
<inline asm>:2:14: note: instantiated into assembly here
v_mac_f32 v65, v95, v99
^
In file included from gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.cpp:1:
In file included from ./common_header.hpp:22:
./amd_inline_asm.hpp:25:36: error: instruction not supported on this GPU
v_mac_f32 %0, %4, %5 \n \
^
<inline asm>:3:14: note: instantiated into assembly here
v_mac_f32 v63, v95, v100
^
In file included from gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.cpp:1:
In file included from ./common_header.hpp:22:
./amd_inline_asm.hpp:26:36: error: instruction not supported on this GPU
v_mac_f32 %1, %4, %6 \n \
^
<inline asm>:4:14: note: instantiated into assembly here
v_mac_f32 v62, v95, v101
^
In file included from gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.cpp:1:
In file included from ./common_header.hpp:22:
./amd_inline_asm.hpp:27:36: error: instruction not supported on this GPU
v_mac_f32 %2, %4, %7 \n \
^
<inline asm>:5:14: note: instantiated into assembly here
v_mac_f32 v61, v95, v102
^
In file included from gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.cpp:1:
In file included from ./common_header.hpp:22:
./amd_inline_asm.hpp:24:21: error: instruction not supported on this GPU
asm volatile("\n \
^
<inline asm>:2:14: note: instantiated into assembly here
v_mac_f32 v56, v96, v99
^
In file included from gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.cpp:1:
In file included from ./common_header.hpp:22:
./amd_inline_asm.hpp:25:36: error: instruction not supported on this GPU
v_mac_f32 %0, %4, %5 \n \
^
<inline asm>:3:14: note: instantiated into assembly here
v_mac_f32 v55, v96, v100
^
In file included from gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.cpp:1:
In file included from ./common_header.hpp:22:
./amd_inline_asm.hpp:26:36: error: instruction not supported on this GPU
v_mac_f32 %1, %4, %6 \n \
^
<inline asm>:4:14: note: instantiated into assembly here
v_mac_f32 v54, v96, v101
^
In file included from gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.cpp:1:
In file included from ./common_header.hpp:22:
./amd_inline_asm.hpp:27:36: error: instruction not supported on this GPU
v_mac_f32 %2, %4, %7 \n \
^
<inline asm>:5:14: note: instantiated into assembly here
v_mac_f32 v53, v96, v102
Hi @daniellowell, v_mac_f32 should be a valid instruction in the gfx10 ISA. I just tested it at the assembler level on our ROCm 3.7 install: echo "v_mac_f32 v2, v3, v4" | /opt/rocm/llvm/bin/llvm-mc -arch=amdgcn -mcpu=gfx1000 -show-encoding -show-inst
Maybe the compile flags are not set correctly for gfx10?
IIRC VOP2 v_mac_f32 should be valid for all gfx10 parts; see https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX10.html. Please inform Dmitry Preobrazhensky if the assembler has issues with it.
Maybe there is an issue in the high-level compiler (which can be considered an intermediate layer between the inline assembly code and the llvm-mc layer). If something is wrong, please open a Jira ticket.
If there is a compiler or assembler issue, it is possible to use v_mad_f32 or v_fmac_f32 as a workaround, I think.
Ah, gfx1030 should have deprecated both v_mad_f32 and v_mac_f32; we need to use v_fmac_f32 instead.
@ltqin Please give me an ETA on when you think this can be completed.
I recommend extending inst_wrappers.inc with a _v_mac_f32 macro and using it in the kernels.
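A minimal sketch of such a wrapper, written as a C++ preprocessor macro for the inline-asm path rather than in inst_wrappers.inc syntax. The names `CK_V_MAC_F32_MNEMONIC` and `mac_f32` are hypothetical, and the sketch assumes the compiler defines `__gfx1030__` and `__AMDGCN__` when building device code for that target. The host fallback is only there so the selection logic can be checked off-GPU.

```cpp
// Hypothetical wrapper: choose the multiply-accumulate mnemonic per target.
// Assumption: __gfx1030__ is defined when compiling device code for gfx1030.
#if defined(__gfx1030__)
#define CK_V_MAC_F32_MNEMONIC "v_fmac_f32"
#else
#define CK_V_MAC_F32_MNEMONIC "v_mac_f32"
#endif

// Computes c += a * b.
inline void mac_f32(float& c, float a, float b)
{
#if defined(__AMDGCN__)
    // Device path: adjacent string literals are concatenated, so the
    // selected mnemonic is spliced into the asm template.
    asm volatile(CK_V_MAC_F32_MNEMONIC " %0, %1, %2" : "+v"(c) : "v"(a), "v"(b));
#else
    // Host fallback so the sketch compiles without an AMDGPU target.
    c += a * b;
#endif
}
```

With this shape, kernels call `mac_f32` and never spell out the mnemonic, so adding another target only touches the `#if` at the top.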
@daniellowell The task seems simple, but there are some strange problems in the test. If the task is not urgent, I will finish it by November 15th. Is that ok?
Directly replacing v_mac_f32 with v_fmac_f32 compiles on gfx1030, but the results fail verification. The HIP version of fp32 also fails verification, while fp16 is correct. Next I will check whether there is a problem with the installation environment.
> but the running results can not be verified

Kernels with inline v_fmac_f32 fail verification?

> the hip version of fp32 also fails...

Do you mean "kernels without inline assembly code"?
>> but the running results can not be verified
>
> Kernels with inline v_fmac_f32 fail verification?

YES

>> the hip version of fp32 also fails...
>
> Do you mean "kernels without inline assembly code"?

YES
Then there is general HIP compilation problem.
Most likely, the v_mac_f32 -> v_fmac_f32 substitution is correct. Actually, it can't be incorrect, except in cases that are VERY sensitive to precision, which is not the case for convolutions. Please go ahead with v_mac_f32 -> v_fmac_f32 for gfx10.
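To illustrate the precision point, here is a host-side sketch (not GPU code) comparing an unfused multiply-add with a fused one. It assumes v_mac_f32 rounds the product before the add while v_fmac_f32 rounds only once; the function names are hypothetical.

```cpp
#include <cmath>

// Unfused: the product is rounded to float before the add.
// The volatile temporary keeps the compiler from contracting this into an FMA.
inline float mac_unfused(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

// Fused: a * b + c is computed exactly and rounded once at the end.
inline float mac_fused(float a, float b, float c)
{
    return std::fma(a, b, c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. The unfused version rounds that to 1 + 2^-11 and returns 0, while the fused version returns 2^-24. A difference of one rounding step per accumulate is what "sensitive to precision" refers to here.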
@ltqin [off-topic][github formatting] Please use one empty line after the citation mark, otherwise the formatting will be incorrect. Valid:
>> Coala eats
>
> shoots
and leaves.
Incorrect:
>> Coala eats
> shoots
and leaves.
@atamazov Okay, I got it
When the flag "CK_USE_AMD_BUFFER_ADDRESSING" is set to zero, the tests pass (both with inline v_fmac_f32 and without inline assembly code). Does "amdgcn_buffer_load_f32X" not work on gfx1030?
@ltqin Please disable buffer_load for gfx1030.
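One way this could look as a config fragment, assuming the compiler defines `__gfx1030__` for that target and that nothing else in the build has already set the flag. The `use_buffer_addressing` constant is a hypothetical helper for illustration.

```cpp
// Sketch: default CK_USE_AMD_BUFFER_ADDRESSING to 0 on gfx1030 only.
// A value passed on the command line (-DCK_USE_AMD_BUFFER_ADDRESSING=...)
// still takes precedence because of the outer #ifndef.
#ifndef CK_USE_AMD_BUFFER_ADDRESSING
#if defined(__gfx1030__)
#define CK_USE_AMD_BUFFER_ADDRESSING 0
#else
#define CK_USE_AMD_BUFFER_ADDRESSING 1
#endif
#endif

// Hypothetical helper so code can branch on the setting with `if constexpr`.
constexpr bool use_buffer_addressing = (CK_USE_AMD_BUFFER_ADDRESSING != 0);
```

Keeping the override behind `#ifndef` makes it easy to re-enable buffer addressing for a single build once the underlying issue is resolved.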
@asroy
> Please disable buffer_load for gfx1030.

This should be a workaround (due to compiler issues), because buffer instructions are supported on gfx1030, right?
@ltqin Is this still an issue with ROCm 6.1.1? If not, can we close the bug? Thanks!