DPP combining is not working for addition/subtraction in certain cases [HCC, HIP, LLVM]
This also affects HIP, so maybe I should move this issue there. Actually, it is more of an LLVM issue. I filed it here because I encountered it with hcc, and many hcc issues also apply to HIP.
#include <hc.hpp>
int main()
{
    hc::array_view<int> data(1);
    parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
    {
        int __amdgcn_update_dpp(int old, int src, int dpp_ctrl, int row_mask, int bank_mask, bool bound_ctrl) [[hc]] asm("llvm.amdgcn.update.dpp.i32");
        int d = data[i[0]];
        d = __amdgcn_update_dpp(0, d, 1, 14, 15, false) + d;
        data[i[0]] = d;
    });
    return 0;
}
KMDUMPISA=1 hcc -hc main.cpp
dump-gfx900.isa:
v_mov_b32_dpp v2, v3 quad_perm:[1,0,0,0] row_mask:0xe bank_mask:0xf
v_add_u32_e32 v2, v2, v3
This probably happens because, when the "GCN DPP Combine" pass runs, the v_add instruction is still in the V_ADD_U32_e64 form with an immediate value 0, and it seems the DPP combine pass will not combine that form when the row and bank masks are not full:
# After Instruction Selection:
%7:vgpr_32 = V_MOV_B32_dpp %6:vgpr_32(tied-def 0), killed %8:vgpr_32, 1, 1, 1, 0, implicit $exec
%9:vgpr_32 = V_ADD_U32_e64 killed %7:vgpr_32, %0:vgpr_32, 0, implicit $exec
...
# After SI Fold Operands:
%7:vgpr_32 = V_MOV_B32_dpp %6:vgpr_32(tied-def 0), killed %8:vgpr_32, 1, 1, 1, 0, implicit $exec
%9:vgpr_32 = V_ADD_U32_e64 killed %7:vgpr_32, %0:vgpr_32, 0, implicit $exec
# After GCN DPP Combine:
%7:vgpr_32 = V_MOV_B32_dpp %6:vgpr_32(tied-def 0), killed %8:vgpr_32, 1, 1, 1, 0, implicit $exec
%9:vgpr_32 = V_ADD_U32_e64 killed %7:vgpr_32, %0:vgpr_32, 0, implicit $exec
...
# After SI Shrink Instructions:
%7:vgpr_32 = V_MOV_B32_dpp %6:vgpr_32(tied-def 0), killed %8:vgpr_32, 1, 1, 1, 0, implicit $exec
%9:vgpr_32 = V_ADD_U32_e32 killed %7:vgpr_32, %0:vgpr_32, implicit $exec
Only later is v_add changed to V_ADD_U32_e32, the form in which DPP combine would work, as shown below with dpp_combine.mir.
But this variant, with row and bank masks of 15, is combined:
# After SI Fold Operands:
%7:vgpr_32 = V_MOV_B32_dpp %6:vgpr_32(tied-def 0), killed %8:vgpr_32, 1, 15, 15, -1, implicit $exec
%9:vgpr_32 = V_ADD_U32_e64 killed %7:vgpr_32, %0:vgpr_32, 0, implicit $exec
# After GCN DPP Combine:
%9:vgpr_32 = V_ADD_U32_dpp %11:vgpr_32(tied-def 0), %8:vgpr_32, %0:vgpr_32, 1, 15, 15, 1, implicit $exec
When I change the operation to, for example, xor or max:
d = __amdgcn_update_dpp(0, d, 1, 14, 15, false) ^ d;
d = std::max(__amdgcn_update_dpp(std::numeric_limits<int>::min(), d, 1, 14, 15, false), d);
then DPP combining works:
v_xor_b32_dpp v2, v2, v2 quad_perm:[1,0,0,0] row_mask:0xe bank_mask:0xf
v_max_i32_dpp v2, v2, v2 quad_perm:[1,0,0,0] row_mask:0xe bank_mask:0xf
Xor and max work because the old argument to llvm.amdgcn.update.dpp is the identity of the respective operation. When the source lane is out of bounds or masked off by the row or bank mask, __amdgcn_update_dpp "returns" the identity, so the xor/max operation is a no-op. Hence v_mov_dpp can be combined with v_xor into v_xor_dpp, which behaves equivalently.
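To make the identity argument concrete, here is a minimal host-side sketch (plain C++, not hcc device code) that models both forms on a 64-lane wavefront. It is a simplified model under stated assumptions: EXEC is all ones, only quad_perm controls are handled (so bound_ctrl never matters), and the combined instruction has its old operand tied to the second source, as in the dpp_combine.mir check line. All names (`Wave`, `lane_enabled`, `quad_perm_source`, `mov_dpp_then_op`, `op_dpp`, `MaxInt`, `sample_wave`) are hypothetical helpers, not part of any hcc or LLVM API.

```cpp
#include <algorithm>
#include <array>
#include <functional>
#include <limits>

// Simplified model of a 64-lane wavefront (assumption: EXEC is all ones).
constexpr int kWaveSize = 64;
using Wave = std::array<int, kWaveSize>;

// A lane is written only if its row (16 lanes) and its bank (4 lanes within
// the row) are both enabled in the masks.
inline bool lane_enabled(int lane, unsigned row_mask, unsigned bank_mask) {
    const unsigned row = static_cast<unsigned>(lane) / 16u;
    const unsigned bank = (static_cast<unsigned>(lane) % 16u) / 4u;
    return ((row_mask >> row) & 1u) != 0 && ((bank_mask >> bank) & 1u) != 0;
}

// quad_perm: dpp_ctrl holds four 2-bit selectors, one per lane of the quad.
// dpp_ctrl == 1 therefore means quad_perm:[1,0,0,0].
inline int quad_perm_source(int lane, unsigned dpp_ctrl) {
    const int selector = static_cast<int>((dpp_ctrl >> (2 * (lane & 3))) & 3u);
    return (lane & ~3) + selector;
}

// Uncombined form: v_mov_b32_dpp (disabled lanes receive `old`), then the op.
template <class Op>
Wave mov_dpp_then_op(int old, const Wave& d, unsigned dpp_ctrl,
                     unsigned row_mask, unsigned bank_mask, Op op) {
    Wave result{};
    for (int lane = 0; lane < kWaveSize; ++lane) {
        const int moved = lane_enabled(lane, row_mask, bank_mask)
                              ? d[quad_perm_source(lane, dpp_ctrl)]
                              : old;
        result[lane] = op(moved, d[lane]);
    }
    return result;
}

// Combined form: v_<op>_dpp with `old` tied to the second operand (assumed),
// so disabled lanes simply keep d[lane].
template <class Op>
Wave op_dpp(const Wave& d, unsigned dpp_ctrl, unsigned row_mask,
            unsigned bank_mask, Op op) {
    Wave result = d;
    for (int lane = 0; lane < kWaveSize; ++lane) {
        if (lane_enabled(lane, row_mask, bank_mask)) {
            result[lane] = op(d[quad_perm_source(lane, dpp_ctrl)], d[lane]);
        }
    }
    return result;
}

// Signed max as a function object, mirroring std::max in the example above.
struct MaxInt {
    int operator()(int a, int b) const { return std::max(a, b); }
};

// Arbitrary per-lane test data.
inline Wave sample_wave() {
    Wave w{};
    for (int lane = 0; lane < kWaveSize; ++lane) w[lane] = 7 * lane - 100;
    return w;
}
```

In this model, `mov_dpp_then_op(0, d, 1, 0xe, 0xf, std::bit_xor<int>{})` and `op_dpp(d, 1, 0xe, 0xf, std::bit_xor<int>{})` produce the same wave, and the same holds for `std::plus<int>` with old = 0 and for `MaxInt` with old = `std::numeric_limits<int>::min()`. With a non-identity old (for example 5 with addition) the two forms differ in the masked-off row, which is why the combine is only valid when old is the identity.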
For addition the identity is zero, so it should also work.
Test "old_is_0" from here demonstrates it: https://github.com/llvm/llvm-project/blob/master/llvm/test/CodeGen/AMDGPU/dpp_combine.mir
# CHECK: %10:vgpr_32 = V_ADD_U32_dpp %1, %0, %1, 1, 14, 15, 0, implicit $exec
%0:vgpr_32 = COPY $vgpr0
%1:vgpr_32 = COPY $vgpr1
%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
%9:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 0, implicit $exec
%10:vgpr_32 = V_ADD_U32_e32 %9, %1, implicit $exec
Result from /opt/rocm/hcc/bin/llc -march=amdgcn -mcpu=gfx900 -run-pass=gcn-dpp-combine:
%10:vgpr_32 = V_ADD_U32_dpp %1, %0, %1, 1, 14, 15, 0, implicit $exec
By the way, this combining also happens on gfx803, where v_add modifies vcc, if I am not mistaken. That probably does not matter as long as the vcc result of this v_add is not used.
But it does not seem to work when compiling from LLVM IR.
Also, llvm.amdgcn.update.dpp is not the most convenient solution: if I want to implement, for example, a parallel reduction that takes the binary operation as a template argument, I also need to define the identity value for every possible binary operation. Ideally, it should be possible to generate _dpp instructions without having to supply an identity.
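The extra burden looks roughly like the following host-side sketch. The trait `dpp_identity`, the function objects `MaxOp` and `MinOp`, and the helper `combine_with_neighbor` are hypothetical names invented for illustration; `combine_with_neighbor` only stands in for what `op(__amdgcn_update_dpp(identity, d, ...), d)` would compute for one lane, and is not device code.

```cpp
#include <algorithm>
#include <functional>
#include <limits>

// Hypothetical trait mapping a binary operation to its identity element.
// The primary template is left undefined, so an operation without a
// registered identity fails to compile.
template <class Op> struct dpp_identity;

template <> struct dpp_identity<std::plus<int>> {
    static constexpr int value() { return 0; }
};
template <> struct dpp_identity<std::multiplies<int>> {
    static constexpr int value() { return 1; }
};
template <> struct dpp_identity<std::bit_xor<int>> {
    static constexpr int value() { return 0; }
};
template <> struct dpp_identity<std::bit_and<int>> {
    static constexpr int value() { return -1; }  // all bits set
};

struct MaxOp {
    int operator()(int a, int b) const { return std::max(a, b); }
};
struct MinOp {
    int operator()(int a, int b) const { return std::min(a, b); }
};
template <> struct dpp_identity<MaxOp> {
    static constexpr int value() { return std::numeric_limits<int>::min(); }
};
template <> struct dpp_identity<MinOp> {
    static constexpr int value() { return std::numeric_limits<int>::max(); }
};

// Host stand-in for one reduction step on one lane: if the neighbor lane is
// masked off or out of bounds, the identity is substituted so that the
// operation leaves `own` unchanged.
template <class Op>
int combine_with_neighbor(int neighbor, bool neighbor_valid, int own,
                          Op op = Op{}) {
    const int shifted = neighbor_valid ? neighbor : dpp_identity<Op>::value();
    return op(shifted, own);
}
```

Every new operation passed to such a reduction needs its own `dpp_identity` specialization, which is the boilerplate a direct way of emitting _dpp instructions would avoid.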
A few more cases:
#include <hc.hpp>
int main()
{
    hc::array_view<int> data(1);
    parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
    {
        int __amdgcn_update_dpp(int old, int src, int dpp_ctrl, int row_mask, int bank_mask, bool bound_ctrl) [[hc]] asm("llvm.amdgcn.update.dpp.i32");
        int d = data[i[0]];
        d = hc::__mul24(__amdgcn_update_dpp(1, d, 1, 14, 15, false), d);
        data[i[0]] = d;
    });
    return 0;
}
v_mov_b32_dpp v2, v3 quad_perm:[1,0,0,0] row_mask:0xe bank_mask:0xf
v_mul_i32_i24_e32 v2, v2, v3
Whether v_mul_i32_i24 should be combined is debatable, but according to dpp_combine.mir it should work.
#include <hc.hpp>
int main()
{
    hc::array_view<int> data(1);
    parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
    {
        asm("s_nop 0");
        int __amdgcn_update_dpp(int old, int src, int dpp_ctrl, int row_mask, int bank_mask, bool bound_ctrl) [[hc]] asm("llvm.amdgcn.update.dpp.i32");
        int d = data[0];
        d = __amdgcn_update_dpp(0, d, 1, 14, 15, false) ^ d;
        data[i[0]] = d;
    });
    return 0;
}
v_mov_b32_dpp v4, v2 quad_perm:[1,0,0,0] row_mask:0xe bank_mask:0xf
v_xor_b32_e32 v2, v4, v2
@b-sumner for awareness.