Jakub Kuderski
@qedawkins @antiagainst ISA for the subgroup ops from llpc:

Command: `amdllpc one_workgroup_argmax_subgroup_f32.comp -o /dev/null --gfxip=11.0 -v`

Assembly:

```
.AMDGPU.disasm (size = 6467 bytes)
_amdgpu_cs_main:
BB0_0:
    s_getpc_b64 s[4:5]                        ; BE844700
    s_mov_b32...
```
ISA for gfx90:

```
_amdgpu_cs_main:
BB0_0:
    s_getpc_b64 s[4:5]                        ; BE841C00
    s_mov_b32 s0, s1                          ; BE800001
    s_mov_b32 s1, s5                          ; BE810005
    s_load_dwordx8 s[4:11], s[0:1], 0x0       ; C00E0100 00000000
    v_lshlrev_b32_e32 v6, 2,...
```
Closing due to inactivity
@jtuyls what are the padding amounts for these shapes?

> 1024x128xf32, 2048x128xf32 and 4096x2048xf32
So we only pad the LHS?

```mlir
%12 = iree_tensor_ext.dispatch.tensor.load %10, offsets = [0, 0], sizes = [%9, 2048], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor
```
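For reference, this is roughly what padding only the LHS looks like at the tensor level; a minimal sketch with hypothetical names, shapes, and pad amounts, not taken from the dispatch above:

```mlir
// Hypothetical sketch: pad the dynamic M dimension of a ?x2048 LHS up to a
// multiple of the tile size, leaving the K dimension (2048) untouched.
// %lhs and %pad_rows are illustrative, not values from the actual dispatch.
%cst = arith.constant 0.000000e+00 : f32
%lhs_padded = tensor.pad %lhs low[0, 0] high[%pad_rows, 0] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : f32
} : tensor<?x2048xf32> to tensor<?x2048xf32>
```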
OK, let me run some benchmarks on my own; this is not what I saw in my past experiments.
Ah right, that's why we don't expect to see the same speedup as in my initial experiments.
> These numbers are with tuning as well for both (--iree-codegen-enable-default-tuning-specs=true)

Does this result in pingpong getting selected or not? If it does, then padding might interfere with the cache...
@jtuyls could you rerun this without pingpong?
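If pingpong is only getting picked up through the default tuning specs, one way to compare would be to flip that flag; a hedged sketch, where everything other than the tuning-specs flag (input file, target flags, output names) is illustrative:

```shell
# Baseline: default tuning specs enabled (may select the pingpong pipeline).
iree-compile matmul.mlir --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-codegen-enable-default-tuning-specs=true -o tuned.vmfb

# Comparison run: default tuning specs disabled.
iree-compile matmul.mlir --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-codegen-enable-default-tuning-specs=false -o untuned.vmfb
```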
Another issue with the test IR from https://github.com/iree-org/iree/issues/20835#issuecomment-2937559669 is that the final matmul is `(f32, f32) -> (f32)` and won't use any of the efficient mfma instructions because of the...
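For contrast, a contraction that can hit the efficient MFMA path would look something like the following; a minimal sketch assuming f16 inputs accumulated into f32, with made-up shapes and names:

```mlir
// Hypothetical sketch: f16 x f16 -> f32 matmul. Mixed-precision contractions
// like this can map to the efficient MFMA intrinsics on CDNA GPUs, unlike the
// (f32, f32) -> (f32) matmul discussed above. Shapes and names are illustrative.
func.func @matmul_f16_acc_f32(%lhs: tensor<1024x2048xf16>,
                              %rhs: tensor<2048x128xf16>) -> tensor<1024x128xf32> {
  %cst = arith.constant 0.000000e+00 : f32
  %empty = tensor.empty() : tensor<1024x128xf32>
  %fill = linalg.fill ins(%cst : f32) outs(%empty : tensor<1024x128xf32>) -> tensor<1024x128xf32>
  %mm = linalg.matmul ins(%lhs, %rhs : tensor<1024x2048xf16>, tensor<2048x128xf16>)
                      outs(%fill : tensor<1024x128xf32>) -> tensor<1024x128xf32>
  return %mm : tensor<1024x128xf32>
}
```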