AMDMIGraphX
Dangling quantizelinear from horizontal fusion, BERT and DistilGPT2
- Found during the Inference Model Review meeting
- Seen in bert_base_cased and distilgpt2_fp16 run with our --fp8 flag and probably also --int8
@24 = gpu::code_object[code_object=8920,symbol_name=mlir_quantizelinear_quant_dot_dequantizelinear_add_add,global=1769472,local=256,](@18,@21,@23,@15,@22) -> half_type, {64, 384, 2304}, {884736, 2304, 1}
@25 = load[offset=603979776,end=622854144](@1) -> fp8e4m3fnuz_type, {64, 12, 64, 384}, {294912, 24576, 384, 1}
@26 = slice[axes={2},starts={768},ends={1536}](@24) -> half_type, {64, 384, 768}, {884736, 2304, 1}
@27 = reshape_lazy[dims={64, 384, 12, 64}](@26) -> half_type, {64, 384, 12, 64}, {884736, 2304, 64, 1}
@28 = transpose[permutation={0, 2, 3, 1}](@27) -> half_type, {64, 12, 64, 384}, {884736, 64, 1, 2304}
@29 = gpu::code_object[code_object=6816,symbol_name=quantizelinear_kernel,global=1179648,local=256,](@28,@25) -> fp8e4m3fnuz_type, {64, 12, 64, 384}, {294912, 24576, 384, 1}
@30 = load[offset=150994944,end=603979776](@1) -> float_type, {64, 12, 384, 384}, {1769472, 147456, 384, 1}
@31 = gpu::code_object[code_object=7000,symbol_name=mlir_slice_reshape_transpose_quantizelinear_quant_dot,global=3538944,local=256,](@24,@29,@30) -> float_type, {64, 12, 384, 384}, {1769472, 147456, 384, 1}
- Example from distilgpt2_fp16
- driver command:
bin/driver perf /codes/distilgpt2_1_fp16_gpu.onnx --fp8 --fill1 input_ids --input-dim @input_ids 64 384 --batch 64
- A horizontal fusion of the GEMM instructions occurred, which produces the slice instructions @26 and @31. The quantizelinear kernel remains unfused (a rough sketch of the pattern follows below).
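For context, a minimal NumPy sketch of the pattern, assuming the {64, 384, 2304} output of @24 corresponds to a combined QKV projection; the weight names, scale handling, and reduced batch/sequence sizes are illustrative, not taken from the model:

import numpy as np

batch, seq, hidden, heads, head_dim = 2, 8, 768, 12, 64  # model uses batch=64, seq=384

x   = np.random.randn(batch, seq, hidden).astype(np.float32)
w_q = np.random.randn(hidden, hidden).astype(np.float32)
w_k = np.random.randn(hidden, hidden).astype(np.float32)
w_v = np.random.randn(hidden, hidden).astype(np.float32)

# Before horizontal fusion: three GEMMs on the same input, each of which can
# fuse with its own quantizelinear.
q, k, v = x @ w_q, x @ w_k, x @ w_v

# After horizontal fusion: one wide GEMM (the 2304-wide output of @24), then
# slices to recover the per-projection outputs. The quantizelinear now
# consumes a slice/reshape/transpose of the fused output instead of a GEMM
# output, so it is left behind as its own elementwise kernel (@29 above).
qkv     = x @ np.concatenate([w_q, w_k, w_v], axis=1)     # {batch, seq, 2304}
k_slice = qkv[:, :, hidden:2 * hidden]                    # like @26
k_heads = k_slice.reshape(batch, seq, heads, head_dim).transpose(0, 2, 3, 1)  # like @27/@28
# quantizelinear(k_heads) is the standalone kernel @29 in the snippet above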
Still seeing this dangling quantizelinear after the FP8 OCP->FNUZ changes on MI300, but now it is merged with the elementwise kernels from the OCP->FNUZ conversion (a rough sketch of the core quantize step follows the trace):
Left pane:

@26 = gpu::code_object[code_object=6592,symbol_name=quantizelinear_bit_cast_equal_where_equal_equal_logical_or_where_kernel,global=18874368,local=1024,](@24,@25) -> fp8e4m3fnuz_type, {64, 12, 384, 64}, {294912, 64, 768, 1}: 0.0800469ms, 1%
@27 = load[offset=528482304,end=547356672](@1) -> fp8e4m3fnuz_type, {64, 12, 384, 64}, {294912, 64, 768, 1}: 0.00105982ms, 1%
@28 = slice[axes={2},starts={0},ends={768}](@20) -> half_type, {64, 384, 768}, {884736, 2304, 1}: 0.00133238ms, 1%
@29 = reshape_lazy[dims={64, 384, 12, 64}](@28) -> half_type, {64, 384, 12, 64}, {884736, 2304, 64, 1}: 0.00114656ms, 1%
@30 = transpose[permutation={0, 2, 1, 3}](@29) -> half_type, {64, 12, 384, 64}, {884736, 64, 2304, 1}: 0.00090152ms, 1%
@31 = gpu::code_object[code_object=6592,symbol_name=quantizelinear_bit_cast_equal_where_equal_equal_logical_or_where_kernel,global=18874368,local=1024,](@30,@27) -> fp8e4m3fnuz_type, {64, 12, 384, 64}, {294912, 64, 768, 1}: 0.0805104ms, 1%
@32 = slice[axes={2},starts={768},ends={1536}](@20) -> half_type, {64, 384, 768}, {884736, 2304, 1}: 0.00155408ms, 1%
@33 = reshape_lazy[dims={64, 384, 12, 64}](@32) -> half_type, {64, 384, 12, 64}, {884736, 2304, 64, 1}: 0.00104582ms, 1%
@34 = transpose[permutation={0, 2, 3, 1}](@33) -> half_type, {64, 12, 64, 384}, {884736, 64, 1, 2304}: 0.0008852ms, 1%
@35 = load[offset=509607936,end=528482304](@1) -> fp8e4m3fnuz_type, {64, 12, 64, 384}, {294912, 64, 1, 768}: 0.0007357ms, 1%
@36 = gpu::code_object[code_object=6592,symbol_name=quantizelinear_bit_cast_equal_where_equal_equal_logical_or_where_kernel,global=18874368,local=1024,](@34,@35) -> fp8e4m3fnuz_type, {64, 12, 64, 384}, {294912, 64, 1, 768}: 0.0820693ms, 1%
@37 = load[offset=56623104,end=509607936](@1) -> float_type, {64, 12, 384, 384}, {1769472, 147456, 384, 1}: 0.00107484ms, 1%
@38 = gpu::code_object[code_object=5704,symbol_name=mlir_quant_dot,global=1769472,local=256,output_arg=2,](@31,@36,@37) -> float_type, {64, 12, 384, 384}, {1769472, 147456, 384, 1}: 0.121349ms, 1%

Right pane:

@207 = gpu::gemm[alpha=1,beta=0,compute_fp32=1,trans_batch=0,solution_idx=0](@204,@206,@205) -> half_type, {64, 384, 768}, {294912, 768, 1}: 0.326671ms, 5%
@208 = hip::hip_copy_literal[id=main:@literal:26] -> half_type, {768}, {1}: 0.00063802ms, 1%
@209 = hip::hip_copy_literal[id=main:@literal:27] -> half_type, {768}, {1}: 0.00078066ms, 1%
@210 = hip::hip_copy_literal[id=main:@literal:28] -> half_type, {768}, {1}: 0.0007049ms, 1%
@211 = multibroadcast[out_lens={24576, 768},out_dyn_dims={}](@210) -> half_type, {24576, 768}, {0, 1}: 0.00145816ms, 1%
@212 = load[offset=0,end=37748736](@1) -> half_type, {24576, 768}, {768, 1}: 0.0009969ms, 1%
@213 = reshape_lazy[dims={24576, 768}](@207) -> half_type, {24576, 768}, {768, 1}: 0.0015132ms, 1%
@214 = gpu::code_object[code_object=5064,symbol_name=add_kernel,global=2359296,local=1024,](@213,@211,@212) -> half_type, {24576, 768}, {768, 1}: 0.0338107ms, 1%
@215 = multibroadcast[out_lens={64, 384, 768},out_dyn_dims={}](@208) -> half_type, {64, 384, 768}, {0, 0, 1}: 0.00139476ms, 1%
@216 = multibroadcast[out_lens={64, 384, 768},out_dyn_dims={}](@209) -> half_type, {64, 384, 768}, {0, 0, 1}: 0.0039168ms, 1%
@217 = reshape_lazy[dims={64, 384, 768}](@214) -> half_type, {64, 384, 768}, {294912, 768, 1}: 0.00119782ms, 1%
main:#output_0 = @param:main:#output_0 -> float_type, {64, 384, 768}, {294912, 768, 1}: 0.00112508ms, 1%
@219 = gpu::code_object[code_object=6328,symbol_name=add_layernorm_mul_add_convert_kernel,global=3145728,local=128,](@196,@217,@216,@215,main:#output_0) -> float_type, {64, 384, 768}, {294912, 768, 1}: 0.0510618ms, 1%
@220 = @return(@219)

Summary:
gpu::gemm: 1.96423ms / 6 = 0.327371ms, 26%
gpu::code_object::mlir_dot_add_add_mul_mul_add_mul_exp_add_div: 1.81733ms / 6 = 0.302889ms, 24%
gpu::code_object::mlir_dot_add_add: 1.32117ms / 6 = 0.220196ms, 17%
gpu::code_object::mlir_slice_reshape_transpose_slice_reshape_transpose_dot_mul_where_softmax_slice_reshape_transpose_dot: 1.06223ms / 6 = 0.177039ms, 14%
gpu::code_object::mlir_transpose_reshape_dot_reshape_add_add: 0.539443ms / 6 = 0.0899072ms, 7%
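For reference, the dangling kernel is at its core an elementwise QuantizeLinear into fp8; the bit_cast/equal/where/logical_or pieces in the fused kernel name come from the OCP->FNUZ special-value handling and are not reproduced here. A minimal sketch, assuming a per-tensor scale and using float32 as a stand-in for the fp8 storage type (the scale value and clamp bound are assumptions):

import numpy as np

F8_E4M3FNUZ_MAX = 240.0  # assumed largest finite fp8e4m3fnuz magnitude

def quantize_linear_fp8(x: np.ndarray, scale: float) -> np.ndarray:
    # Elementwise: scale, saturate to the fp8 range, then convert. NumPy has
    # no fp8 dtype, so float32 stands in for the final convert that the GPU
    # kernel performs.
    y = x.astype(np.float32) / scale
    return np.clip(y, -F8_E4M3FNUZ_MAX, F8_E4M3FNUZ_MAX)

x = np.random.randn(64, 12, 64, 384).astype(np.float16)  # shape of @34/@36 above
q = quantize_linear_fp8(x, scale=0.05)                    # scale value is made up

Because this is purely elementwise it is cheap to compute, but when it cannot fuse into the producing or consuming kernel it still costs a full read and write of the tensor.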
Given how the current performance reports for fp8 and int8 on MI300 look, this is a marginal effect compared to the time taken on the fp8/int8 GEMMs. It would be better to focus instead on improving the MLIR GEMM kernels or on finding a way to use hipBLASLt.