GEMM -> pointwise (GELU) -> GEMM fusion

Open CharlieL7 opened this issue 1 year ago • 0 comments

From the 22 Feb 2024 performance model review of Distilgpt2:

This is what Paul had suggested, but it can go further because the pointwise output is also used only once. E.g. the pointwise kernel @55 here is only consumed by @57.

Therefore this can be a gemm+pointwise+gemm fusion.
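To make the idea concrete, here is a minimal pure-Python sketch of one way such a fusion could be organized: the intermediate is produced in column chunks, the pointwise op is applied to each chunk, and the chunk is immediately multiplied into an accumulator for the second GEMM, so the full intermediate is never materialized. Everything here is an assumption for illustration: the function names, the tiny shapes, the bias-add before the activation, and the tanh approximation of GELU (the exact formula in kernel @55 may differ). It is not MIGraphX or MLIR code.

```python
import math
import random

def gelu(x):
    # Assumed: tanh approximation of GELU; the real kernel may use another form.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matmul(a, b):
    # Naive (m x k) @ (k x n) on lists of lists.
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def unfused(x, w1, bias, w2):
    # GEMM -> bias add + GELU -> GEMM, with the full intermediate materialized.
    h = matmul(x, w1)
    h = [[gelu(v + bias[j]) for j, v in enumerate(row)] for row in h]
    return matmul(h, w2)

def fused_chunked(x, w1, bias, w2, chunk):
    # Same result, but the intermediate only ever exists one column chunk at a time.
    m, hidden, n = len(x), len(w1[0]), len(w2[0])
    out = [[0.0] * n for _ in range(m)]
    for k0 in range(0, hidden, chunk):
        k1 = min(k0 + chunk, hidden)
        w1_cols = [row[k0:k1] for row in w1]          # columns k0..k1 of w1
        tile = matmul(x, w1_cols)                     # first GEMM, partial output
        tile = [[gelu(v + bias[k0 + j]) for j, v in enumerate(row)] for row in tile]
        part = matmul(tile, w2[k0:k1])                # partial sum of second GEMM
        for i in range(m):
            for j in range(n):
                out[i][j] += part[i][j]
    return out

# Tiny hypothetical shapes standing in for the real 348 x 3072 x 768 problem.
random.seed(0)
M, D, HIDDEN, N = 4, 3, 8, 3
x = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(M)]
w1 = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(D)]
bias = [random.uniform(-1, 1) for _ in range(HIDDEN)]
w2 = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(HIDDEN)]

ref = unfused(x, w1, bias, w2)
got = fused_chunked(x, w1, bias, w2, chunk=3)  # chunk need not divide HIDDEN
max_diff = max(abs(r - g) for rr, gg in zip(ref, got) for r, g in zip(rr, gg))
print("max abs difference:", max_diff)
```

The two versions agree up to floating-point summation order, because the pointwise op is elementwise and the second GEMM reduces over exactly the dimension that is being chunked. Whether this structure is profitable on real hardware is a separate question that the sketch does not answer.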

@50 = gpu::code_object[code_object=6584,symbol_name=mlir_reshape_dot,global=36864,local=256,](@44,@48,@49) -> half_type, {348, 3072}, {3072, 1}, target_id=0: 0.0281931ms, 2%
@51 = reshape_lazy[dims={1, 348, 3072}](@50) -> half_type, {1, 348, 3072}, {1069056, 3072, 1}, target_id=0: 0.00051212ms, 1%
@52 = multibroadcast[out_lens={348, 3072},out_dyn_dims={}](@47) -> half_type, {348, 3072}, {0, 1}, target_id=0: 0.00054692ms, 1%
@53 = reshape_lazy[dims={1, 348, 3072}](@52) -> half_type, {1, 348, 3072}, {0, 0, 1}, target_id=0: 0.00049298ms, 1%
@54 = load[offset=1069056,end=3207168](@1) -> half_type, {1, 348, 3072}, {1069056, 3072, 1}, target_id=0: 0.00038882ms, 1%
@55 = gpu::code_object[code_object=5136,symbol_name=add_mul_mul_mul_mul_add_neg_sub_exp_add_div_mul_kernel,global=534528,local=1024,](@51,@53,@54) -> half_type, {1, 348, 3072}, {1069056, 3072, 1}, target_id=0: 0.0134347ms, 1%
@56 = load[offset=3207168,end=3741696](@1) -> half_type, {348, 768}, {768, 1}, target_id=0: 0.00054628ms, 1%
@57 = gpu::code_object[code_object=5240,symbol_name=mlir_reshape_dot,global=67584,local=256,](@55,@46,@56) -> half_type, {348, 768}, {768, 1}, target_id=0: 0.0325462ms, 3%
  • It may be possible to fuse this in MLIR using split_k. However, it might not be a performance improvement, given how the k dimension (3072) compares to the row dimension (348).
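For a rough sense of scale, the following back-of-the-envelope calculation uses only the shapes, offsets, and timings from the trace above. The variable names are illustrative, and the numbers describe the current unfused kernels only; they are not a prediction of what a fused kernel would achieve.

```python
# Figures copied from the trace in this issue; names are illustrative.
ROWS, K = 348, 3072            # shape of the pointwise output @55 (rows x k)
BYTES_PER_HALF = 2             # half_type is 16 bits wide

# Size of the intermediate that fusion would avoid writing and re-reading.
intermediate_bytes = ROWS * K * BYTES_PER_HALF

# The buffer @54 backing @55 spans offset=1069056 .. end=3207168.
load_bytes = 3207168 - 1069056

# Reported per-kernel times in milliseconds for @50, @55, @57.
gemm1_ms, pointwise_ms, gemm2_ms = 0.0281931, 0.0134347, 0.0325462
chain_ms = gemm1_ms + pointwise_ms + gemm2_ms
pointwise_share = pointwise_ms / chain_ms

# How much larger the reduction dimension is than the row dimension.
k_to_rows = K / ROWS

print("intermediate bytes:", intermediate_bytes, "| load span:", load_bytes)
print(f"chain time: {chain_ms:.6f} ms, pointwise share: {pointwise_share:.1%}")
print(f"k / rows: {k_to_rows:.2f}")
```

The intermediate is about 2 MiB and matches the span of the `load` at @54, and the pointwise kernel accounts for a bit under a fifth of the three-kernel chain. How much of that a fused kernel recovers depends on how MLIR would handle a reduction dimension roughly nine times larger than the row dimension.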

Deliverables:

  • Check with the MLIR team whether this fusion would be faster and whether it would be supported
  • If it would be faster, implement the fusion

CharlieL7, Feb 22 '24 21:02