
reorder_slice_add_mul matcher

Open · aarushjain29 opened this pull request · 13 comments

aarushjain29 avatar Sep 25 '24 18:09 aarushjain29

Codecov Report

Attention: Patch coverage is 67.85714% with 9 lines in your changes missing coverage. Please review.

Project coverage is 92.38%. Comparing base (768b4dd) to head (e9ba1c8).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/simplify_algebra.cpp | 67.85% | 9 Missing :warning: |
Additional details and impacted files
```diff
@@             Coverage Diff             @@
##           develop    #3478      +/-   ##
===========================================
- Coverage    92.41%   92.38%   -0.04%
===========================================
  Files          520      520
  Lines        22472    22498      +26
===========================================
+ Hits         20767    20784      +17
- Misses        1705     1714       +9
```

codecov[bot] avatar Sep 25 '24 20:09 codecov[bot]

Unit tests need to be added.

pfultz2 avatar Oct 03 '24 19:10 pfultz2

Am I missing something for this PR or is it supposed to be this short?

CharlieL7 avatar Oct 30 '24 21:10 CharlieL7

mul_add - before

```
@39 = dot(@33,@37) -> float_type, {1, 77, 1536}, {118272, 1536, 1}
@41 = add(@39,@40) -> float_type, {1, 77, 1536}, {118272, 1536, 1}
@42 = slice[axes={2},starts={0},ends={768}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@43 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@10) -> float_type, {1, 77, 768}, {0, 0, 1}
@44 = add(@43,@42) -> float_type, {1, 77, 768}, {59136, 768, 1}
@45 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@0) -> float_type, {1, 77, 768}, {0, 0, 0}
@46 = mul(@44,@45) -> float_type, {1, 77, 768}, {59136, 768, 1}
@47 = slice[axes={2},starts={768},ends={1536}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@48 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@11) -> float_type, {1, 77, 768}, {0, 0, 1}
@49 = add(@48,@47) -> float_type, {1, 77, 768}, {59136, 768, 1}
@50 = reshape[dims={1, 77, 12, 64}](@46) -> float_type, {1, 77, 12, 64}, {59136, 768, 64, 1}
```

mul_add - after

```
@41 = add(@39,@40) -> float_type, {1, 77, 1536}, {118272, 1536, 1}
@42 = slice[axes={2},starts={0},ends={768}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@43 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@10) -> float_type, {1, 77, 768}, {0, 0, 1}
@44 = add(@43,@42) -> float_type, {1, 77, 768}, {59136, 768, 1}
@45 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@0) -> float_type, {1, 77, 768}, {0, 0, 0}
@46 = mul(@45,@42) -> float_type, {1, 77, 768}, {59136, 768, 1}
@47 = mul(@45,@43) -> float_type, {1, 77, 768}, {59136, 768, 1}
@48 = add(@46,@47) -> float_type, {1, 77, 768}, {59136, 768, 1}
@49 = slice[axes={2},starts={768},ends={1536}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@50 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@11) -> float_type, {1, 77, 768}, {0, 0, 1}
@51 = add(@50,@49) -> float_type, {1, 77, 768}, {59136, 768, 1}
@52 = reshape[dims={1, 77, 12, 64}](@48) -> float_type, {1, 77, 12, 64}, {59136, 768, 64, 1}
```
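
For context, the change visible above is the distributive rewrite a * (x + b) -> a*x + a*b: the multiply is applied to the slice output (@42) and the broadcast constant (@43) separately, so both slice branches end in an add and can be merged later. A condensed sketch of such a matcher in the style of MIGraphX's matcher framework (the pattern and bindings are simplified and not the exact code in this PR):

```cpp
#include <migraphx/matcher.hpp>
#include <migraphx/make_op.hpp>

namespace migraphx {

// Sketch only: distribute a constant multiply over an add so that
// mul(add(x, b), a) becomes add(mul(x, a), mul(b, a)).
struct find_mul_add_sketch
{
    auto matcher() const
    {
        return match::name("mul")(match::either_arg(0, 1)(
            match::name("add")(match::either_arg(0, 1)(match::any().bind("x"),
                                                       match::is_constant().bind("b")),
                               match::used_once()),
            match::is_constant().bind("a")));
    }

    void apply(module& m, const match::matcher_result& r) const
    {
        auto ins = r.result;
        auto a   = r.instructions["a"];
        auto b   = r.instructions["b"];
        auto x   = r.instructions["x"];
        // a*x and a*b replace the single mul over the add
        auto ax = m.insert_instruction(ins, make_op("mul"), a, x);
        auto ab = m.insert_instruction(ins, make_op("mul"), a, b);
        m.replace_instruction(ins, make_op("add"), ax, ab);
    }
};

} // namespace migraphx
```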

aarushjain29 avatar Nov 06 '24 16:11 aarushjain29

Edited your comment @aarushjain29 to be readable on GitHub.

CharlieL7 avatar Nov 06 '24 18:11 CharlieL7

Can we move the find_splits{} matcher in simplify_algebra.cpp up or have a copy of it run before find_mul_add{}? That should find the fusion opportunity over the slice instructions.
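
For reference, simplify_algebra dispatches its matchers through match::find_matches, so the suggestion amounts to reordering that list. A rough sketch (the loop count and surrounding structure are assumptions; the matcher list is abbreviated):

```cpp
// Sketch only: reorder the matcher list in simplify_algebra::apply so that
// find_splits{} gets a chance to merge the slice branches before
// find_mul_add{} rewrites them.
void simplify_algebra::apply(module& m) const
{
    for(int i = 0; i < 8; i++) // the pass iterates until simplifications settle
    {
        match::find_matches(m,
                            find_splits{}, // moved (or copied) ahead of find_mul_add
                            find_mul_add{}
                            /* ...remaining matchers unchanged... */);
        dead_code_elimination{}.apply(m);
    }
}
```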

CharlieL7 avatar Nov 06 '24 19:11 CharlieL7

> Can we move the find_splits{} matcher in simplify_algebra.cpp up or have a copy of it run before find_mul_add{}? That should find the fusion opportunity over the slice instructions.

No, because the rewrite happens before the slice is there. The snippet shows what happens after horizontal fusion, not before. There is no slice in the original model; instead there are three dots that take the same inputs, and each one is followed by three consecutive adds. Let me see if I can write a small test case to show the issue.
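
A minimal reproducer along those lines might look like the sketch below (the shapes, parameter names, and use of the C++ module-building API are illustrative assumptions, and the model's three consecutive adds per branch are collapsed to a single add):

```cpp
#include <migraphx/module.hpp>
#include <migraphx/shape.hpp>
#include <migraphx/make_op.hpp>
#include <migraphx/instruction_ref.hpp>
#include <string>
#include <vector>

// Three dots sharing the same input, each followed by an add, with one
// branch carrying an extra mul. Horizontal fusion (find_splits) merges the
// dots into one wider dot and introduces the slices seen in the IR above.
migraphx::module make_test_module()
{
    migraphx::module m;
    migraphx::shape xs{migraphx::shape::float_type, {77, 768}};
    migraphx::shape ws{migraphx::shape::float_type, {768, 768}};
    auto x     = m.add_parameter("x", xs);
    auto scale = m.add_parameter("scale", xs);
    std::vector<migraphx::instruction_ref> branches;
    for(int i = 0; i < 3; i++)
    {
        auto w = m.add_parameter("w" + std::to_string(i), ws);
        auto b = m.add_parameter("b" + std::to_string(i), xs);
        auto d = m.add_instruction(migraphx::make_op("dot"), x, w);
        auto a = m.add_instruction(migraphx::make_op("add"), d, b);
        branches.push_back(a);
    }
    // the extra multiply on one branch is what the new matcher must reorder
    branches[0] = m.add_instruction(migraphx::make_op("mul"), branches[0], scale);
    m.add_return(branches);
    return m;
}
```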

pfultz2 avatar Nov 06 '24 19:11 pfultz2

Before -

```
@16 = hip::hip_copy_literal[id=main:@literal:78] -> half_type, {768}, {1}: 0.00109522ms, 1%
@17 = hip::hip_copy_literal[id=main:@literal:59] -> half_type, {768}, {1}: 0.00108192ms, 1%
@18 = slice[axes={2},starts={768},ends={1536}](@15) -> half_type, {24, 77, 768}, {177408, 2304, 1}: 0.00165542ms, 1%
@19 = multibroadcast[out_lens={24, 77, 768},out_dyn_dims={}](@17) -> half_type, {24, 77, 768}, {0, 0, 1}: 0.00094074ms, 1%
@20 = load[offset=18184320,end=21022848](@1) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.00076536ms, 1%
**@21 = gpu::code_object[code_object=5128,symbol_name=add_kernel,global=354816,local=1024,](@19,@18,@20) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.0211362ms, 1%**
@22 = load[offset=11354112,end=14192640](@1) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.00099472ms, 1%
@23 = multibroadcast[out_lens={24, 77, 768},out_dyn_dims={}](@16) -> half_type, {24, 77, 768}, {0, 0, 1}: 0.00182424ms, 1%
@24 = slice[axes={2},starts={0},ends={768}](@15) -> half_type, {24, 77, 768}, {177408, 2304, 1}: 0.00103286ms, 1%
**@25 = gpu::code_object[code_object=5136,symbol_name=mul_add_kernel,global=354816,local=1024,](@24,@23,@22) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.0413997ms, 1%**
@26 = load[offset=14769216,end=18184320](@1) -> half_type, {24, 12, 77, 77}, {71148, 5929, 77, 1}: 0.00105ms, 1%
@27 = gpu::code_object[code_object=6736,symbol_name=mlir_reshape_transpose_reshape_transpose_dot,global=73728,local=256,](@25,@21,@26) -> half_type, {24, 12, 77, 77}, {71148, 5929, 77, 1}: 0.0248955ms, 1%
...
@32 = load[offset=14769216,end=17607744](@1) -> half_type, {24, 77, 768}, {59136, 768, 1}
@33 = multibroadcast[out_lens={24, 77, 768},out_dyn_dims={}](@31) -> half_type, {24, 77, 768}, {0, 0, 1}
@34 = slice[axes={2},starts={1536},ends={2304}](@14) -> half_type, {24, 77, 768}, {177408, 2304, 1}
**@35 = gpu::code_object[code_object=5128,symbol_name=add_kernel,global=354816,local=1024,](@33,@34,@32) -> half_type, {24, 77, 768}, {59136, 768, 1}**
```

Output with this PR -

```
@135 = gpu::code_object[code_object=6016,symbol_name=mlir_dot_add,global=23040,local=64,](@64,@133,@56,@134) -> half_type, {1, 77, 2304}, {177408, 2304, 1}
@136 = gpu::precompile_op[op=gpu::mlir_op[op=dot],additional_args=1,ignore_modules=0,output_shape=nullopt](@64,@133,@56,@134), [mlir_main:pointwise12] -> half_type, {1, 77, 2304}, {177408, 2304, 1}
@137 = hip::allocate[shape=half_type, {12, 77, 77}, {5929, 77, 1}] -> half_type, {12, 77, 77}, {5929, 77, 1}
@138 = gpu::code_object[code_object=5248,symbol_name=mlir_slice_mul_reshape_transpose_squeeze_slice_reshape_transpose_squeeze_dot,global=9600,local=32,](@135,@137) -> half_type, {12, 77, 77}, {5929, 77, 1}
```

aarushjain29 avatar Nov 08 '24 18:11 aarushjain29

Also, unit tests should be added covering the original test case.

pfultz2 avatar Nov 19 '24 21:11 pfultz2

There is still a test case missing for the original issue found in the model, where an add->dot->add is horizontally fused with one of the dots having an additional multiply.
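
Such a test might follow the pattern of the simplify_algebra test suite, as in this sketch (run_pass is assumed to be the suite's helper that applies simplify_algebra plus dead_code_elimination; make_test_module refers to the hypothetical reproducer sketched earlier):

```cpp
#include <algorithm>

// Sketch only: after simplify_algebra runs, the add->dot->add branches
// (one with the extra multiply) should be horizontally fused into one dot.
TEST_CASE(reorder_slice_add_mul)
{
    migraphx::module m = make_test_module(); // hypothetical reproducer above
    run_pass(m);                             // simplify_algebra + dead_code_elimination
    auto ndots = std::count_if(
        m.begin(), m.end(), [](const auto& ins) { return ins.name() == "dot"; });
    EXPECT(ndots == 1);
}
```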

pfultz2 avatar Nov 26 '24 15:11 pfultz2

| Test | Batch | Rate new (d13d92) | Rate old (3aee3a) | Diff | Compare |
|------|-------|-------------------|-------------------|------|---------|
| torchvision-resnet50 | 64 | 3,235.35 | 3,235.04 | 0.01% | :white_check_mark: |
| torchvision-resnet50_fp16 | 64 | 6,879.20 | 6,877.66 | 0.02% | :white_check_mark: |
| torchvision-densenet121 | 32 | 2,438.02 | 2,436.52 | 0.06% | :white_check_mark: |
| torchvision-densenet121_fp16 | 32 | 4,203.26 | 4,189.80 | 0.32% | :white_check_mark: |
| torchvision-inceptionv3 | 32 | 1,614.60 | 1,613.28 | 0.08% | :white_check_mark: |
| torchvision-inceptionv3_fp16 | 32 | 2,678.90 | 2,677.62 | 0.05% | :white_check_mark: |
| cadene-inceptionv4 | 16 | 750.56 | 750.18 | 0.05% | :white_check_mark: |
| cadene-resnext64x4 | 16 | 809.32 | 808.82 | 0.06% | :white_check_mark: |
| slim-mobilenet | 64 | 6,661.39 | 6,663.29 | -0.03% | :white_check_mark: |
| slim-nasnetalarge | 64 | 199.02 | 198.98 | 0.02% | :white_check_mark: |
| slim-resnet50v2 | 64 | 3,425.74 | 3,424.55 | 0.03% | :white_check_mark: |
| bert-mrpc-onnx | 8 | 1,144.85 | 1,141.83 | 0.26% | :white_check_mark: |
| bert-mrpc-tf | 1 | 482.28 | 473.12 | 1.94% | :white_check_mark: |
| pytorch-examples-wlang-gru | 1 | 475.99 | 473.81 | 0.46% | :white_check_mark: |
| pytorch-examples-wlang-lstm | 1 | 437.30 | 426.27 | 2.59% | :white_check_mark: |
| torchvision-resnet50_1 | 1 | 805.46 | 796.61 | 1.11% | :white_check_mark: |
| cadene-dpn92_1 | 1 | 429.32 | 428.61 | 0.17% | :white_check_mark: |
| cadene-resnext101_1 | 1 | 389.17 | 389.68 | -0.13% | :white_check_mark: |
| onnx-taau-downsample | 1 | 371.37 | 372.29 | -0.25% | :white_check_mark: |
| dlrm-criteoterabyte | 1 | 31.81 | 30.54 | 4.15% | :high_brightness: |
| dlrm-criteoterabyte_fp16 | 1 | 51.02 | 49.05 | 4.01% | :high_brightness: |
| agentmodel | 1 | 8,615.42 | 8,978.43 | -4.04% | :red_circle: |
| unet_fp16 | 2 | 58.09 | 57.93 | 0.28% | :white_check_mark: |
| resnet50v1_fp16 | 1 | 1,043.98 | 1,010.88 | 3.27% | :high_brightness: |
| resnet50v1_int8 | 1 | 806.93 | 802.03 | 0.61% | :white_check_mark: |
| bert_base_cased_fp16 | 64 | 1,172.91 | 1,172.65 | 0.02% | :white_check_mark: |
| bert_large_uncased_fp16 | 32 | 362.37 | 362.27 | 0.03% | :white_check_mark: |
| bert_large_fp16 | 1 | 200.45 | 198.55 | 0.96% | :white_check_mark: |
| distilgpt2_fp16 | 16 | 2,216.05 | 2,215.63 | 0.02% | :white_check_mark: |
| yolov5s | 1 | 523.13 | 530.86 | -1.46% | :white_check_mark: |
| tinyllama | 1 | 43.60 | 43.55 | 0.11% | :white_check_mark: |
| vicuna-fastchat | 1 | 43.89 | 43.92 | -0.06% | :white_check_mark: |
| whisper-tiny-encoder | 1 | 411.90 | 410.51 | 0.34% | :white_check_mark: |
| whisper-tiny-decoder | 1 | 410.91 | 407.47 | 0.84% | :white_check_mark: |

This build is not recommended to merge :red_circle:

migraphx-bot avatar Feb 08 '25 09:02 migraphx-bot


- :white_check_mark: bert-mrpc-onnx: PASSED: MIGraphX meets tolerance
- :white_check_mark: bert-mrpc-tf: PASSED: MIGraphX meets tolerance
- :white_check_mark: pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance
- :white_check_mark: pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance
- :white_check_mark: torchvision-resnet50_1: PASSED: MIGraphX meets tolerance
- :white_check_mark: cadene-dpn92_1: PASSED: MIGraphX meets tolerance
- :white_check_mark: cadene-resnext101_1: PASSED: MIGraphX meets tolerance
- :white_check_mark: dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance
- :white_check_mark: agentmodel: PASSED: MIGraphX meets tolerance
- :white_check_mark: unet: PASSED: MIGraphX meets tolerance
- :white_check_mark: resnet50v1: PASSED: MIGraphX meets tolerance
- :white_check_mark: bert_base_cased_fp16: PASSED: MIGraphX meets tolerance
- :red_circle: bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
- :white_check_mark: bert_large: PASSED: MIGraphX meets tolerance
- :white_check_mark: yolov5s: PASSED: MIGraphX meets tolerance
- :white_check_mark: tinyllama: PASSED: MIGraphX meets tolerance
- :white_check_mark: vicuna-fastchat: PASSED: MIGraphX meets tolerance
- :white_check_mark: whisper-tiny-encoder: PASSED: MIGraphX meets tolerance
- :white_check_mark: whisper-tiny-decoder: PASSED: MIGraphX meets tolerance
- :white_check_mark: distilgpt2_fp16: PASSED: MIGraphX meets tolerance

migraphx-bot avatar Feb 08 '25 09:02 migraphx-bot

@pfultz2 Please review this so it can be merged.

amd-mwu10004 avatar Aug 08 '25 16:08 amd-mwu10004