
reorder_slice_add_mul matcher

Open · aarushjain29 opened this pull request · 13 comments

aarushjain29 avatar Sep 25 '24 18:09 aarushjain29

Codecov Report

Attention: Patch coverage is 67.85714% with 9 lines in your changes missing coverage. Please review.

Project coverage is 92.38%. Comparing base (768b4dd) to head (e9ba1c8).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/simplify_algebra.cpp | 67.85% | 9 Missing :warning: |
Additional details and impacted files
```diff
@@             Coverage Diff             @@
##           develop    #3478      +/-   ##
===========================================
- Coverage    92.41%   92.38%   -0.04%
===========================================
  Files          520      520
  Lines        22472    22498      +26
===========================================
+ Hits         20767    20784      +17
- Misses        1705     1714       +9
```

codecov[bot] avatar Sep 25 '24 20:09 codecov[bot]

Unit tests need to be added.

pfultz2 avatar Oct 03 '24 19:10 pfultz2

Am I missing something for this PR or is it supposed to be this short?

CharlieL7 avatar Oct 30 '24 21:10 CharlieL7

mul_add - before

```
@39 = dot(@33,@37) -> float_type, {1, 77, 1536}, {118272, 1536, 1}
@41 = add(@39,@40) -> float_type, {1, 77, 1536}, {118272, 1536, 1}
@42 = slice[axes={2},starts={0},ends={768}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@43 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@10) -> float_type, {1, 77, 768}, {0, 0, 1}
@44 = add(@43,@42) -> float_type, {1, 77, 768}, {59136, 768, 1}
@45 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@0) -> float_type, {1, 77, 768}, {0, 0, 0}
@46 = mul(@44,@45) -> float_type, {1, 77, 768}, {59136, 768, 1}
@47 = slice[axes={2},starts={768},ends={1536}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@48 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@11) -> float_type, {1, 77, 768}, {0, 0, 1}
@49 = add(@48,@47) -> float_type, {1, 77, 768}, {59136, 768, 1}
@50 = reshape[dims={1, 77, 12, 64}](@46) -> float_type, {1, 77, 12, 64}, {59136, 768, 64, 1}
```

mul_add - after

```
@41 = add(@39,@40) -> float_type, {1, 77, 1536}, {118272, 1536, 1}
@42 = slice[axes={2},starts={0},ends={768}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@43 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@10) -> float_type, {1, 77, 768}, {0, 0, 1}
@44 = add(@43,@42) -> float_type, {1, 77, 768}, {59136, 768, 1}
@45 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@0) -> float_type, {1, 77, 768}, {0, 0, 0}
@46 = mul(@45,@42) -> float_type, {1, 77, 768}, {59136, 768, 1}
@47 = mul(@45,@43) -> float_type, {1, 77, 768}, {59136, 768, 1}
@48 = add(@46,@47) -> float_type, {1, 77, 768}, {59136, 768, 1}
@49 = slice[axes={2},starts={768},ends={1536}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@50 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@11) -> float_type, {1, 77, 768}, {0, 0, 1}
@51 = add(@50,@49) -> float_type, {1, 77, 768}, {59136, 768, 1}
@52 = reshape[dims={1, 77, 12, 64}](@48) -> float_type, {1, 77, 12, 64}, {59136, 768, 64, 1}
```
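
For context, the change visible above is the distributive rewrite a * (x + b) -> a*x + a*b: the multiply is applied to the slice output (@42) and the broadcast constant (@43) separately, so both slice branches end in an add and can be merged later. A condensed sketch of such a matcher in the style of MIGraphX's matcher framework (the pattern and bindings are simplified and not the exact code in this PR):

```cpp
#include <migraphx/matcher.hpp>
#include <migraphx/make_op.hpp>

namespace migraphx {

// Sketch only: distribute a constant multiply over an add so that
// mul(add(x, b), a) becomes add(mul(x, a), mul(b, a)).
struct find_mul_add_sketch
{
    auto matcher() const
    {
        return match::name("mul")(match::either_arg(0, 1)(
            match::name("add")(match::either_arg(0, 1)(match::any().bind("x"),
                                                       match::is_constant().bind("b")),
                               match::used_once()),
            match::is_constant().bind("a")));
    }

    void apply(module& m, const match::matcher_result& r) const
    {
        auto ins = r.result;
        auto a   = r.instructions["a"];
        auto b   = r.instructions["b"];
        auto x   = r.instructions["x"];
        // a*x and a*b replace the single mul over the add
        auto ax = m.insert_instruction(ins, make_op("mul"), a, x);
        auto ab = m.insert_instruction(ins, make_op("mul"), a, b);
        m.replace_instruction(ins, make_op("add"), ax, ab);
    }
};

} // namespace migraphx
```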

aarushjain29 avatar Nov 06 '24 16:11 aarushjain29

Edited your comment @aarushjain29 to be readable on GitHub.

CharlieL7 avatar Nov 06 '24 18:11 CharlieL7

Can we move the find_splits{} matcher in simplify_algebra.cpp up or have a copy of it run before find_mul_add{}? That should find the fusion opportunity over the slice instructions.
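
For reference, simplify_algebra dispatches its matchers through match::find_matches, so the suggestion amounts to reordering that list. A rough sketch (the loop count and surrounding structure are assumptions; the matcher list is abbreviated):

```cpp
// Sketch only: reorder the matcher list in simplify_algebra::apply so that
// find_splits{} gets a chance to merge the slice branches before
// find_mul_add{} rewrites them.
void simplify_algebra::apply(module& m) const
{
    for(int i = 0; i < 8; i++) // the pass iterates until simplifications settle
    {
        match::find_matches(m,
                            find_splits{}, // moved (or copied) ahead of find_mul_add
                            find_mul_add{}
                            /* ...remaining matchers unchanged... */);
        dead_code_elimination{}.apply(m);
    }
}
```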

CharlieL7 avatar Nov 06 '24 19:11 CharlieL7

> Can we move the find_splits{} matcher in simplify_algebra.cpp up or have a copy of it run before find_mul_add{}? That should find the fusion opportunity over the slice instructions.

No, because the rewrite happens before the slice is there. The snippet shows what happens after horizontal fusion, not before. There is no slice in the original model; instead there are three dots that take the same inputs, and each one is followed by three consecutive adds. Let me see if I can write a small test case to show the issue.
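
A minimal reproducer along those lines might look like the sketch below (the shapes, parameter names, and use of the C++ module-building API are illustrative assumptions, and the model's three consecutive adds per branch are collapsed to a single add):

```cpp
#include <migraphx/module.hpp>
#include <migraphx/shape.hpp>
#include <migraphx/make_op.hpp>
#include <migraphx/instruction_ref.hpp>
#include <string>
#include <vector>

// Three dots sharing the same input, each followed by an add, with one
// branch carrying an extra mul. Horizontal fusion (find_splits) merges the
// dots into one wider dot and introduces the slices seen in the IR above.
migraphx::module make_test_module()
{
    migraphx::module m;
    migraphx::shape xs{migraphx::shape::float_type, {77, 768}};
    migraphx::shape ws{migraphx::shape::float_type, {768, 768}};
    auto x     = m.add_parameter("x", xs);
    auto scale = m.add_parameter("scale", xs);
    std::vector<migraphx::instruction_ref> branches;
    for(int i = 0; i < 3; i++)
    {
        auto w = m.add_parameter("w" + std::to_string(i), ws);
        auto b = m.add_parameter("b" + std::to_string(i), xs);
        auto d = m.add_instruction(migraphx::make_op("dot"), x, w);
        auto a = m.add_instruction(migraphx::make_op("add"), d, b);
        branches.push_back(a);
    }
    // the extra multiply on one branch is what the new matcher must reorder
    branches[0] = m.add_instruction(migraphx::make_op("mul"), branches[0], scale);
    m.add_return(branches);
    return m;
}
```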

pfultz2 avatar Nov 06 '24 19:11 pfultz2

Before -

```
@16 = hip::hip_copy_literal[id=main:@literal:78] -> half_type, {768}, {1}: 0.00109522ms, 1%
@17 = hip::hip_copy_literal[id=main:@literal:59] -> half_type, {768}, {1}: 0.00108192ms, 1%
@18 = slice[axes={2},starts={768},ends={1536}](@15) -> half_type, {24, 77, 768}, {177408, 2304, 1}: 0.00165542ms, 1%
@19 = multibroadcast[out_lens={24, 77, 768},out_dyn_dims={}](@17) -> half_type, {24, 77, 768}, {0, 0, 1}: 0.00094074ms, 1%
@20 = load[offset=18184320,end=21022848](@1) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.00076536ms, 1%
**@21 = gpu::code_object[code_object=5128,symbol_name=add_kernel,global=354816,local=1024,](@19,@18,@20) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.0211362ms, 1%**
@22 = load[offset=11354112,end=14192640](@1) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.00099472ms, 1%
@23 = multibroadcast[out_lens={24, 77, 768},out_dyn_dims={}](@16) -> half_type, {24, 77, 768}, {0, 0, 1}: 0.00182424ms, 1%
@24 = slice[axes={2},starts={0},ends={768}](@15) -> half_type, {24, 77, 768}, {177408, 2304, 1}: 0.00103286ms, 1%
**@25 = gpu::code_object[code_object=5136,symbol_name=mul_add_kernel,global=354816,local=1024,](@24,@23,@22) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.0413997ms, 1%**
@26 = load[offset=14769216,end=18184320](@1) -> half_type, {24, 12, 77, 77}, {71148, 5929, 77, 1}: 0.00105ms, 1%
@27 = gpu::code_object[code_object=6736,symbol_name=mlir_reshape_transpose_reshape_transpose_dot,global=73728,local=256,](@25,@21,@26) -> half_type, {24, 12, 77, 77}, {71148, 5929, 77, 1}: 0.0248955ms, 1%
...
@32 = load[offset=14769216,end=17607744](@1) -> half_type, {24, 77, 768}, {59136, 768, 1}
@33 = multibroadcast[out_lens={24, 77, 768},out_dyn_dims={}](@31) -> half_type, {24, 77, 768}, {0, 0, 1}
@34 = slice[axes={2},starts={1536},ends={2304}](@14) -> half_type, {24, 77, 768}, {177408, 2304, 1}
**@35 = gpu::code_object[code_object=5128,symbol_name=add_kernel,global=354816,local=1024,](@33,@34,@32) -> half_type, {24, 77, 768}, {59136, 768, 1}**
```

Output with this PR -

```
@135 = gpu::code_object[code_object=6016,symbol_name=mlir_dot_add,global=23040,local=64,](@64,@133,@56,@134) -> half_type, {1, 77, 2304}, {177408, 2304, 1}
@136 = gpu::precompile_op[op=gpu::mlir_op[op=dot],additional_args=1,ignore_modules=0,output_shape=nullopt](@64,@133,@56,@134), [mlir_main:pointwise12] -> half_type, {1, 77, 2304}, {177408, 2304, 1}
@137 = hip::allocate[shape=half_type, {12, 77, 77}, {5929, 77, 1}] -> half_type, {12, 77, 77}, {5929, 77, 1}
@138 = gpu::code_object[code_object=5248,symbol_name=mlir_slice_mul_reshape_transpose_squeeze_slice_reshape_transpose_squeeze_dot,global=9600,local=32,](@135,@137) -> half_type, {12, 77, 77}, {5929, 77, 1}
```

aarushjain29 avatar Nov 08 '24 18:11 aarushjain29

Also, unit tests should be added covering the original test case.

pfultz2 avatar Nov 19 '24 21:11 pfultz2

There is still a test case missing for the original issue found in the model, where an add->dot->add is horizontally fused with one of the dots having an additional multiply.
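
Such a test might follow the pattern of the simplify_algebra test suite, as in this sketch (run_pass is assumed to be the suite's helper that applies simplify_algebra plus dead_code_elimination; make_test_module refers to the hypothetical reproducer sketched earlier):

```cpp
#include <algorithm>

// Sketch only: after simplify_algebra runs, the add->dot->add branches
// (one with the extra multiply) should be horizontally fused into one dot.
TEST_CASE(reorder_slice_add_mul)
{
    migraphx::module m = make_test_module(); // hypothetical reproducer above
    run_pass(m);                             // simplify_algebra + dead_code_elimination
    auto ndots = std::count_if(
        m.begin(), m.end(), [](const auto& ins) { return ins.name() == "dot"; });
    EXPECT(ndots == 1);
}
```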

pfultz2 avatar Nov 26 '24 15:11 pfultz2

| Test | Batch | Rate new (d13d92) | Rate old (3aee3a) | Diff | Compare |
|------|-------|-------------------|-------------------|------|---------|
| torchvision-resnet50 | 64 | 3,235.35 | 3,235.04 | 0.01% | :white_check_mark: |
| torchvision-resnet50_fp16 | 64 | 6,879.20 | 6,877.66 | 0.02% | :white_check_mark: |
| torchvision-densenet121 | 32 | 2,438.02 | 2,436.52 | 0.06% | :white_check_mark: |
| torchvision-densenet121_fp16 | 32 | 4,203.26 | 4,189.80 | 0.32% | :white_check_mark: |
| torchvision-inceptionv3 | 32 | 1,614.60 | 1,613.28 | 0.08% | :white_check_mark: |
| torchvision-inceptionv3_fp16 | 32 | 2,678.90 | 2,677.62 | 0.05% | :white_check_mark: |
| cadene-inceptionv4 | 16 | 750.56 | 750.18 | 0.05% | :white_check_mark: |
| cadene-resnext64x4 | 16 | 809.32 | 808.82 | 0.06% | :white_check_mark: |
| slim-mobilenet | 64 | 6,661.39 | 6,663.29 | -0.03% | :white_check_mark: |
| slim-nasnetalarge | 64 | 199.02 | 198.98 | 0.02% | :white_check_mark: |
| slim-resnet50v2 | 64 | 3,425.74 | 3,424.55 | 0.03% | :white_check_mark: |
| bert-mrpc-onnx | 8 | 1,144.85 | 1,141.83 | 0.26% | :white_check_mark: |
| bert-mrpc-tf | 1 | 482.28 | 473.12 | 1.94% | :white_check_mark: |
| pytorch-examples-wlang-gru | 1 | 475.99 | 473.81 | 0.46% | :white_check_mark: |
| pytorch-examples-wlang-lstm | 1 | 437.30 | 426.27 | 2.59% | :white_check_mark: |
| torchvision-resnet50_1 | 1 | 805.46 | 796.61 | 1.11% | :white_check_mark: |
| cadene-dpn92_1 | 1 | 429.32 | 428.61 | 0.17% | :white_check_mark: |
| cadene-resnext101_1 | 1 | 389.17 | 389.68 | -0.13% | :white_check_mark: |
| onnx-taau-downsample | 1 | 371.37 | 372.29 | -0.25% | :white_check_mark: |
| dlrm-criteoterabyte | 1 | 31.81 | 30.54 | 4.15% | :high_brightness: |
| dlrm-criteoterabyte_fp16 | 1 | 51.02 | 49.05 | 4.01% | :high_brightness: |
| agentmodel | 1 | 8,615.42 | 8,978.43 | -4.04% | :red_circle: |
| unet_fp16 | 2 | 58.09 | 57.93 | 0.28% | :white_check_mark: |
| resnet50v1_fp16 | 1 | 1,043.98 | 1,010.88 | 3.27% | :high_brightness: |
| resnet50v1_int8 | 1 | 806.93 | 802.03 | 0.61% | :white_check_mark: |
| bert_base_cased_fp16 | 64 | 1,172.91 | 1,172.65 | 0.02% | :white_check_mark: |
| bert_large_uncased_fp16 | 32 | 362.37 | 362.27 | 0.03% | :white_check_mark: |
| bert_large_fp16 | 1 | 200.45 | 198.55 | 0.96% | :white_check_mark: |
| distilgpt2_fp16 | 16 | 2,216.05 | 2,215.63 | 0.02% | :white_check_mark: |
| yolov5s | 1 | 523.13 | 530.86 | -1.46% | :white_check_mark: |
| tinyllama | 1 | 43.60 | 43.55 | 0.11% | :white_check_mark: |
| vicuna-fastchat | 1 | 43.89 | 43.92 | -0.06% | :white_check_mark: |
| whisper-tiny-encoder | 1 | 411.90 | 410.51 | 0.34% | :white_check_mark: |
| whisper-tiny-decoder | 1 | 410.91 | 407.47 | 0.84% | :white_check_mark: |

This build is not recommended to merge :red_circle:

migraphx-bot avatar Feb 08 '25 09:02 migraphx-bot


- :white_check_mark: bert-mrpc-onnx: PASSED: MIGraphX meets tolerance
- :white_check_mark: bert-mrpc-tf: PASSED: MIGraphX meets tolerance
- :white_check_mark: pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance
- :white_check_mark: pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance
- :white_check_mark: torchvision-resnet50_1: PASSED: MIGraphX meets tolerance
- :white_check_mark: cadene-dpn92_1: PASSED: MIGraphX meets tolerance
- :white_check_mark: cadene-resnext101_1: PASSED: MIGraphX meets tolerance
- :white_check_mark: dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance
- :white_check_mark: agentmodel: PASSED: MIGraphX meets tolerance
- :white_check_mark: unet: PASSED: MIGraphX meets tolerance
- :white_check_mark: resnet50v1: PASSED: MIGraphX meets tolerance
- :white_check_mark: bert_base_cased_fp16: PASSED: MIGraphX meets tolerance
- :red_circle: bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
- :white_check_mark: bert_large: PASSED: MIGraphX meets tolerance
- :white_check_mark: yolov5s: PASSED: MIGraphX meets tolerance
- :white_check_mark: tinyllama: PASSED: MIGraphX meets tolerance
- :white_check_mark: vicuna-fastchat: PASSED: MIGraphX meets tolerance
- :white_check_mark: whisper-tiny-encoder: PASSED: MIGraphX meets tolerance
- :white_check_mark: whisper-tiny-decoder: PASSED: MIGraphX meets tolerance
- :white_check_mark: distilgpt2_fp16: PASSED: MIGraphX meets tolerance

migraphx-bot avatar Feb 08 '25 09:02 migraphx-bot

@pfultz2 Please review this so it can be merged.

amd-mwu10004 avatar Aug 08 '25 16:08 amd-mwu10004