AMDMIGraphX
reorder_slice_add_mul matcher
Codecov Report
Attention: Patch coverage is 67.85714% with 9 lines in your changes missing coverage. Please review.
Project coverage is 92.38%. Comparing base (768b4dd) to head (e9ba1c8).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/simplify_algebra.cpp | 67.85% | 9 Missing :warning: |
Additional details and impacted files
```diff
@@           Coverage Diff            @@
##           develop    #3478   +/-  ##
===========================================
- Coverage    92.41%   92.38%   -0.04%
===========================================
  Files          520      520
  Lines        22472    22498      +26
===========================================
+ Hits         20767    20784      +17
- Misses        1705     1714       +9
```
:umbrella: View full report in Codecov by Sentry.
Unit tests need to be added.
Am I missing something for this PR or is it supposed to be this short?
mul_add - before
```
@39 = dot(@33,@37) -> float_type, {1, 77, 1536}, {118272, 1536, 1}
@41 = add(@39,@40) -> float_type, {1, 77, 1536}, {118272, 1536, 1}
@42 = slice[axes={2},starts={0},ends={768}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@43 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@10) -> float_type, {1, 77, 768}, {0, 0, 1}
@44 = add(@43,@42) -> float_type, {1, 77, 768}, {59136, 768, 1}
@45 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@0) -> float_type, {1, 77, 768}, {0, 0, 0}
@46 = mul(@44,@45) -> float_type, {1, 77, 768}, {59136, 768, 1}
@47 = slice[axes={2},starts={768},ends={1536}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@48 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@11) -> float_type, {1, 77, 768}, {0, 0, 1}
@49 = add(@48,@47) -> float_type, {1, 77, 768}, {59136, 768, 1}
@50 = reshape[dims={1, 77, 12, 64}](@46) -> float_type, {1, 77, 12, 64}, {59136, 768, 64, 1}
```
mul_add - after
```
@41 = add(@39,@40) -> float_type, {1, 77, 1536}, {118272, 1536, 1}
@42 = slice[axes={2},starts={0},ends={768}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@43 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@10) -> float_type, {1, 77, 768}, {0, 0, 1}
@44 = add(@43,@42) -> float_type, {1, 77, 768}, {59136, 768, 1}
@45 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@0) -> float_type, {1, 77, 768}, {0, 0, 0}
@46 = mul(@45,@42) -> float_type, {1, 77, 768}, {59136, 768, 1}
@47 = mul(@45,@43) -> float_type, {1, 77, 768}, {59136, 768, 1}
@48 = add(@46,@47) -> float_type, {1, 77, 768}, {59136, 768, 1}
@49 = slice[axes={2},starts={768},ends={1536}](@41) -> float_type, {1, 77, 768}, {118272, 1536, 1}
@50 = multibroadcast[out_lens={1, 77, 768},out_dyn_dims={}](@11) -> float_type, {1, 77, 768}, {0, 0, 1}
@51 = add(@50,@49) -> float_type, {1, 77, 768}, {59136, 768, 1}
@52 = reshape[dims={1, 77, 12, 64}](@48) -> float_type, {1, 77, 12, 64}, {59136, 768, 64, 1}
```
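For reference, the algebra behind the rewrite in the dump above is just distributing the broadcast multiply over the add, `a * (b + x) == a * x + a * b`, so the result still ends in an `add` like the other branches. A minimal numpy sketch (shapes mirror the dump; the scalar value is made up) checking that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 77, 768)).astype(np.float32)  # slice of the big add (@42)
b = rng.standard_normal((1, 77, 768)).astype(np.float32)  # broadcast bias (@43)
a = np.float32(0.125)                                     # scalar broadcast (@45), value illustrative

before = (b + x) * a          # mul(add(@43, @42), @45): pattern before the rewrite
after = a * x + a * b         # the multiply distributed over the add
assert np.allclose(before, after, atol=1e-5)
```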
Edited your comment @aarushijai to be readable on GitHub.
Can we move the find_splits{} matcher in simplify_algebra.cpp up or have a copy of it run before find_mul_add{}? That should find the fusion opportunity over the slice instructions.
> Can we move the find_splits{} matcher in simplify_algebra.cpp up or have a copy of it run before find_mul_add{}? That should find the fusion opportunity over the slice instructions.
No, because the rewrite happens before the slice exists. The snippet shows what happens after horizontal fusion, not before. There is no slice in the original model; instead there are three dots that take the same inputs, and each one feeds three consecutive adds. Let me see if I can write a small test case to show the issue.
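To illustrate what that horizontal fusion does: dots sharing an input can be fused by concatenating their weight matrices, running a single dot, and recovering each branch with a slice (which is where the slices in the snippet come from). A small numpy sketch of the idea, with all names and shapes illustrative rather than taken from the model:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((77, 1536)).astype(np.float32)    # shared input
w1 = rng.standard_normal((1536, 768)).astype(np.float32)
w2 = rng.standard_normal((1536, 768)).astype(np.float32)
w3 = rng.standard_normal((1536, 768)).astype(np.float32)

# Three separate dots over the same input...
y1, y2, y3 = x @ w1, x @ w2, x @ w3

# ...equal one dot over the concatenated weights, followed by slices.
y = x @ np.concatenate([w1, w2, w3], axis=1)
assert np.allclose(y[:, :768], y1, atol=1e-3)
assert np.allclose(y[:, 768:1536], y2, atol=1e-3)
assert np.allclose(y[:, 1536:], y3, atol=1e-3)
```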
Before -
```
@16 = hip::hip_copy_literal[id=main:@literal:78] -> half_type, {768}, {1}: 0.00109522ms, 1%
@17 = hip::hip_copy_literal[id=main:@literal:59] -> half_type, {768}, {1}: 0.00108192ms, 1%
@18 = slice[axes={2},starts={768},ends={1536}](@15) -> half_type, {24, 77, 768}, {177408, 2304, 1}: 0.00165542ms, 1%
@19 = multibroadcast[out_lens={24, 77, 768},out_dyn_dims={}](@17) -> half_type, {24, 77, 768}, {0, 0, 1}: 0.00094074ms, 1%
@20 = load[offset=18184320,end=21022848](@1) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.00076536ms, 1%
**@21 = gpu::code_object[code_object=5128,symbol_name=add_kernel,global=354816,local=1024,](@19,@18,@20) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.0211362ms, 1%**
@22 = load[offset=11354112,end=14192640](@1) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.00099472ms, 1%
@23 = multibroadcast[out_lens={24, 77, 768},out_dyn_dims={}](@16) -> half_type, {24, 77, 768}, {0, 0, 1}: 0.00182424ms, 1%
@24 = slice[axes={2},starts={0},ends={768}](@15) -> half_type, {24, 77, 768}, {177408, 2304, 1}: 0.00103286ms, 1%
**@25 = gpu::code_object[code_object=5136,symbol_name=mul_add_kernel,global=354816,local=1024,](@24,@23,@22) -> half_type, {24, 77, 768}, {59136, 768, 1}: 0.0413997ms, 1%**
@26 = load[offset=14769216,end=18184320](@1) -> half_type, {24, 12, 77, 77}, {71148, 5929, 77, 1}: 0.00105ms, 1%
@27 = gpu::code_object[code_object=6736,symbol_name=mlir_reshape_transpose_reshape_transpose_dot,global=73728,local=256,](@25,@21,@26) -> half_type, {24, 12, 77, 77}, {71148, 5929, 77, 1}: 0.0248955ms, 1%
...
@32 = load[offset=14769216,end=17607744](@1) -> half_type, {24, 77, 768}, {59136, 768, 1}
@33 = multibroadcast[out_lens={24, 77, 768},out_dyn_dims={}](@31) -> half_type, {24, 77, 768}, {0, 0, 1}
@34 = slice[axes={2},starts={1536},ends={2304}](@14) -> half_type, {24, 77, 768}, {177408, 2304, 1}
**@35 = gpu::code_object[code_object=5128,symbol_name=add_kernel,global=354816,local=1024,](@33,@34,@32) -> half_type, {24, 77, 768}, {59136, 768, 1}**
```
Output with this PR -
```
@135 = gpu::code_object[code_object=6016,symbol_name=mlir_dot_add,global=23040,local=64,](@64,@133,@56,@134) -> half_type, {1, 77, 2304}, {177408, 2304, 1}
@136 = gpu::precompile_op[op=gpu::mlir_op[op=dot],additional_args=1,ignore_modules=0,output_shape=nullopt](@64,@133,@56,@134), [mlir_main:pointwise12] -> half_type, {1, 77, 2304}, {177408, 2304, 1}
@137 = hip::allocate[shape=half_type, {12, 77, 77}, {5929, 77, 1}] -> half_type, {12, 77, 77}, {5929, 77, 1}
@138 = gpu::code_object[code_object=5248,symbol_name=mlir_slice_mul_reshape_transpose_squeeze_slice_reshape_transpose_squeeze_dot,global=9600,local=32,](@135,@137) -> half_type, {12, 77, 77}, {5929, 77, 1}
```
Also, unit tests should be added covering the original test case.
There is still a test case missing for the original issue found in the model, where an add->dot->add chain is horizontally fused with one dot having an additional multiply.
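For anyone writing that test, the computation the missing case needs to cover is roughly: several dots sharing an input, each followed by a bias add, where exactly one branch carries an extra multiply that must be distributed before the dots can be fused. A hypothetical numpy sketch (all names, shapes, and the scale value are illustrative, not from the model):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((77, 64)).astype(np.float32)    # shared input
ws = [rng.standard_normal((64, 32)).astype(np.float32) for _ in range(3)]
bs = [rng.standard_normal((32,)).astype(np.float32) for _ in range(3)]
scale = np.float32(0.125)                               # the extra multiply on one branch

# Branch 0 is mul(add(dot(x, w), b), scale); branches 1 and 2 are plain dot + add.
y0 = (x @ ws[0] + bs[0]) * scale
y1 = x @ ws[1] + bs[1]
y2 = x @ ws[2] + bs[2]

# Distributing the multiply over the add makes branch 0 end in dot + add
# like the others, which is what lets the dots fuse horizontally.
y0_rewritten = (x @ ws[0]) * scale + bs[0] * scale
assert np.allclose(y0, y0_rewritten, atol=1e-4)
```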
| Test | Batch | Rate new (d13d92) | Rate old (3aee3a) | Diff | Compare |
|---|---|---|---|---|---|
| torchvision-resnet50 | 64 | 3,235.35 | 3,235.04 | 0.01% | :white_check_mark: |
| torchvision-resnet50_fp16 | 64 | 6,879.20 | 6,877.66 | 0.02% | :white_check_mark: |
| torchvision-densenet121 | 32 | 2,438.02 | 2,436.52 | 0.06% | :white_check_mark: |
| torchvision-densenet121_fp16 | 32 | 4,203.26 | 4,189.80 | 0.32% | :white_check_mark: |
| torchvision-inceptionv3 | 32 | 1,614.60 | 1,613.28 | 0.08% | :white_check_mark: |
| torchvision-inceptionv3_fp16 | 32 | 2,678.90 | 2,677.62 | 0.05% | :white_check_mark: |
| cadene-inceptionv4 | 16 | 750.56 | 750.18 | 0.05% | :white_check_mark: |
| cadene-resnext64x4 | 16 | 809.32 | 808.82 | 0.06% | :white_check_mark: |
| slim-mobilenet | 64 | 6,661.39 | 6,663.29 | -0.03% | :white_check_mark: |
| slim-nasnetalarge | 64 | 199.02 | 198.98 | 0.02% | :white_check_mark: |
| slim-resnet50v2 | 64 | 3,425.74 | 3,424.55 | 0.03% | :white_check_mark: |
| bert-mrpc-onnx | 8 | 1,144.85 | 1,141.83 | 0.26% | :white_check_mark: |
| bert-mrpc-tf | 1 | 482.28 | 473.12 | 1.94% | :white_check_mark: |
| pytorch-examples-wlang-gru | 1 | 475.99 | 473.81 | 0.46% | :white_check_mark: |
| pytorch-examples-wlang-lstm | 1 | 437.30 | 426.27 | 2.59% | :white_check_mark: |
| torchvision-resnet50_1 | 1 | 805.46 | 796.61 | 1.11% | :white_check_mark: |
| cadene-dpn92_1 | 1 | 429.32 | 428.61 | 0.17% | :white_check_mark: |
| cadene-resnext101_1 | 1 | 389.17 | 389.68 | -0.13% | :white_check_mark: |
| onnx-taau-downsample | 1 | 371.37 | 372.29 | -0.25% | :white_check_mark: |
| dlrm-criteoterabyte | 1 | 31.81 | 30.54 | 4.15% | :high_brightness: |
| dlrm-criteoterabyte_fp16 | 1 | 51.02 | 49.05 | 4.01% | :high_brightness: |
| agentmodel | 1 | 8,615.42 | 8,978.43 | -4.04% | :red_circle: |
| unet_fp16 | 2 | 58.09 | 57.93 | 0.28% | :white_check_mark: |
| resnet50v1_fp16 | 1 | 1,043.98 | 1,010.88 | 3.27% | :high_brightness: |
| resnet50v1_int8 | 1 | 806.93 | 802.03 | 0.61% | :white_check_mark: |
| bert_base_cased_fp16 | 64 | 1,172.91 | 1,172.65 | 0.02% | :white_check_mark: |
| bert_large_uncased_fp16 | 32 | 362.37 | 362.27 | 0.03% | :white_check_mark: |
| bert_large_fp16 | 1 | 200.45 | 198.55 | 0.96% | :white_check_mark: |
| distilgpt2_fp16 | 16 | 2,216.05 | 2,215.63 | 0.02% | :white_check_mark: |
| yolov5s | 1 | 523.13 | 530.86 | -1.46% | :white_check_mark: |
| tinyllama | 1 | 43.60 | 43.55 | 0.11% | :white_check_mark: |
| vicuna-fastchat | 1 | 43.89 | 43.92 | -0.06% | :white_check_mark: |
| whisper-tiny-encoder | 1 | 411.90 | 410.51 | 0.34% | :white_check_mark: |
| whisper-tiny-decoder | 1 | 410.91 | 407.47 | 0.84% | :white_check_mark: |
This build is not recommended to merge :red_circle:
:red_circle:bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
@pfultz2 Please review this so it can be merged