AMDMIGraphX
Add pass to convert Uint8 to int8 across operators
~~Related to #1904~~
Originally from when we were parsing in dynamicquantizelinear, but we've found other operators weren't handling uint8 inputs correctly, so we need to add a pass to properly handle the conversion.
This is dependent on the more recent #2888 so that we can handle MatMulInteger, ConvInteger, DynamicQuantizeLinear and uint8 at compile time rather than in the parser.
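For context on why this conversion is safe: shifting a uint8-quantized tensor and its zero point down by 128 gives an int8 tensor that dequantizes to exactly the same values, so inserting the converts across operators shouldn't change numerical results. A minimal standalone check of that identity (illustrative only, not code from this PR):

```cpp
#include <cassert>
#include <cstdint>

int main()
{
    const float scale = 0.05f;
    // A uint8-quantized value and its zero point, e.g. as produced by
    // DynamicQuantizeLinear (zero point range is [0, 255]).
    const uint8_t q_u8  = 200;
    const uint8_t zp_u8 = 128;

    // The equivalent int8 representation: shift both data and zero point by 128.
    const int8_t q_i8  = static_cast<int8_t>(static_cast<int>(q_u8) - 128);
    const int8_t zp_i8 = static_cast<int8_t>(static_cast<int>(zp_u8) - 128);

    // Dequantized values are identical, so downstream quant_dot/quant_convolution
    // results are unchanged by the uint8 -> int8 conversion.
    const float x_from_u8 = scale * (static_cast<int>(q_u8) - static_cast<int>(zp_u8));
    const float x_from_i8 = scale * (static_cast<int>(q_i8) - static_cast<int>(zp_i8));
    assert(x_from_u8 == x_from_i8);
    return 0;
}
```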
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 91.86%. Comparing base (263509b) to head (2cde7cc). Report is 1 commit behind head on develop.
:exclamation: Current head 2cde7cc differs from the pull request's most recent head 2654eae.
Please upload reports for the commit 2654eae to get more accurate results.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           develop    #2826      +/-  ##
===========================================
+ Coverage    91.82%   91.86%   +0.04%
===========================================
  Files          486      480       -6
  Lines        18993    18240     -753
===========================================
- Hits         17440    16757     -683
+ Misses        1553     1483      -70
```
:umbrella: View full report in Codecov by Sentry.
Without this, we end up asking MLIR to create int8 x uint8, which fails at compile time.
Can you explain this in more detail?
> Without this, we end up asking MLIR to create int8 x uint8, which fails at compile time.
> Can you explain this in more detail?
When speaking with the MLIR team, they said their kernels don't support uint8 x int8 quant_dots, or uint8 in general. I was seeing failures with models optimized by ONNX Runtime using int8. This is also related to the drop in performance: when the model fails to compile in ONNX Runtime, it falls back to the ROCm EP and/or CPU.
But why are we getting mixed types like this in the first place?
I am wondering if we should have a pass to fix this up instead of doing this in the frontend parser.
> But why are we getting mixed types like this in the first place?
Because the implementation of dynamicquantizelinear breaks it up into separate instructions that hardcode uint8 as the output type for the zero point, as defined by the ONNX standard.
> I am wondering if we should have a pass to fix this up instead of doing this in the frontend parser.
More testing on this seems to break the ONNX backend tests, since they expect all zero-point outputs to be uint8. So it appears we should instead do a pass that adds the convert whenever one of the inputs to a quantizelinear is a uint8_t type.
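As a rough illustration of the "add in the convert" idea, here's a hypothetical helper in the style of a MIGraphX pass (the function name and exact instruction sequence are mine, not what this PR implements): it widens a uint8 input to int32, shifts it by 128, and narrows it to int8 in front of the consuming instruction.

```cpp
// Hypothetical helper (not the code in this PR): given an instruction `ins`
// that consumes a uint8 `input`, insert a shifted int8 replacement in front
// of it. Widening to int32 first avoids wraparound when subtracting 128.
#include <migraphx/instruction.hpp>
#include <migraphx/literal.hpp>
#include <migraphx/make_op.hpp>
#include <migraphx/module.hpp>

using namespace migraphx;

instruction_ref shift_uint8_input_to_int8(module& m, instruction_ref ins, instruction_ref input)
{
    auto lens = input->get_shape().lens();
    // uint8 -> int32 so the subtraction below cannot wrap.
    auto widened = m.insert_instruction(
        ins, make_op("convert", {{"target_type", shape::int32_type}}), input);
    // Broadcast a literal 128 and shift the data range [0, 255] -> [-128, 127].
    auto shift  = m.add_literal(literal{shape{shape::int32_type, {1}}, {128}});
    auto bshift = m.insert_instruction(
        ins, make_op("multibroadcast", {{"out_lens", lens}}), shift);
    auto shifted = m.insert_instruction(ins, make_op("sub"), widened, bshift);
    // Narrow back down to int8; the consumer's argument can then be replaced
    // with this instruction so MLIR only ever sees int8 operands.
    return m.insert_instruction(
        ins, make_op("convert", {{"target_type", shape::int8_type}}), shifted);
}
```

The next comment describes a cheaper alternative that avoids this extra shift/add/convert chain by rewriting the quantization range instead.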
Moving to draft. Will craft a better matcher after tomorrow morning's coffee. @lakhinderwalia I see your point from Monday's meeting. Looks like we can just target and replace q_min/q_max, as well as the convert at the end, to adjust the data range without requiring the extra shift, add, and converts.
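A rough sketch of that lighter rewrite, assuming the decomposed DynamicQuantizeLinear exposes its clip bounds as float literals and ends in a convert to uint8 (all names here are hypothetical, not this PR's code): swap the 0/255 bounds for -128/127 and retarget the trailing convert to int8, so no extra shift/add/convert chain is needed.

```cpp
// Hypothetical sketch of the q_min/q_max rewrite described above.
// Assumes the clip bounds are scalar-like float literals and `final_convert`
// is the trailing convert-to-uint8 in the decomposed DynamicQuantizeLinear.
#include <migraphx/instruction.hpp>
#include <migraphx/literal.hpp>
#include <migraphx/make_op.hpp>
#include <migraphx/module.hpp>

using namespace migraphx;

void retarget_dql_range_to_int8(module& m,
                                instruction_ref q_min_lit,
                                instruction_ref q_max_lit,
                                instruction_ref final_convert)
{
    // Replace the clip bounds: [0, 255] -> [-128, 127].
    auto new_min = m.add_literal(literal{shape{shape::float_type, {1}}, {-128.0f}});
    auto new_max = m.add_literal(literal{shape{shape::float_type, {1}}, {127.0f}});
    m.replace_instruction(q_min_lit, new_min);
    m.replace_instruction(q_max_lit, new_max);

    // Retarget the trailing convert so the quantized output and zero point
    // come out directly in the int8 range.
    m.replace_instruction(final_convert,
                          make_op("convert", {{"target_type", shape::int8_type}}),
                          final_convert->inputs());
}
```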
Updated this to be a separate pass. Seeing some odd behavior in the added tests when this gets converted to kernels on the GPU. Need to sort out that last bit, but the rest should be good for review.
There appears to have been a bug in how we were parsing which was causing the failures in the verify tests when the datatype is moved from uint8_t to int8_t. The change has been mirrored in the parser tests.
Updated things here for the failing UTs; it turns out it was flagging overflow with dot(y, y) in some test cases.
Alternatively, it looks like I can't get this to update without removing the same_type() check in dot. @pfultz2 let me know if there's an alternative way we should handle this.
@pfultz2 you were right, I needed to use insert here instead of add, along with looking at the quantizelinear as well, as part of the update.
Removed any changes to our dot(), but also found a bug in how we handled the input to quantizelinear. Updated the parse test since I'm now adding a convert on the input to either dot/conv prior to adding in the slice, to avoid type errors; the convert should get optimized out later once the types match up.
Split out the parse_dynamicquantizelinear changes from this one. Retargeted the branch to that PR until that changeset gets in. Will revisit once that's merged.
@TedThemistokleous to review and resolve conflicts
DQL fix + Pass changes
Summary:
gpu::code_object::reduce_min_min_kernel: 2.04166ms / 49 = 0.0416666ms, 13%
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.03947ms / 49 = 0.0416219ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.66411ms / 48 = 0.0346689ms, 10%
gpu::code_object::mlir_quant_dot: 1.35967ms / 47 = 0.0289292ms, 9%
gpu::code_object::convert_kernel: 1.15707ms / 50 = 0.0231415ms, 7%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.15445ms / 49 = 0.0235602ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.14034ms / 49 = 0.0232722ms, 7%
gpu::code_object::mul_kernel: 1.12784ms / 49 = 0.0230171ms, 7%
gpu::code_object::contiguous_kernel: 0.854442ms / 36 = 0.0237345ms, 6%
gpu::code_object::quantizelinear_kernel: 0.833104ms / 36 = 0.0231418ms, 5%
gpu::code_object::layernorm_mul_add_kernel: 0.580536ms / 24 = 0.024189ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.534086ms / 23 = 0.0232211ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.38627ms / 13 = 0.0297131ms, 3%
load: 0.316312ms / 537 = 0.000589036ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.293268ms / 12 = 0.024439ms, 2%
gpu::code_object::dequantizelinear_add_mul_mul_add_mul_exp_add_div_kernel: 0.287277ms / 12 = 0.0239397ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.280463ms / 12 = 0.0233719ms, 2%
multibroadcast: 0.245326ms / 295 = 0.000831613ms, 2%
hip::hip_copy_literal: 0.0990334ms / 150 = 0.000660223ms, 1%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul: 0.0969119ms / 1 = 0.0969119ms, 1%
reshape_lazy: 0.0966945ms / 131 = 0.000738126ms, 1%
transpose: 0.0604593ms / 48 = 0.00125957ms, 1%
slice: 0.04459ms / 36 = 0.00123861ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0248186ms / 1 = 0.0248186ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0238949ms / 1 = 0.0238949ms, 1%
gpu::code_object::gather_kernel: 0.0233956ms / 1 = 0.0233956ms, 1%
@param: 0.00996924ms / 26 = 0.000383432ms, 1%
hip::hip_allocate_memory: 0.00088522ms / 1 = 0.00088522ms, 1%
check_context::migraphx::gpu::context: 0.00072764ms / 1 = 0.00072764ms, 1%
Batch size: 1
Rate: 189.117 inferences/sec
Total time: 5.28774ms
Total instructions time: 16.7771ms
Overhead time: 0.371628ms, -11.4893ms
Overhead: 7%, -217%
[ MIGraphX Version: 2.10.0. ] Complete: bin/driver perf ../int8_models/gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --int8
Summary:
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.05055ms / 49 = 0.0418479ms, 13%
gpu::code_object::reduce_min_min_kernel: 2.04459ms / 49 = 0.0417264ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.66972ms / 48 = 0.0347858ms, 10%
gpu::code_object::mlir_quant_dot: 1.36225ms / 47 = 0.028984ms, 9%
gpu::code_object::convert_kernel: 1.16309ms / 50 = 0.0232618ms, 7%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16019ms / 49 = 0.0236774ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.14452ms / 49 = 0.0233575ms, 7%
gpu::code_object::mul_kernel: 1.13133ms / 49 = 0.0230883ms, 7%
gpu::code_object::contiguous_kernel: 0.860616ms / 36 = 0.023906ms, 6%
gpu::code_object::quantizelinear_kernel: 0.833249ms / 36 = 0.0231458ms, 5%
gpu::code_object::layernorm_mul_add_kernel: 0.583202ms / 24 = 0.0243001ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.537737ms / 23 = 0.0233799ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.385738ms / 13 = 0.0296722ms, 3%
load: 0.314656ms / 537 = 0.000585952ms, 2%
gpu::code_object::dequantizelinear_add_pow_mul_add_mul_tanh_add_mul_mul_kernel: 0.311504ms / 12 = 0.0259587ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.295388ms / 12 = 0.0246157ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.285528ms / 12 = 0.023794ms, 2%
multibroadcast: 0.257628ms / 295 = 0.000873314ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul: 0.100579ms / 1 = 0.100579ms, 1%
hip::hip_copy_literal: 0.0988059ms / 150 = 0.000658706ms, 1%
reshape_lazy: 0.0981314ms / 131 = 0.000749094ms, 1%
transpose: 0.0633442ms / 48 = 0.00131967ms, 1%
slice: 0.0449541ms / 36 = 0.00124872ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0250007ms / 1 = 0.0250007ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0242269ms / 1 = 0.0242269ms, 1%
gpu::code_object::gather_kernel: 0.0238603ms / 1 = 0.0238603ms, 1%
@param: 0.0102222ms / 26 = 0.000393161ms, 1%
check_context::migraphx::gpu::context: 0.0009512ms / 1 = 0.0009512ms, 1%
hip::hip_allocate_memory: 0.0009052ms / 1 = 0.0009052ms, 1%
Batch size: 1
Rate: 190.495 inferences/sec
Total time: 5.24948ms
Total instructions time: 16.8825ms
Overhead time: 0.37222ms, -11.633ms
Overhead: 7%, -222%
[ MIGraphX Version: 2.10.0. ] Complete: bin/driver perf ../int8_models/gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --int8
Not required, but this can still be used as a perf improvement now that #2903 has been added. Will need to determine the speedup after a rebase.
Fixes were pulled out of here for some other changes, but I'm still curious about the perf if we can just auto-convert uint8 / handle matmul correctly now.
Test | Batch | Rate new 63952f | Rate old ae2b02 | Diff | Compare |
---|---|---|---|---|---|
torchvision-resnet50 | 64 | 3,232.60 | 3,232.34 | 0.01% | :white_check_mark: |
torchvision-resnet50_fp16 | 64 | 6,870.53 | 6,875.78 | -0.08% | :white_check_mark: |
torchvision-densenet121 | 32 | 2,428.49 | 2,429.76 | -0.05% | :white_check_mark: |
torchvision-densenet121_fp16 | 32 | 4,044.59 | 4,068.80 | -0.60% | :white_check_mark: |
torchvision-inceptionv3 | 32 | 1,633.77 | 1,636.23 | -0.15% | :white_check_mark: |
torchvision-inceptionv3_fp16 | 32 | 2,742.80 | 2,744.68 | -0.07% | :white_check_mark: |
cadene-inceptionv4 | 16 | 769.95 | 771.82 | -0.24% | :white_check_mark: |
cadene-resnext64x4 | 16 | 802.01 | 802.64 | -0.08% | :white_check_mark: |
slim-mobilenet | 64 | 7,430.92 | 7,438.28 | -0.10% | :white_check_mark: |
slim-nasnetalarge | 64 | 207.02 | 207.40 | -0.18% | :white_check_mark: |
slim-resnet50v2 | 64 | 3,337.00 | 3,328.69 | 0.25% | :white_check_mark: |
bert-mrpc-onnx | 8 | 1,152.71 | 1,149.05 | 0.32% | :white_check_mark: |
bert-mrpc-tf | 1 | 310.84 | 308.98 | 0.60% | :white_check_mark: |
pytorch-examples-wlang-gru | 1 | 413.25 | 416.23 | -0.72% | :white_check_mark: |
pytorch-examples-wlang-lstm | 1 | 387.91 | 374.95 | 3.46% | :high_brightness: |
torchvision-resnet50_1 | 1 | 798.49 | 799.37 | -0.11% | :white_check_mark: |
cadene-dpn92_1 | 1 | 431.50 | 432.81 | -0.30% | :white_check_mark: |
cadene-resnext101_1 | 1 | 379.03 | 378.19 | 0.22% | :white_check_mark: |
onnx-taau-downsample | 1 | 344.37 | 345.37 | -0.29% | :white_check_mark: |
dlrm-criteoterabyte | 1 | 35.00 | 35.05 | -0.13% | :white_check_mark: |
dlrm-criteoterabyte_fp16 | 1 | 57.26 | 57.40 | -0.24% | :white_check_mark: |
agentmodel | 1 | 9,579.00 | 9,772.50 | -1.98% | :white_check_mark: |
unet_fp16 | 2 | 57.91 | 57.79 | 0.21% | :white_check_mark: |
resnet50v1_fp16 | 1 | 931.31 | 939.67 | -0.89% | :white_check_mark: |
resnet50v1_int8 | 1 | 955.32 | 929.28 | 2.80% | :white_check_mark: |
bert_base_cased_fp16 | 64 | 1,136.01 | 1,142.07 | -0.53% | :white_check_mark: |
bert_large_uncased_fp16 | 32 | 350.06 | 351.79 | -0.49% | :white_check_mark: |
bert_large_fp16 | 1 | 208.15 | 209.75 | -0.76% | :white_check_mark: |
distilgpt2_fp16 | 16 | 2,138.98 | 2,151.98 | -0.60% | :white_check_mark: |
yolov5s | 1 | 508.41 | 506.15 | 0.45% | :white_check_mark: |
tinyllama | 1 | 43.35 | 43.34 | 0.02% | :white_check_mark: |
vicuna-fastchat | 1 | 174.09 | 169.35 | 2.80% | :white_check_mark: |
whisper-tiny-encoder | 1 | 410.11 | 411.82 | -0.42% | :white_check_mark: |
whisper-tiny-decoder | 1 | 427.12 | 429.16 | -0.47% | :white_check_mark: |
Check results before merge :high_brightness:
:red_circle:bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
@TedThemistokleous Is this PR still needed?