AMDMIGraphX
Add pass to convert Uint8 to int8 across operators
~~Related to #1904~~
Originally from when we were parsing in dynamicquantizelinear, but we've found other operators weren't handling uint8 inputs correctly, so we need to add a pass to properly handle the conversion.
This is dependent on the more recent #2888 so that we can handle MatMulInteger, ConvInteger, DynamicQuantizeLinear and uint8 at compile time rather than in the parser.
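For context on why this conversion is safe: shifting a uint8-quantized tensor and its zero point down by 128 gives an int8 tensor that dequantizes to exactly the same values, so inserting the converts across operators shouldn't change numerical results. A minimal standalone check of that identity (illustrative only, not code from this PR):

```cpp
#include <cassert>
#include <cstdint>

int main()
{
    const float scale = 0.05f;
    // A uint8-quantized value and its zero point, e.g. as produced by
    // DynamicQuantizeLinear (zero point range is [0, 255]).
    const uint8_t q_u8  = 200;
    const uint8_t zp_u8 = 128;

    // The equivalent int8 representation: shift both data and zero point by 128.
    const int8_t q_i8  = static_cast<int8_t>(static_cast<int>(q_u8) - 128);
    const int8_t zp_i8 = static_cast<int8_t>(static_cast<int>(zp_u8) - 128);

    // Dequantized values are identical, so downstream quant_dot/quant_convolution
    // results are unchanged by the uint8 -> int8 conversion.
    const float x_from_u8 = scale * (static_cast<int>(q_u8) - static_cast<int>(zp_u8));
    const float x_from_i8 = scale * (static_cast<int>(q_i8) - static_cast<int>(zp_i8));
    assert(x_from_u8 == x_from_i8);
    return 0;
}
```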
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 91.86%. Comparing base (263509b) to head (2cde7cc). Report is 1 commit behind head on develop.
:exclamation: Current head 2cde7cc differs from the pull request's most recent head 2654eae.
Please upload reports for the commit 2654eae to get more accurate results.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           develop    #2826      +/-  ##
===========================================
+ Coverage    91.82%   91.86%   +0.04%
===========================================
  Files          486      480       -6
  Lines        18993    18240     -753
===========================================
- Hits         17440    16757     -683
+ Misses        1553     1483      -70
```
:umbrella: View full report in Codecov by Sentry.
Without this, we end up asking MLIR to create int8 x uint8, which fails at compile time.
Can you explain this in more detail?
> Without this, we end up asking MLIR to create int8 x uint8, which fails at compile time.
> Can you explain this in more detail?
When speaking with the MLIR team, they said their kernels don't support uint8 x int8 quant_dots, or uint8 in general. I was seeing failures with models optimized by ONNX Runtime using int8. This is also related to the drop in performance: when the model fails to compile in ONNX Runtime, it falls back to the ROCm EP and/or CPU.
But why are we getting mixed types like this in the first place?
I am wondering if we should have a pass to fix this up instead of doing this in the frontend parser.
> But why are we getting mixed types like this in the first place?
Because the implementation of dynamicquantizelinear breaks it up into separate instructions that hardcode uint8 as the output type for the zero point, as defined by the ONNX standard.
> I am wondering if we should have a pass to fix this up instead of doing this in the frontend parser.
More testing on this seems to break the ONNX backend tests, since they expect all zero-point outputs to be uint8. So it appears we should instead do a pass that adds the convert whenever one of the inputs to a quantizelinear is a uint8_t type.
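As a rough illustration of the "add in the convert" idea, here's a hypothetical helper in the style of a MIGraphX pass (the function name and exact instruction sequence are mine, not what this PR implements): it widens a uint8 input to int32, shifts it by 128, and narrows it to int8 in front of the consuming instruction.

```cpp
// Hypothetical helper (not the code in this PR): given an instruction `ins`
// that consumes a uint8 `input`, insert a shifted int8 replacement in front
// of it. Widening to int32 first avoids wraparound when subtracting 128.
#include <migraphx/instruction.hpp>
#include <migraphx/literal.hpp>
#include <migraphx/make_op.hpp>
#include <migraphx/module.hpp>

using namespace migraphx;

instruction_ref shift_uint8_input_to_int8(module& m, instruction_ref ins, instruction_ref input)
{
    auto lens = input->get_shape().lens();
    // uint8 -> int32 so the subtraction below cannot wrap.
    auto widened = m.insert_instruction(
        ins, make_op("convert", {{"target_type", shape::int32_type}}), input);
    // Broadcast a literal 128 and shift the data range [0, 255] -> [-128, 127].
    auto shift  = m.add_literal(literal{shape{shape::int32_type, {1}}, {128}});
    auto bshift = m.insert_instruction(
        ins, make_op("multibroadcast", {{"out_lens", lens}}), shift);
    auto shifted = m.insert_instruction(ins, make_op("sub"), widened, bshift);
    // Narrow back down to int8; the consumer's argument can then be replaced
    // with this instruction so MLIR only ever sees int8 operands.
    return m.insert_instruction(
        ins, make_op("convert", {{"target_type", shape::int8_type}}), shifted);
}
```

The next comment describes a cheaper alternative that avoids this extra shift/add/convert chain by rewriting the quantization range instead.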
Moving to draft. Will craft a better matcher after tomorrow morning's coffee. @lakhinderwalia I see your point from Monday's meeting. Looks like we can just target and replace q_min/q_max, as well as the convert at the end, to adjust the data range without requiring the extra shift, add, and converts.
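A rough sketch of that lighter rewrite, assuming the decomposed DynamicQuantizeLinear exposes its clip bounds as float literals and ends in a convert to uint8 (all names here are hypothetical, not this PR's code): swap the 0/255 bounds for -128/127 and retarget the trailing convert to int8, so no extra shift/add/convert chain is needed.

```cpp
// Hypothetical sketch of the q_min/q_max rewrite described above.
// Assumes the clip bounds are scalar-like float literals and `final_convert`
// is the trailing convert-to-uint8 in the decomposed DynamicQuantizeLinear.
#include <migraphx/instruction.hpp>
#include <migraphx/literal.hpp>
#include <migraphx/make_op.hpp>
#include <migraphx/module.hpp>

using namespace migraphx;

void retarget_dql_range_to_int8(module& m,
                                instruction_ref q_min_lit,
                                instruction_ref q_max_lit,
                                instruction_ref final_convert)
{
    // Replace the clip bounds: [0, 255] -> [-128, 127].
    auto new_min = m.add_literal(literal{shape{shape::float_type, {1}}, {-128.0f}});
    auto new_max = m.add_literal(literal{shape{shape::float_type, {1}}, {127.0f}});
    m.replace_instruction(q_min_lit, new_min);
    m.replace_instruction(q_max_lit, new_max);

    // Retarget the trailing convert so the quantized output and zero point
    // come out directly in the int8 range.
    m.replace_instruction(final_convert,
                          make_op("convert", {{"target_type", shape::int8_type}}),
                          final_convert->inputs());
}
```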
Updated this to be a separate pass. Seeing some odd behavior in the added tests when this gets converted to kernels on the GPU. Need to sort out that last bit, but the rest should be good for review.
There appears to have been a bug in how we were parsing which was causing the failures in the verify tests when the datatype is moved from uint8_t to int8_t. The change has been mirrored in the parser tests.
Updated things here for the failing UTs; it turns out it was flagging overflow with dot(y, y) in some test cases.
Alternatively, it looks like I can't get this to update without removing the same_type() check in dot. @pfultz2 let me know if there's an alternative way we should handle this.
@pfultz2 you were right, I needed to use insert here instead of add, along with looking at the quantizelinear as well, as part of the update.
Removed any changes to our dot(), but also found a bug in how we handled the input to quantizelinear. Updated the parse test since I'm now adding a convert on the input to either dot/conv prior to adding in the slice, to avoid type errors; the convert should get optimized out later once the types match up.
Split out the parse_dynamicquantizelinear changes from this one. Retargeted the branch to that PR until that changeset gets in. Will revisit once that's merged.
@TedThemistokleous to review and resolve conflicts
DQL fix + Pass changes
Summary:
gpu::code_object::reduce_min_min_kernel: 2.04166ms / 49 = 0.0416666ms, 13%
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.03947ms / 49 = 0.0416219ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.66411ms / 48 = 0.0346689ms, 10%
gpu::code_object::mlir_quant_dot: 1.35967ms / 47 = 0.0289292ms, 9%
gpu::code_object::convert_kernel: 1.15707ms / 50 = 0.0231415ms, 7%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.15445ms / 49 = 0.0235602ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.14034ms / 49 = 0.0232722ms, 7%
gpu::code_object::mul_kernel: 1.12784ms / 49 = 0.0230171ms, 7%
gpu::code_object::contiguous_kernel: 0.854442ms / 36 = 0.0237345ms, 6%
gpu::code_object::quantizelinear_kernel: 0.833104ms / 36 = 0.0231418ms, 5%
gpu::code_object::layernorm_mul_add_kernel: 0.580536ms / 24 = 0.024189ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.534086ms / 23 = 0.0232211ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.38627ms / 13 = 0.0297131ms, 3%
load: 0.316312ms / 537 = 0.000589036ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.293268ms / 12 = 0.024439ms, 2%
gpu::code_object::dequantizelinear_add_mul_mul_add_mul_exp_add_div_kernel: 0.287277ms / 12 = 0.0239397ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.280463ms / 12 = 0.0233719ms, 2%
multibroadcast: 0.245326ms / 295 = 0.000831613ms, 2%
hip::hip_copy_literal: 0.0990334ms / 150 = 0.000660223ms, 1%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul: 0.0969119ms / 1 = 0.0969119ms, 1%
reshape_lazy: 0.0966945ms / 131 = 0.000738126ms, 1%
transpose: 0.0604593ms / 48 = 0.00125957ms, 1%
slice: 0.04459ms / 36 = 0.00123861ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0248186ms / 1 = 0.0248186ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0238949ms / 1 = 0.0238949ms, 1%
gpu::code_object::gather_kernel: 0.0233956ms / 1 = 0.0233956ms, 1%
@param: 0.00996924ms / 26 = 0.000383432ms, 1%
hip::hip_allocate_memory: 0.00088522ms / 1 = 0.00088522ms, 1%
check_context::migraphx::gpu::context: 0.00072764ms / 1 = 0.00072764ms, 1%
Batch size: 1
Rate: 189.117 inferences/sec
Total time: 5.28774ms
Total instructions time: 16.7771ms
Overhead time: 0.371628ms, -11.4893ms
Overhead: 7%, -217%
[ MIGraphX Version: 2.10.0. ] Complete: bin/driver perf ../int8_models/gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --int8
Summary:
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.05055ms / 49 = 0.0418479ms, 13%
gpu::code_object::reduce_min_min_kernel: 2.04459ms / 49 = 0.0417264ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.66972ms / 48 = 0.0347858ms, 10%
gpu::code_object::mlir_quant_dot: 1.36225ms / 47 = 0.028984ms, 9%
gpu::code_object::convert_kernel: 1.16309ms / 50 = 0.0232618ms, 7%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16019ms / 49 = 0.0236774ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.14452ms / 49 = 0.0233575ms, 7%
gpu::code_object::mul_kernel: 1.13133ms / 49 = 0.0230883ms, 7%
gpu::code_object::contiguous_kernel: 0.860616ms / 36 = 0.023906ms, 6%
gpu::code_object::quantizelinear_kernel: 0.833249ms / 36 = 0.0231458ms, 5%
gpu::code_object::layernorm_mul_add_kernel: 0.583202ms / 24 = 0.0243001ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.537737ms / 23 = 0.0233799ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.385738ms / 13 = 0.0296722ms, 3%
load: 0.314656ms / 537 = 0.000585952ms, 2%
gpu::code_object::dequantizelinear_add_pow_mul_add_mul_tanh_add_mul_mul_kernel: 0.311504ms / 12 = 0.0259587ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.295388ms / 12 = 0.0246157ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.285528ms / 12 = 0.023794ms, 2%
multibroadcast: 0.257628ms / 295 = 0.000873314ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul: 0.100579ms / 1 = 0.100579ms, 1%
hip::hip_copy_literal: 0.0988059ms / 150 = 0.000658706ms, 1%
reshape_lazy: 0.0981314ms / 131 = 0.000749094ms, 1%
transpose: 0.0633442ms / 48 = 0.00131967ms, 1%
slice: 0.0449541ms / 36 = 0.00124872ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0250007ms / 1 = 0.0250007ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0242269ms / 1 = 0.0242269ms, 1%
gpu::code_object::gather_kernel: 0.0238603ms / 1 = 0.0238603ms, 1%
@param: 0.0102222ms / 26 = 0.000393161ms, 1%
check_context::migraphx::gpu::context: 0.0009512ms / 1 = 0.0009512ms, 1%
hip::hip_allocate_memory: 0.0009052ms / 1 = 0.0009052ms, 1%
Batch size: 1
Rate: 190.495 inferences/sec
Total time: 5.24948ms
Total instructions time: 16.8825ms
Overhead time: 0.37222ms, -11.633ms
Overhead: 7%, -222%
[ MIGraphX Version: 2.10.0. ] Complete: bin/driver perf ../int8_models/gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --int8
Not required, but this can still be used as a perf improvement now that #2903 has been added. Will need to determine the speedup after a rebase.
Fixes were pulled out of here for some other changes, but I'm still curious about the perf if we can just auto-convert uint8 / handle matmul correctly now.
Test | Batch | Rate new 63952f | Rate old ae2b02 | Diff | Compare |
---|---|---|---|---|---|
torchvision-resnet50 | 64 | 3,232.60 | 3,232.34 | 0.01% | :white_check_mark: |
torchvision-resnet50_fp16 | 64 | 6,870.53 | 6,875.78 | -0.08% | :white_check_mark: |
torchvision-densenet121 | 32 | 2,428.49 | 2,429.76 | -0.05% | :white_check_mark: |
torchvision-densenet121_fp16 | 32 | 4,044.59 | 4,068.80 | -0.60% | :white_check_mark: |
torchvision-inceptionv3 | 32 | 1,633.77 | 1,636.23 | -0.15% | :white_check_mark: |
torchvision-inceptionv3_fp16 | 32 | 2,742.80 | 2,744.68 | -0.07% | :white_check_mark: |
cadene-inceptionv4 | 16 | 769.95 | 771.82 | -0.24% | :white_check_mark: |
cadene-resnext64x4 | 16 | 802.01 | 802.64 | -0.08% | :white_check_mark: |
slim-mobilenet | 64 | 7,430.92 | 7,438.28 | -0.10% | :white_check_mark: |
slim-nasnetalarge | 64 | 207.02 | 207.40 | -0.18% | :white_check_mark: |
slim-resnet50v2 | 64 | 3,337.00 | 3,328.69 | 0.25% | :white_check_mark: |
bert-mrpc-onnx | 8 | 1,152.71 | 1,149.05 | 0.32% | :white_check_mark: |
bert-mrpc-tf | 1 | 310.84 | 308.98 | 0.60% | :white_check_mark: |
pytorch-examples-wlang-gru | 1 | 413.25 | 416.23 | -0.72% | :white_check_mark: |
pytorch-examples-wlang-lstm | 1 | 387.91 | 374.95 | 3.46% | :high_brightness: |
torchvision-resnet50_1 | 1 | 798.49 | 799.37 | -0.11% | :white_check_mark: |
cadene-dpn92_1 | 1 | 431.50 | 432.81 | -0.30% | :white_check_mark: |
cadene-resnext101_1 | 1 | 379.03 | 378.19 | 0.22% | :white_check_mark: |
onnx-taau-downsample | 1 | 344.37 | 345.37 | -0.29% | :white_check_mark: |
dlrm-criteoterabyte | 1 | 35.00 | 35.05 | -0.13% | :white_check_mark: |
dlrm-criteoterabyte_fp16 | 1 | 57.26 | 57.40 | -0.24% | :white_check_mark: |
agentmodel | 1 | 9,579.00 | 9,772.50 | -1.98% | :white_check_mark: |
unet_fp16 | 2 | 57.91 | 57.79 | 0.21% | :white_check_mark: |
resnet50v1_fp16 | 1 | 931.31 | 939.67 | -0.89% | :white_check_mark: |
resnet50v1_int8 | 1 | 955.32 | 929.28 | 2.80% | :white_check_mark: |
bert_base_cased_fp16 | 64 | 1,136.01 | 1,142.07 | -0.53% | :white_check_mark: |
bert_large_uncased_fp16 | 32 | 350.06 | 351.79 | -0.49% | :white_check_mark: |
bert_large_fp16 | 1 | 208.15 | 209.75 | -0.76% | :white_check_mark: |
distilgpt2_fp16 | 16 | 2,138.98 | 2,151.98 | -0.60% | :white_check_mark: |
yolov5s | 1 | 508.41 | 506.15 | 0.45% | :white_check_mark: |
tinyllama | 1 | 43.35 | 43.34 | 0.02% | :white_check_mark: |
vicuna-fastchat | 1 | 174.09 | 169.35 | 2.80% | :white_check_mark: |
whisper-tiny-encoder | 1 | 410.11 | 411.82 | -0.42% | :white_check_mark: |
whisper-tiny-decoder | 1 | 427.12 | 429.16 | -0.47% | :white_check_mark: |
Check results before merge :high_brightness:
:red_circle:bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
@TedThemistokleous Is this PR still needed?