AMDMIGraphX
Integrate hipblaslt calls in gemm op
Codecov Report
Attention: Patch coverage is 91.44444%, with 77 lines in your changes missing coverage. Please review.
Project coverage is 91.82%. Comparing base (f0cb545) to head (f802ed1). Report is 100 commits behind head on develop.
:exclamation: Current head f802ed1 differs from the pull request's most recent head 1913b56.
Please upload reports for the commit 1913b56 to get more accurate results.
| Files | Patch % | Lines |
|---|---|---|
| src/module.cpp | 82.11% | 27 Missing :warning: |
| src/argument.cpp | 0.00% | 7 Missing :warning: |
| src/shape.cpp | 0.00% | 7 Missing :warning: |
| src/cpp_generator.cpp | 40.00% | 6 Missing :warning: |
| src/fuse_pointwise_reduce.cpp | 0.00% | 6 Missing :warning: |
| src/split_reduce.cpp | 94.04% | 5 Missing :warning: |
| src/onnx/parse_matmul.cpp | 96.05% | 3 Missing :warning: |
| src/program.cpp | 70.00% | 3 Missing :warning: |
| src/fuse_reduce.cpp | 95.65% | 2 Missing :warning: |
| src/onnx/parse_convolution.cpp | 97.75% | 2 Missing :warning: |
| ... and 7 more |
Additional details and impacted files
@@ Coverage Diff @@
## develop #2671 +/- ##
===========================================
- Coverage 91.83% 91.82% -0.01%
===========================================
Files 479 486 +7
Lines 18340 18991 +651
===========================================
+ Hits 16842 17438 +596
- Misses 1498 1553 +55
:umbrella: View full report in Codecov by Sentry.
This should be a different operator. We can add a compile_blas pass, similar to compile_miopen (which does the tuning for convolution), that does the tuning through the compile function in the operation interface. We would just insert a placeholder operator during lowering, and then compile_blas would run the compile function to determine which solution is the fastest. This function can then return the time and the needed workspace.
The workspace should be inserted as an allocate instruction so the memory can be reused by non-hipblas kernels. By returning the workspace size, we can insert the needed allocate instruction and then pass it as a parameter to the hipblas operator. A sketch of the idea is shown below.
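Here is a minimal standalone C++ sketch of that flow. It is not the MIGraphX API; the names `compile_result`, `benchmark_solution`, and `compile_gemm` are hypothetical stand-ins that only illustrate how a compile hook could time each candidate hipBLASLt solution and report the fastest one's workspace so the pass can insert the matching allocate instruction.

```cpp
// Hypothetical sketch (not the actual MIGraphX interfaces): lowering inserts a
// placeholder gemm operator; a compile_blas-style pass then calls a compile()
// hook that benchmarks each candidate solution and returns the best time plus
// the workspace it needs, so an allocate instruction can be inserted and the
// buffer passed to the hipblas operator (and reused by other kernels).
#include <cstddef>
#include <iostream>
#include <limits>
#include <vector>

struct compile_result            // what the compile hook hands back to the pass
{
    double time_ms;              // measured time of the best solution
    std::size_t workspace_bytes; // workspace that solution requires
    int solution_index;          // which candidate won
};

// Stand-in for launching and timing one candidate solution; the real pass
// would run the kernel on the GPU and time it.
double benchmark_solution(int /*solution_index*/) { return 1.0; }

// Try every candidate and keep the fastest one.
compile_result compile_gemm(const std::vector<std::size_t>& workspace_per_solution)
{
    compile_result best{std::numeric_limits<double>::max(), 0, -1};
    for(std::size_t i = 0; i < workspace_per_solution.size(); ++i)
    {
        double t = benchmark_solution(static_cast<int>(i));
        if(t < best.time_ms)
            best = {t, workspace_per_solution[i], static_cast<int>(i)};
    }
    return best;
}

int main()
{
    // Pretend three candidate solutions need different workspace sizes.
    auto best = compile_gemm({0, std::size_t{1} << 20, std::size_t{4} << 20});
    // The pass would now insert allocate(best.workspace_bytes) before the gemm
    // instruction and pass that buffer as a parameter to the hipblas operator.
    std::cout << "best solution " << best.solution_index << " needs "
              << best.workspace_bytes << " bytes of workspace\n";
}
```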
| Test | Batch | Rate new f802ed | Rate old bceef1 | Diff | Compare |
|---|---|---|---|---|---|
| torchvision-resnet50 | 64 | 1,713.86 | 1,712.95 | 0.05% | :white_check_mark: |
| torchvision-resnet50_fp16 | 64 | 3,811.66 | 3,811.61 | 0.00% | :white_check_mark: |
| torchvision-densenet121 | 32 | 1,454.57 | 1,458.96 | -0.30% | :white_check_mark: |
| torchvision-densenet121_fp16 | 32 | 2,438.67 | 2,432.65 | 0.25% | :white_check_mark: |
| torchvision-inceptionv3 | 32 | 883.89 | 883.89 | 0.00% | :white_check_mark: |
| torchvision-inceptionv3_fp16 | 32 | 623.23 | 1,381.08 | -54.87% | :red_circle: |
| cadene-inceptionv4 | 16 | 400.41 | 408.06 | -1.87% | :white_check_mark: |
| cadene-resnext64x4 | 16 | 413.56 | 413.67 | -0.03% | :white_check_mark: |
| slim-mobilenet | 64 | 3,172.10 | 3,824.46 | -17.06% | :red_circle: |
| slim-nasnetalarge | 64 | 91.08 | 97.02 | -6.12% | :red_circle: |
| slim-resnet50v2 | 64 | 998.48 | 1,651.41 | -39.54% | :red_circle: |
| bert-mrpc-onnx | 8 | 570.25 | 589.10 | -3.20% | :red_circle: |
| bert-mrpc-tf | 1 | 286.89 | 288.10 | -0.42% | :white_check_mark: |
| pytorch-examples-wlang-gru | 1 | 333.47 | 300.11 | 11.12% | :high_brightness: |
| pytorch-examples-wlang-lstm | 1 | 267.31 | 265.42 | 0.71% | :white_check_mark: |
| torchvision-resnet50_1 | 1 | 455.44 | 437.05 | 4.21% | :high_brightness: |
| cadene-dpn92_1 | 1 | 244.22 | 243.98 | 0.10% | :white_check_mark: |
| cadene-resnext101_1 | 1 | 187.20 | 189.16 | -1.03% | :white_check_mark: |
| onnx-taau-downsample | 1 | 190.07 | 204.02 | -6.84% | :red_circle: |
| dlrm-criteoterabyte | 1 | 10.64 | 22.24 | -52.16% | :red_circle: |
| dlrm-criteoterabyte_fp16 | 1 | 38.45 | 41.43 | -7.20% | :red_circle: |
| agentmodel | 1 | 5,871.30 | 6,319.63 | -7.09% | :red_circle: |
| unet_fp16 | 2 | 33.59 | 33.77 | -0.52% | :white_check_mark: |
| resnet50v1_fp16 | 1 | 527.96 | 549.27 | -3.88% | :red_circle: |
| resnet50v1_int8 | 1 | 458.55 | 456.03 | 0.55% | :white_check_mark: |
| bert_base_cased_fp16 | 64 | 617.62 | 620.67 | -0.49% | :white_check_mark: |
| bert_large_uncased_fp16 | 32 | 193.87 | 193.76 | 0.06% | :white_check_mark: |
| bert_large_fp16 | 1 | 103.80 | 103.76 | 0.04% | :white_check_mark: |
| distilgpt2_fp16 | 16 | 1,188.12 | 1,186.96 | 0.10% | :white_check_mark: |
| yolov5s | 1 | 297.82 | 298.20 | -0.13% | :white_check_mark: |
| tinyllama | 1 | 23.32 | 23.32 | 0.02% | :white_check_mark: |
| vicuna-fastchat | 1 | 133.31 | 133.50 | -0.15% | :white_check_mark: |
| whisper-tiny-encoder | 1 | 241.09 | 240.87 | 0.09% | :white_check_mark: |
| whisper-tiny-decoder | 1 | 89.16 | 245.38 | -63.67% | :red_circle: |
This build is not recommended to merge :red_circle:
:red_circle:bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output