
Warnings upon tuning information mismatch for Convolutions

Open · umangyadav opened this pull request on Oct 19 '22 · 6 comments

This PR adds the MIGRAPHX_ENABLE_TUNING_WARNINGS environment flag. When the flag is set, MIGraphX warns (or errors out) if the saved tuning information was produced with a different version or architecture setup than the machine on which the model is being run.

It warns in the following cases, as sketched below:

  1. MIOpen version mismatch. This relies on the miopenStatusVersionMismatch return code from MIOpen's loadSolution call.
  2. GFX arch and CU count mismatch between the device the code was compiled and saved for and the device it runs on.
  3. MIGraphX version mismatch.

— umangyadav, Oct 19 '22
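
A minimal sketch of how the env-gated checks above might look. Everything here is illustrative: `tuning_metadata`, `check_tuning_info`, and the field names are hypothetical stand-ins, not the PR's actual code; only the `MIGRAPHX_ENABLE_TUNING_WARNINGS` variable comes from the PR itself.

```cpp
// Illustrative sketch only: names below are hypothetical, not MIGraphX code.
#include <cstdlib>
#include <iostream>
#include <string>

struct tuning_metadata
{
    std::string gfx_arch;       // e.g. "gfx908"
    int cu_count;               // compute units of the device
    std::string miopen_version; // MIOpen version the solutions were saved with
};

// The PR gates the warnings behind this environment variable.
bool tuning_warnings_enabled()
{
    return std::getenv("MIGRAPHX_ENABLE_TUNING_WARNINGS") != nullptr;
}

void check_tuning_info(const tuning_metadata& saved, const tuning_metadata& current)
{
    if(not tuning_warnings_enabled())
        return;
    if(saved.gfx_arch != current.gfx_arch)
        std::cerr << "WARNING: tuning info saved for " << saved.gfx_arch
                  << " but running on " << current.gfx_arch << "\n";
    if(saved.cu_count != current.cu_count)
        std::cerr << "WARNING: tuning info saved with " << saved.cu_count
                  << " CUs but this device has " << current.cu_count << " CUs\n";
    if(saved.miopen_version != current.miopen_version)
        std::cerr << "WARNING: tuning info saved with MIOpen "
                  << saved.miopen_version << ", current is "
                  << current.miopen_version << "\n";
}
```

With the flag unset, the checks are skipped entirely; setting `MIGRAPHX_ENABLE_TUNING_WARNINGS=1` in the environment surfaces the mismatches. (In the actual PR, the MIOpen version check relies on the `miopenStatusVersionMismatch` return code from loadSolution rather than a stored string.)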

| Test | Rate new (4b4566) | Rate old (7d6227) | Diff | Compare |
| --- | --- | --- | --- | --- |
| torchvision-resnet50 | 2,246.84 | 2,246.48 | 0.02% | :white_check_mark: |
| torchvision-resnet50_fp16 | 4,866.33 | 4,867.40 | -0.02% | :white_check_mark: |
| torchvision-alexnet | 4,975.74 | 4,978.41 | -0.05% | :red_circle: |
| torchvision-alexnet_fp16 | 25,423.36 | 25,334.39 | 0.35% | :high_brightness: |
| torchvision-densenet121 | 1,817.19 | 1,812.31 | 0.27% | :high_brightness: |
| torchvision-densenet121_fp16 | 3,262.29 | 3,265.52 | -0.10% | :white_check_mark: |
| torchvision-inceptionv3 | 1,145.78 | 1,145.55 | 0.02% | :white_check_mark: |
| torchvision-inceptionv3_fp16 | 2,057.41 | 2,052.23 | 0.25% | :white_check_mark: |
| torchvision-vgg16 | 895.83 | 895.71 | 0.01% | :white_check_mark: |
| torchvision-vgg16_fp16 | 1,742.92 | 1,743.37 | -0.03% | :white_check_mark: |
| cadene-inceptionv4 | 540.47 | 540.56 | -0.02% | :white_check_mark: |
| cadene-resnext64x4 | 579.73 | 579.55 | 0.03% | :high_brightness: |
| slim-mobilenet | 7,166.68 | 7,168.35 | -0.02% | :white_check_mark: |
| slim-nasnetalarge | 220.08 | 220.22 | -0.07% | :white_check_mark: |
| slim-resnet50v2 | 2,341.93 | 2,338.65 | 0.14% | :high_brightness: |
| bert-mrpc-onnx | 807.67 | 807.10 | 0.07% | :high_brightness: |
| bert-mrpc-tf | 302.54 | 302.53 | 0.00% | :white_check_mark: |
| pytorch-examples-wlang-gru | 321.06 | 310.47 | 3.41% | :high_brightness: |
| pytorch-examples-wlang-lstm | 342.56 | 344.19 | -0.47% | :white_check_mark: |
| torchvision-resnet50_1 | 516.36 | 517.21 | -0.17% | :white_check_mark: |
| torchvision-inceptionv3_1 | 308.36 | 307.79 | 0.19% | :white_check_mark: |
| torchvision-vgg16_1 | 463.06 | 462.89 | 0.04% | :white_check_mark: |
| cadene-dpn92_1 | 315.18 | 319.73 | -1.42% | :red_circle: |
| cadene-resnext101_1 | 234.48 | 234.63 | -0.06% | :white_check_mark: |
| slim-vgg16_1 | 64.03 | 64.00 | 0.06% | :high_brightness: |
| slim-mobilenet_1 | 1,482.46 | 1,480.44 | 0.14% | :white_check_mark: |
| slim-inceptionv4_1 | 194.09 | 195.48 | -0.71% | :red_circle: |
| onnx-taau-downsample | 295.87 | 295.70 | 0.06% | :white_check_mark: |

This build is not recommended to merge :red_circle:

— migraphx-bot, Oct 19 '22

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.74%. Comparing base (7d62275) to head (d68b2fe). Report is 711 commits behind head on develop.

:exclamation: Current head d68b2fe differs from the pull request's most recent head 4b45667. Consider uploading reports for the commit 4b45667 to get more accurate results.

Additional details and impacted files
```diff
@@           Coverage Diff            @@
##           develop    #1417   +/-   ##
========================================
  Coverage    92.74%   92.74%
========================================
  Files          447      447
  Lines        14943    14948    +5
========================================
+ Hits         13859    13864    +5
  Misses        1084     1084
```


— codecov[bot], Oct 19 '22

> Apart from CI this looks good.

Yes, I'll fix tidy.

> Just to be clear, since there isn't a description: you're adding better warnings for the convolution operators based on MIOpen version or CU core count?

The warnings are for cases where the tuning information stored in the .mxr file doesn't match the user's configuration. In that case, the user needs to enable the tuning flag to get optimal performance.

> Unit test change is to handle the extra args you added (gfx_arch, cu_count, miopen_version) from how things are serialized?

Yes, the size of the context changes because of that (see the sketch below).

— umangyadav, Oct 21 '22
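
The "size of the context changes" point above can be illustrated with a small sketch. The struct and `serialize` function below are hypothetical stand-ins (MIGraphX's real serialization works differently); only the three field names come from the discussion.

```cpp
// Hypothetical sketch of why the serialized context grows: three extra
// fields ride along with the saved tuning data.
#include <cstdint>
#include <string>
#include <vector>

struct context_info
{
    std::string gfx_arch;       // new field per this PR
    std::uint32_t cu_count;     // new field per this PR
    std::string miopen_version; // new field per this PR
};

// Appending the extra fields changes the byte size of the serialized blob,
// which is why size-sensitive unit tests needed updating.
std::vector<char> serialize(const context_info& info)
{
    std::vector<char> out;
    auto append_str = [&](const std::string& s) {
        out.insert(out.end(), s.begin(), s.end());
        out.push_back('\0'); // simple delimiter, just for this sketch
    };
    append_str(info.gfx_arch);
    append_str(std::to_string(info.cu_count));
    append_str(info.miopen_version);
    return out;
}
```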

All good, just approved. Looks good; I just wanted to make sure I understood everything here.

— TedThemistokleous, Oct 21 '22

> We should throw an error for a different gfx arch when the model has binary code objects, since the model cannot run on a different arch. We will need to check this in finalize so we aren't throwing errors when they aren't applicable.

~I disagree on this one. Users should be able to share/transfer models across different gfx archs. Erroring out would not allow that.~

> We could consider adding a logger similar to MIOpen's, with different levels. The warnings would show by default, but the user could set a higher log level to surface less important info (such as CU or MIOpen version differences).

Logging with levels is a good idea; I can work on it in a separate work item (a sketch follows below). But I think for now we should keep the warnings for MIOpen version and CU mismatch.

— umangyadav, Oct 28 '22
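
A sketch of the leveled-logging idea discussed above, loosely modeled on MIOpen's `MIOPEN_LOG_LEVEL`. The `MIGRAPHX_LOG_LEVEL` variable and all names here are hypothetical; this is not an implemented MIGraphX API.

```cpp
// Hypothetical leveled logger: warnings visible by default, less important
// diagnostics (CU count, MIOpen version) only at a higher level.
#include <cstdlib>
#include <iostream>
#include <string>

enum class log_level { none = 0, error = 1, warning = 2, info = 3 };

log_level current_level()
{
    const char* env = std::getenv("MIGRAPHX_LOG_LEVEL"); // hypothetical variable
    if(env == nullptr)
        return log_level::warning; // warnings show by default, per the proposal
    return static_cast<log_level>(std::atoi(env));
}

void log_msg(log_level level, const std::string& msg)
{
    if(level <= current_level())
        std::cerr << msg << "\n";
}
```

Under this scheme, a gfx-arch mismatch could be logged at `warning` while CU-count or MIOpen-version differences go to `info`, so they only appear when the user raises the level.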

> Logging with levels is a good idea; I can work on it in a separate work item. But I think for now we should keep the warnings for MIOpen version and CU mismatch.

You can still put the warnings behind an env variable. Also, the warnings should probably go in finalize so they fire only when MIOpen is actually used; they are completely pointless when running a model with no convolutions, such as BERT. (A sketch of the finalize placement follows below.)

— pfultz2, Oct 28 '22
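
A sketch of the finalize placement suggested above: the check lives on the MIOpen convolution op itself, so it only runs for programs that actually use MIOpen. The class and member names are hypothetical; only the idea of a per-op finalize step comes from the discussion.

```cpp
// Hypothetical op with a finalize-time check. Models with no convolutions
// (e.g. BERT) never instantiate this op, so they never see the warning.
#include <iostream>

struct miopen_convolution_op
{
    bool tuning_info_matches_device = false; // set when solutions are loaded

    void finalize()
    {
        if(not tuning_info_matches_device)
            std::cerr << "WARNING: saved convolution tuning does not match "
                         "this device; re-tune for optimal performance\n";
    }
};
```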