
Fix for Lower unsupported pooling sizes for the CPU to Reference backend #3177

Open aditya-167 opened this issue 1 year ago • 10 comments

Hi @pfultz2, @CharlieL7,

I am working on https://github.com/ROCm/AMDMIGraphX/issues/3177 (this is my first PR). @umangyadav suggested that you could help review my changes.

Problem Description

OneDNN has a known limitation when handling specific pooling configurations related to padding, stride, and kernel size.

Solution Overview

OneDNN fails for certain combinations of pooling parameters, such as padding = {2}, stride = {1}, and lengths = {2}. OneDNN also has a kernel-size limitation: the referenced OneDNN documentation specifies a maximum dimension of 14 for the pooling kernel (referred to as the "weights tensor"). MIGraphX does not currently enforce either constraint, which can lead to failures during execution.

The goal of this PR is to address these limitations by detecting such problematic pooling configurations and falling back to the reference backend (a CPU-based implementation) when OneDNN is unable to execute the pooling operation.

Condition Checking:

The pooling operator is analyzed to determine whether its configuration violates OneDNN's known limitations:

- Check whether any kernel dimension (lengths) exceeds the maximum size of 14.
- Check for combinations of padding, stride, and kernel size that are known to fail with OneDNN.

If OneDNN cannot execute the pooling operation due to these limitations, the code falls back to the reference backend (a CPU-based pooling implementation).

Fallback Mechanism:

For invalid configurations, the pooling operation is replaced with the reference backend pooling operator (`ref::pooling`). Valid configurations continue to use the OneDNN pooling operator (`dnnl::pooling`).

Code Changes:

https://github.com/aditya-167/AMDMIGraphX/blob/054055f93d6e65aa8b3d7c46a701156b97d15b8d/src/targets/cpu/lowering.cpp#L432C5-L453C10
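The linked change is roughly along these lines (a hedged sketch of the conditions described above; the names `max_onednn_pool_dim` and `onednn_can_pool` are illustrative, not the identifiers actually used in `lowering.cpp`):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative limit taken from the OneDNN docs referenced in this PR:
// no kernel ("weights tensor") dimension larger than 14.
constexpr std::size_t max_onednn_pool_dim = 14;

// Returns true if OneDNN can handle this pooling configuration; when it
// returns false, the lowering would emit ref::pooling instead of
// dnnl::pooling.
bool onednn_can_pool(const std::vector<std::size_t>& lengths,
                     const std::vector<std::size_t>& padding)
{
    // Rule 1: every kernel dimension must be <= 14.
    if(std::any_of(lengths.begin(), lengths.end(), [](std::size_t d) {
           return d > max_onednn_pool_dim;
       }))
        return false;
    // Rule 2: padding must be strictly smaller than the kernel in each
    // dimension; otherwise some windows contain only padding
    // (e.g. padding = {2}, lengths = {2}).
    for(std::size_t i = 0; i < padding.size() && i < lengths.size(); ++i)
    {
        if(padding[i] >= lengths[i])
            return false;
    }
    return true;
}
```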

Please suggest any improvements needed (this is my first PR) :-) . Also, how should I test this? I was thinking of creating a test file for it, but I was wondering whether one already exists in the repo that I could add to; I couldn't find one for this operation.

aditya-167 avatar Sep 22 '24 22:09 aditya-167

Looks good overall. Add a verify test in `test/verify/` to make sure this works. Verify tests run the whole compilation process on all targets. The test should look something like this: `test/verify/test_pooling_autopad.cpp`

Okay, I have added a test, `test_pooling_fallback.cpp`, that checks three cases: one valid configuration and two that should trigger the fallback (kernel size > 14, and an invalid padding/stride combination).

aditya-167 avatar Sep 23 '24 20:09 aditya-167

Failing to compile, needs fix. Check CI runs for the errors.

CharlieL7 avatar Sep 24 '24 15:09 CharlieL7

> Failing to compile, needs fix. Check CI runs for the errors.

I have updated the tests; it seems I was not handling it the right way before. It compiles in my local repo now.

aditya-167 avatar Sep 24 '24 21:09 aditya-167

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.05%. Comparing base (b9fe915) to head (e5d3201). Report is 26 commits behind head on develop.

Additional details and impacted files
```diff
@@           Coverage Diff            @@
##           develop    #3468   +/-   ##
========================================
  Coverage    92.05%   92.05%
========================================
  Files          509      509
  Lines        21014    21014
========================================
  Hits         19345    19345
  Misses        1669     1669
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Sep 25 '24 15:09 codecov[bot]

I'm going to help this PR along and make some changes to pass the rest of the CI.

CharlieL7 avatar Sep 27 '24 15:09 CharlieL7

> I'm going to help this PR along and make some changes to pass the rest of the CI.

Thanks @CharlieL7. I reviewed the CI; it fails at cppcheck and format. I'm not sure how to resolve this, since it compiles and all tests pass locally.

aditya-167 avatar Sep 27 '24 15:09 aditya-167

It's not clear to me what the actual limitation for this kernel is from OneDNN. Umang did write a TODO comment, but the link provided does not mention such limitations. Need to find a source that lays out the limitation or other proof that OneDNN cannot handle these cases.

> It's not clear to me what the actual limitation for this kernel is from OneDNN. Umang did write a TODO comment, but the link provided does not mention such limitations. Need to find a source that lays out the limitation or other proof that OneDNN cannot handle these cases.

Earlier I was referring to https://oneapi-src.github.io/oneDNN/dev_guide_convolution.html#doxid-dev-guide-convolution. If you look at the oneDNN algorithm section, it says "Weights tensor width does not exceed 14." I think I mixed up a few things here; I need to dig into this more and test a few cases where it fails.

aditya-167 avatar Sep 27 '24 15:09 aditya-167

Right, I see the limit for the direct algorithm for convolution, but this is for pooling. Technically, pooling can be implemented as a convolution, but the pooling page doesn't mention having the same limitations.

CharlieL7 avatar Sep 27 '24 16:09 CharlieL7

@pfultz2, as @CharlieL7 rightly pointed out, the OneDNN spec (https://oneapi-src.github.io/oneDNN/dev_guide_pooling.html#doxid-dev-guide-pooling-1dg-pool-impl-limits) doesn't explicitly mention kernel-length or stride limits for pooling; earlier I was referring to the convolution page. However, in the comments @umangyadav mentioned two padding cases where lengths = {3} fails. Now my question is: do I have to test each pooling configuration, check where it first fails, and put those conditions in the fallback? I am thinking of using binary search over a range of lengths to find where it first starts to fail, and using that as the condition for the reference CPU fallback.

aditya-167 avatar Oct 09 '24 03:10 aditya-167

| Test | Batch | Rate new (b9fe91) | Rate old (0d07c2) | Diff | Compare |
|---|---|---|---|---|---|
| torchvision-resnet50 | 64 | 3,263.47 | 3,260.52 | 0.09% | :white_check_mark: |
| torchvision-resnet50_fp16 | 64 | 6,994.34 | 6,982.39 | 0.17% | :white_check_mark: |
| torchvision-densenet121 | 32 | 2,432.66 | 2,430.89 | 0.07% | :white_check_mark: |
| torchvision-densenet121_fp16 | 32 | 4,091.62 | 4,067.06 | 0.60% | :white_check_mark: |
| torchvision-inceptionv3 | 32 | 1,638.11 | 1,636.77 | 0.08% | :white_check_mark: |
| torchvision-inceptionv3_fp16 | 32 | 2,764.51 | 2,755.18 | 0.34% | :white_check_mark: |
| cadene-inceptionv4 | 16 | 776.17 | 775.67 | 0.06% | :white_check_mark: |
| cadene-resnext64x4 | 16 | 808.64 | 808.09 | 0.07% | :white_check_mark: |
| slim-mobilenet | 64 | 7,534.19 | 7,531.83 | 0.03% | :white_check_mark: |
| slim-nasnetalarge | 64 | 211.48 | 211.39 | 0.04% | :white_check_mark: |
| slim-resnet50v2 | 64 | 3,501.30 | 3,495.72 | 0.16% | :white_check_mark: |
| bert-mrpc-onnx | 8 | 1,151.19 | 1,140.79 | 0.91% | :white_check_mark: |
| bert-mrpc-tf | 1 | 494.24 | 447.09 | 10.55% | :high_brightness: |
| pytorch-examples-wlang-gru | 1 | 423.82 | 357.24 | 18.64% | :high_brightness: |
| pytorch-examples-wlang-lstm | 1 | 377.64 | 320.71 | 17.75% | :high_brightness: |
| torchvision-resnet50_1 | 1 | 788.81 | 776.99 | 1.52% | :white_check_mark: |
| cadene-dpn92_1 | 1 | 401.12 | 430.45 | -6.81% | :red_circle: |
| cadene-resnext101_1 | 1 | 380.29 | 378.42 | 0.49% | :white_check_mark: |
| onnx-taau-downsample | 1 | 342.96 | 341.70 | 0.37% | :white_check_mark: |
| dlrm-criteoterabyte | 1 | 33.34 | 32.05 | 4.02% | :high_brightness: |
| dlrm-criteoterabyte_fp16 | 1 | 52.75 | 50.61 | 4.21% | :high_brightness: |
| agentmodel | 1 | 8,426.09 | 6,696.76 | 25.82% | :high_brightness: |
| unet_fp16 | 2 | 58.86 | 58.59 | 0.47% | :white_check_mark: |
| resnet50v1_fp16 | 1 | 922.79 | 840.65 | 9.77% | :high_brightness: |
| resnet50v1_int8 | 1 | 983.99 | 931.26 | 5.66% | :high_brightness: |
| bert_base_cased_fp16 | 64 | 1,171.06 | 1,170.54 | 0.05% | :white_check_mark: |
| bert_large_uncased_fp16 | 32 | 363.59 | 363.52 | 0.02% | :white_check_mark: |
| bert_large_fp16 | 1 | 200.40 | 196.02 | 2.24% | :white_check_mark: |
| distilgpt2_fp16 | 16 | 2,203.86 | 2,199.51 | 0.20% | :white_check_mark: |
| yolov5s | 1 | 541.58 | 536.04 | 1.03% | :white_check_mark: |
| tinyllama | 1 | 43.44 | 43.33 | 0.27% | :white_check_mark: |
| vicuna-fastchat | 1 | 170.68 | 171.81 | -0.65% | :white_check_mark: |
| whisper-tiny-encoder | 1 | 418.42 | 416.70 | 0.41% | :white_check_mark: |
| whisper-tiny-decoder | 1 | 428.82 | 418.85 | 2.38% | :white_check_mark: |

This build is not recommended to merge :red_circle:

migraphx-bot avatar Oct 09 '24 14:10 migraphx-bot


     :white_check_mark: bert-mrpc-onnx: PASSED: MIGraphX meets tolerance
     :white_check_mark: bert-mrpc-tf: PASSED: MIGraphX meets tolerance
     :white_check_mark: pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance
     :white_check_mark: pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance
     :white_check_mark: torchvision-resnet50_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: cadene-dpn92_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: cadene-resnext101_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance
     :white_check_mark: agentmodel: PASSED: MIGraphX meets tolerance
     :white_check_mark: unet: PASSED: MIGraphX meets tolerance
     :white_check_mark: resnet50v1: PASSED: MIGraphX meets tolerance
     :white_check_mark: bert_base_cased_fp16: PASSED: MIGraphX meets tolerance
     :red_circle: bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
     :white_check_mark: bert_large: PASSED: MIGraphX meets tolerance
     :white_check_mark: yolov5s: PASSED: MIGraphX meets tolerance
     :white_check_mark: tinyllama: PASSED: MIGraphX meets tolerance
     :white_check_mark: vicuna-fastchat: PASSED: MIGraphX meets tolerance
     :white_check_mark: whisper-tiny-encoder: PASSED: MIGraphX meets tolerance
     :white_check_mark: whisper-tiny-decoder: PASSED: MIGraphX meets tolerance
     :white_check_mark: distilgpt2_fp16: PASSED: MIGraphX meets tolerance

migraphx-bot avatar Oct 09 '24 14:10 migraphx-bot

> Now my question is: do I have to test each configuration for pooling, check where it first fails, and put those conditions in the fallback? I am thinking of using binary search over a range of lengths to find where it first starts to fail, and using that as the condition for the reference CPU fallback.

Hello @aditya-167, your search idea sounds like a good plan. It might be faster to ask the OneDNN team about the limitations. Also, we're looking into changing over from OneDNN to ZenDNN (https://github.com/amd/ZenDNN) in the future.

CharlieL7 avatar Oct 21 '24 20:10 CharlieL7