
Fix for Lower unsupported pooling sizes for the CPU to Reference backend #3177

Open aditya-167 opened this issue 1 year ago • 10 comments

Hi @pfultz2, @CharlieL7,

I am working on https://github.com/ROCm/AMDMIGraphX/issues/3177 (this is my first PR). @umangyadav suggested that you could help review my changes.

Problem Description

OneDNN has a known limitation when handling specific pooling configurations related to padding, stride, and kernel size.

Solution Overview

OneDNN fails for certain combinations of pooling parameters, such as padding = {2}, stride = {1}, and lengths = {2}. OneDNN also has a kernel-size limitation: the referenced OneDNN documentation specifies a maximum dimension of 14 for the pooling kernel (referred to as the "weights tensor"). MIGraphX does not currently enforce either constraint, which can lead to failures during execution.

The goal of this PR is to address these limitations by detecting such problematic pooling configurations and falling back to the reference backend (a CPU-based implementation) when OneDNN is unable to execute the pooling operation.

Condition Checking:

The pooling operator is analyzed to determine whether its configuration violates OneDNN's known limitations:

- Check whether any kernel dimension (lengths) exceeds the maximum size of 14.
- Check for combinations of padding, stride, and kernel size that are known to fail with OneDNN.

If OneDNN cannot execute the pooling operation due to these limitations, the code falls back to the reference backend (a CPU-based pooling implementation).

Fallback Mechanism:

For invalid configurations, the pooling operation is replaced with the reference backend pooling operator (`ref::pooling`). Valid configurations continue to use the OneDNN pooling operator (`dnnl::pooling`).

Code Changes:

https://github.com/aditya-167/AMDMIGraphX/blob/054055f93d6e65aa8b3d7c46a701156b97d15b8d/src/targets/cpu/lowering.cpp#L432C5-L453C10
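The linked change is roughly along these lines (a hedged sketch of the conditions described above; the names `max_onednn_pool_dim` and `onednn_can_pool` are illustrative, not the identifiers actually used in `lowering.cpp`):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative limit taken from the OneDNN docs referenced in this PR:
// no kernel ("weights tensor") dimension larger than 14.
constexpr std::size_t max_onednn_pool_dim = 14;

// Returns true if OneDNN can handle this pooling configuration; when it
// returns false, the lowering would emit ref::pooling instead of
// dnnl::pooling.
bool onednn_can_pool(const std::vector<std::size_t>& lengths,
                     const std::vector<std::size_t>& padding)
{
    // Rule 1: every kernel dimension must be <= 14.
    if(std::any_of(lengths.begin(), lengths.end(), [](std::size_t d) {
           return d > max_onednn_pool_dim;
       }))
        return false;
    // Rule 2: padding must be strictly smaller than the kernel in each
    // dimension; otherwise some windows contain only padding
    // (e.g. padding = {2}, lengths = {2}).
    for(std::size_t i = 0; i < padding.size() && i < lengths.size(); ++i)
    {
        if(padding[i] >= lengths[i])
            return false;
    }
    return true;
}
```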

Please suggest any improvements needed (this is my first PR) :-) . Also, how should I test this? I was thinking of creating a test file for it, but I was wondering whether one already exists in the repo that I could add to; I couldn't find one for this operation.

aditya-167 avatar Sep 22 '24 22:09 aditya-167

Looks good overall. Add a verify test in `test/verify/` to make sure this works. Verify tests run the whole compilation process on all targets. The test should look something like this: `test/verify/test_pooling_autopad.cpp`

Okay, I have added a test, `test_pooling_fallback.cpp`, that checks three cases: one valid configuration and two that should trigger the fallback (kernel size > 14, and an invalid padding/stride combination).

aditya-167 avatar Sep 23 '24 20:09 aditya-167

Failing to compile, needs fix. Check CI runs for the errors.

CharlieL7 avatar Sep 24 '24 15:09 CharlieL7

> Failing to compile, needs fix. Check CI runs for the errors.

I have updated the tests; it seems I was not handling it the right way before. It compiles in my local repo now.

aditya-167 avatar Sep 24 '24 21:09 aditya-167

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.05%. Comparing base (b9fe915) to head (e5d3201). Report is 26 commits behind head on develop.

Additional details and impacted files
```diff
@@           Coverage Diff            @@
##           develop    #3468   +/-   ##
========================================
  Coverage    92.05%   92.05%
========================================
  Files          509      509
  Lines        21014    21014
========================================
  Hits         19345    19345
  Misses        1669     1669
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Sep 25 '24 15:09 codecov[bot]

I'm going to help this PR along and make some changes to pass the rest of the CI.

CharlieL7 avatar Sep 27 '24 15:09 CharlieL7

> I'm going to help this PR along and make some changes to pass the rest of the CI.

Thanks @CharlieL7. I reviewed the CI; it fails at cppcheck and format. I'm not sure how to resolve this, since it compiles and all tests pass locally.

aditya-167 avatar Sep 27 '24 15:09 aditya-167

It's not clear to me what the actual limitation for this kernel is from OneDNN. Umang did write a TODO comment, but the link provided does not mention such limitations. Need to find a source that lays out the limitation or other proof that OneDNN cannot handle these cases.

> It's not clear to me what the actual limitation for this kernel is from OneDNN. Umang did write a TODO comment, but the link provided does not mention such limitations. Need to find a source that lays out the limitation or other proof that OneDNN cannot handle these cases.

Earlier I was referring to https://oneapi-src.github.io/oneDNN/dev_guide_convolution.html#doxid-dev-guide-convolution. If you look at the oneDNN algorithm section, it says "Weights tensor width does not exceed 14." I think I mixed up a few things here; I need to dig into this more and test a few cases where it fails.

aditya-167 avatar Sep 27 '24 15:09 aditya-167

Right, I see the limit for the direct algorithm for convolution, but this is for pooling. Technically, pooling can be implemented as a convolution, but the pooling page doesn't mention having the same limitations.

CharlieL7 avatar Sep 27 '24 16:09 CharlieL7

@pfultz2, as @CharlieL7 rightly pointed out, the OneDNN spec (https://oneapi-src.github.io/oneDNN/dev_guide_pooling.html#doxid-dev-guide-pooling-1dg-pool-impl-limits) doesn't explicitly mention kernel-length or stride limits for pooling; earlier I was referring to the convolution page. However, in the comments @umangyadav mentioned two padding cases where lengths = {3} fails. Now my question is: do I have to test each pooling configuration, check where it first fails, and put those conditions in the fallback? I am thinking of using binary search over a range of lengths to find where it first starts to fail, and using that as the condition for the reference CPU fallback.

aditya-167 avatar Oct 09 '24 03:10 aditya-167

| Test | Batch | Rate new (b9fe91) | Rate old (0d07c2) | Diff | Compare |
|---|---|---|---|---|---|
| torchvision-resnet50 | 64 | 3,263.47 | 3,260.52 | 0.09% | :white_check_mark: |
| torchvision-resnet50_fp16 | 64 | 6,994.34 | 6,982.39 | 0.17% | :white_check_mark: |
| torchvision-densenet121 | 32 | 2,432.66 | 2,430.89 | 0.07% | :white_check_mark: |
| torchvision-densenet121_fp16 | 32 | 4,091.62 | 4,067.06 | 0.60% | :white_check_mark: |
| torchvision-inceptionv3 | 32 | 1,638.11 | 1,636.77 | 0.08% | :white_check_mark: |
| torchvision-inceptionv3_fp16 | 32 | 2,764.51 | 2,755.18 | 0.34% | :white_check_mark: |
| cadene-inceptionv4 | 16 | 776.17 | 775.67 | 0.06% | :white_check_mark: |
| cadene-resnext64x4 | 16 | 808.64 | 808.09 | 0.07% | :white_check_mark: |
| slim-mobilenet | 64 | 7,534.19 | 7,531.83 | 0.03% | :white_check_mark: |
| slim-nasnetalarge | 64 | 211.48 | 211.39 | 0.04% | :white_check_mark: |
| slim-resnet50v2 | 64 | 3,501.30 | 3,495.72 | 0.16% | :white_check_mark: |
| bert-mrpc-onnx | 8 | 1,151.19 | 1,140.79 | 0.91% | :white_check_mark: |
| bert-mrpc-tf | 1 | 494.24 | 447.09 | 10.55% | :high_brightness: |
| pytorch-examples-wlang-gru | 1 | 423.82 | 357.24 | 18.64% | :high_brightness: |
| pytorch-examples-wlang-lstm | 1 | 377.64 | 320.71 | 17.75% | :high_brightness: |
| torchvision-resnet50_1 | 1 | 788.81 | 776.99 | 1.52% | :white_check_mark: |
| cadene-dpn92_1 | 1 | 401.12 | 430.45 | -6.81% | :red_circle: |
| cadene-resnext101_1 | 1 | 380.29 | 378.42 | 0.49% | :white_check_mark: |
| onnx-taau-downsample | 1 | 342.96 | 341.70 | 0.37% | :white_check_mark: |
| dlrm-criteoterabyte | 1 | 33.34 | 32.05 | 4.02% | :high_brightness: |
| dlrm-criteoterabyte_fp16 | 1 | 52.75 | 50.61 | 4.21% | :high_brightness: |
| agentmodel | 1 | 8,426.09 | 6,696.76 | 25.82% | :high_brightness: |
| unet_fp16 | 2 | 58.86 | 58.59 | 0.47% | :white_check_mark: |
| resnet50v1_fp16 | 1 | 922.79 | 840.65 | 9.77% | :high_brightness: |
| resnet50v1_int8 | 1 | 983.99 | 931.26 | 5.66% | :high_brightness: |
| bert_base_cased_fp16 | 64 | 1,171.06 | 1,170.54 | 0.05% | :white_check_mark: |
| bert_large_uncased_fp16 | 32 | 363.59 | 363.52 | 0.02% | :white_check_mark: |
| bert_large_fp16 | 1 | 200.40 | 196.02 | 2.24% | :white_check_mark: |
| distilgpt2_fp16 | 16 | 2,203.86 | 2,199.51 | 0.20% | :white_check_mark: |
| yolov5s | 1 | 541.58 | 536.04 | 1.03% | :white_check_mark: |
| tinyllama | 1 | 43.44 | 43.33 | 0.27% | :white_check_mark: |
| vicuna-fastchat | 1 | 170.68 | 171.81 | -0.65% | :white_check_mark: |
| whisper-tiny-encoder | 1 | 418.42 | 416.70 | 0.41% | :white_check_mark: |
| whisper-tiny-decoder | 1 | 428.82 | 418.85 | 2.38% | :white_check_mark: |

This build is not recommended to merge :red_circle:

migraphx-bot avatar Oct 09 '24 14:10 migraphx-bot


     :white_check_mark: bert-mrpc-onnx: PASSED: MIGraphX meets tolerance
     :white_check_mark: bert-mrpc-tf: PASSED: MIGraphX meets tolerance
     :white_check_mark: pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance
     :white_check_mark: pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance
     :white_check_mark: torchvision-resnet50_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: cadene-dpn92_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: cadene-resnext101_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance
     :white_check_mark: agentmodel: PASSED: MIGraphX meets tolerance
     :white_check_mark: unet: PASSED: MIGraphX meets tolerance
     :white_check_mark: resnet50v1: PASSED: MIGraphX meets tolerance
     :white_check_mark: bert_base_cased_fp16: PASSED: MIGraphX meets tolerance
     :red_circle: bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
     :white_check_mark: bert_large: PASSED: MIGraphX meets tolerance
     :white_check_mark: yolov5s: PASSED: MIGraphX meets tolerance
     :white_check_mark: tinyllama: PASSED: MIGraphX meets tolerance
     :white_check_mark: vicuna-fastchat: PASSED: MIGraphX meets tolerance
     :white_check_mark: whisper-tiny-encoder: PASSED: MIGraphX meets tolerance
     :white_check_mark: whisper-tiny-decoder: PASSED: MIGraphX meets tolerance
     :white_check_mark: distilgpt2_fp16: PASSED: MIGraphX meets tolerance

migraphx-bot avatar Oct 09 '24 14:10 migraphx-bot

> Now my question is: do I have to test each configuration for pooling, check where it first fails, and put those conditions in the fallback? I am thinking of using binary search over a range of lengths to find where it first starts to fail, and using that as the condition for the reference CPU fallback.

Hello @aditya-167, your search idea sounds like a good plan. It might be faster to ask the OneDNN team about the limitations. Also, we're looking into changing over from OneDNN to ZenDNN (https://github.com/amd/ZenDNN) in the future.

CharlieL7 avatar Oct 21 '24 20:10 CharlieL7