Add FusedConv Op to ROCm
Description: This adds the FusedConv op to ROCm and enables the corresponding unit tests for ROCm.
Motivation and Context
- Why is this change required? What problem does it solve?
- FusedConv is missing from the ROCm contrib_ops, and this operator is used by the ORT converter (see discussion in #11476).
pls resolve the conflicts first, thx
Updated @ytaous
/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline
Azure Pipelines successfully started running 8 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
LGTM - @iK1D @weixingzhang - can one of u pls also take a look? thx
The actual impl is not a real fused one, because MIOpen doesn't have an equivalent of cudnnConvolutionBiasActivationForward. So I am thinking we just don't enable ConvActivationFusion on ROCm, so we don't need this kernel for now. Once MIOpen has something similar to cudnnConvolutionBiasActivationForward, we can enable the optimizer again.
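For context, here is a minimal sketch of what the non-fused path amounts to on MIOpen: three separate launches where cuDNN's cudnnConvolutionBiasActivationForward needs only one. This is my own illustration, not the PR's code; the helper name `UnfusedConvBiasActivation` and its parameter list are assumptions, while the MIOpen calls themselves are the public API.

```cpp
// Sketch of the unfused Conv + Bias + Activation sequence on MIOpen. All
// descriptors and device buffers are assumed to be created by the caller;
// the function name and parameter list are illustrative only.
#include <miopen/miopen.h>

miopenStatus_t UnfusedConvBiasActivation(
    miopenHandle_t h,
    miopenTensorDescriptor_t x_desc, const void* x,
    miopenTensorDescriptor_t w_desc, const void* w,
    miopenConvolutionDescriptor_t conv_desc, miopenConvFwdAlgorithm_t algo,
    void* workspace, size_t workspace_bytes,
    miopenTensorDescriptor_t b_desc, const void* b,
    miopenActivationDescriptor_t act_desc,
    miopenTensorDescriptor_t y_desc, void* y) {
  const float one = 1.0f, zero = 0.0f;
  // 1) y = conv(x, w)
  miopenStatus_t s = miopenConvolutionForward(h, &one, x_desc, x, w_desc, w,
                                              conv_desc, algo, &zero, y_desc, y,
                                              workspace, workspace_bytes);
  if (s != miopenStatusSuccess) return s;
  // 2) y = y + b, broadcasting the 1xCx1x1 bias over N, H, W
  s = miopenOpTensor(h, miopenTensorOpAdd, &one, y_desc, y,
                     &one, b_desc, b, &zero, y_desc, y);
  if (s != miopenStatusSuccess) return s;
  // 3) y = activation(y), applied in place
  return miopenActivationForward(h, act_desc, &one, y_desc, y, &zero, y_desc, y);
}
```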
Agree, the current implementation does not do actual fusion; within FusedConv it still computes the convolution, bias, and activation separately. A cudnnConvolutionBiasActivationForward equivalent is missing from MIOpen for now, and it is unclear whether MIOpen will add one later. That said, MIOpen has a Fusion API that can express such a fusion and may be worth a try; see the sketch below.
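For reference, a minimal sketch of what that Fusion API path could look like for Conv+Bias+ReLU. The helper name `RunFusedConvBiasRelu` and the parameter list are my own assumptions; the fusion-plan calls are MIOpen's public API, but the PR's actual code will differ.

```cpp
// Hedged sketch of MIOpen's fusion API for Conv + Bias + ReLU: build a plan
// anchored on the input descriptor, append the three ops in order, compile
// once, then bind arguments and execute. Descriptor setup for x/w/b/y is
// assumed done by the caller; error checks on the CreateOp calls are omitted.
#include <miopen/miopen.h>

miopenStatus_t RunFusedConvBiasRelu(
    miopenHandle_t h,
    miopenTensorDescriptor_t x_desc, const void* x,
    miopenConvolutionDescriptor_t conv_desc,
    miopenTensorDescriptor_t w_desc, const void* w,
    miopenTensorDescriptor_t b_desc, const void* b,
    miopenTensorDescriptor_t y_desc, void* y) {
  miopenFusionPlanDescriptor_t plan;
  miopenFusionOpDescriptor_t conv_op, bias_op, act_op;
  const float one = 1.0f, zero = 0.0f;

  miopenStatus_t s = miopenCreateFusionPlan(&plan, miopenVerticalFusion, x_desc);
  if (s != miopenStatusSuccess) return s;
  miopenCreateOpConvForward(plan, &conv_op, conv_desc, w_desc);
  miopenCreateOpBiasForward(plan, &bias_op, b_desc);
  miopenCreateOpActivationForward(plan, &act_op, miopenActivationRELU);

  // Compilation can fail for unsupported shape/op combinations; the caller
  // is expected to fall back to the unfused sequence in that case.
  s = miopenCompileFusionPlan(h, plan);
  if (s == miopenStatusSuccess) {
    miopenOperatorArgs_t args;
    miopenCreateOperatorArgs(&args);
    miopenSetOpArgsConvForward(args, conv_op, &one, &zero, w);
    miopenSetOpArgsBiasForward(args, bias_op, &one, &zero, b);
    miopenSetOpArgsActivForward(args, act_op, &one, &zero,
                                /*activAlpha=*/0.0, /*activBeta=*/0.0,
                                /*activGamma=*/0.0);
    s = miopenExecuteFusionPlan(h, plan, x_desc, x, y_desc, y, args);
    miopenDestroyOperatorArgs(args);
  }
  miopenDestroyFusionPlan(plan);
  return s;
}
```

One design note: because miopenCompileFusionPlan can reject a combination at runtime, any production use needs an unfused fallback path anyway, which is exactly what the discussion below converges on.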
The real problem is not performance; it's whether a model can run on onnxruntime at all. More specifically, without FusedConv, the model from `torch_ort.ORTModel(torchvision.models.resnet50)` fails at inference with "Failed to find kernel for FusedConv". I'm looking into the Fusion API and trying to implement a real fused version of FusedConv.
Most likely the FusedConv is introduced by this line: https://github.com/microsoft/onnxruntime/blob/4f106d2b3b2f55f3bd74f0f6bcbbcd4c304b9231/onnxruntime/core/optimizer/graph_transformer_utils.cc#L237. So the quick test is to comment out this line and see if the model can run. If yes, then the quick way to fix this is to not enable it for ROCm (currently it is enabled for the CPU, CUDA, ROCm, ACL, and ArmNN EPs via cpu_cuda_rocm_acl_armnn_eps). Of course, for better perf, a real fused version of FusedConv is the best solution if possible.
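As a purely hypothetical illustration of that quick fix (the actual container type, variable names, and surrounding code in graph_transformer_utils.cc may differ):

```cpp
// Hypothetical sketch: register ConvActivationFusion without the ROCm EP so
// the optimizer never rewrites Conv+Activation into FusedConv nodes that the
// ROCm EP has no kernel for. Exact types and names in the repo may differ.
const std::unordered_set<std::string> cpu_cuda_acl_armnn_eps = {
    onnxruntime::kCpuExecutionProvider, onnxruntime::kCudaExecutionProvider,
    onnxruntime::kAclExecutionProvider, onnxruntime::kArmNNExecutionProvider};
transformers.emplace_back(std::make_unique<ConvActivationFusion>(cpu_cuda_acl_armnn_eps));
```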
I just pushed a revised version of FusedConv which uses the Fusion API when possible. Due to limitations of MIOpen, I kept the old code path as the fallback; a sketch of the dispatch follows. For example, the new code path cannot handle FusedConvTest.Conv2D_Bias_Z_Relu, which requires adding both Z and Bias; that is not supported as of ROCm 5.1 (not tested on 5.2 yet).
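The fallback dispatch could be shaped roughly like this fragment. It reuses the two hypothetical helpers from the sketches earlier in this thread and an assumed `has_z_input` flag; it is not the PR's actual control flow.

```cpp
// Hypothetical dispatch: prefer the Fusion API plan; fall back to the
// separate-call path when the plan is unusable, e.g. when an extra Z addend
// is present (the Conv2D_Bias_Z_Relu case, unsupported as of ROCm 5.1).
// All variables and both helpers come from the earlier sketches.
miopenStatus_t s = miopenStatusUnknownError;
if (!has_z_input) {  // assumed flag: the fused plan cannot take Z
  s = RunFusedConvBiasRelu(h, x_desc, x, conv_desc, w_desc, w,
                           b_desc, b, y_desc, y);
}
if (s != miopenStatusSuccess) {
  // Old code path: conv, bias add, activation as separate launches. Handling
  // Z would take one more miopenOpTensor add before the activation step.
  s = UnfusedConvBiasActivation(h, x_desc, x, w_desc, w, conv_desc, algo,
                                workspace, workspace_bytes, b_desc, b,
                                act_desc, y_desc, y);
}
```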
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline
/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline
Azure Pipelines successfully started running 9 pipeline(s).
Azure Pipelines successfully started running 8 pipeline(s).
I noticed a flaw in the previous Fusion API usage and pushed a new commit to address it (see the commit message for details). This change is intentionally kept as a separate commit so that we can revert it once support for older MIOpen versions is deprecated and removed in the future.
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline
/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline
Azure Pipelines successfully started running 9 pipeline(s).
Azure Pipelines successfully started running 8 pipeline(s).
FYI - in progress - https://github.com/microsoft/onnxruntime/pull/12410
@ytaous @iK1D @mindest Updated a new version with full features. The only remaining issue is that, due to an interface limitation of MIOpen, the spatial dimension of the convolution operator can only be guessed, so errors like `MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/convolution_api.cpp:205: requestedSpatialDim is larger than actual spatial dimension` will be printed while computing the hash.
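To make the "guessing" concrete, the probing pattern looks roughly like this. `GuessSpatialDim` is a hypothetical helper of mine; miopenGetConvolutionNdDescriptor is the real getter, and its failure on an oversized request is what produces the logged error above.

```cpp
// MIOpen has no direct query for a convolution descriptor's rank:
// miopenGetConvolutionNdDescriptor fails (and MIOpen logs the
// "requestedSpatialDim is larger than actual spatial dimension" error) when
// the requested rank exceeds the actual one, so probe downward until a call
// succeeds. max_dim = 3 is an assumed upper bound for this sketch.
#include <miopen/miopen.h>

int GuessSpatialDim(miopenConvolutionDescriptor_t conv_desc, int max_dim = 3) {
  for (int requested = max_dim; requested >= 1; --requested) {
    int spatial_dim = 0;
    int pads[3], strides[3], dilations[3];  // sized for max_dim <= 3
    miopenConvolutionMode_t mode;
    if (miopenGetConvolutionNdDescriptor(conv_desc, requested, &spatial_dim,
                                         pads, strides, dilations,
                                         &mode) == miopenStatusSuccess) {
      return spatial_dim;  // first successful probe gives the actual rank
    }
  }
  return -1;  // no spatial dimension could be recovered
}
```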
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline
/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline
Azure Pipelines successfully started running 9 pipeline(s).
Azure Pipelines successfully started running 8 pipeline(s).
Thanks for your contribution.
When I use onnxruntime (latest main branch, v1.1.4.0) with the ROCm provider to run YOLO-series networks, I find that the fused conv+LeakyRelu operator still does not work properly. The error message is related to MIOpen:

```
2023-02-02 03:04:49.384769991 [E:onnxruntime:Default, rocm_call.cc:122 RocmCall] MIOPEN failure 1: miopenStatusNotInitialized ; GPU=0 ; hostname=dcunode3 ; expr=status_;
2023-02-02 03:04:49.384854310 [E:onnxruntime:, sequential_executor.cc:369 Execute] Non-zero status code returned while running FusedConv node. Name:'Conv_4' Status Message: MIOPEN failure 1: miopenStatusNotInitialized ; GPU=0 ; hostname=dcunode3 ; expr=status_;
Exception in thread Thread-1:
...
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running FusedConv node. Name:'Conv_4' Status Message: MIOPEN failure 1: miopenStatusNotInitialized ; GPU=0 ; hostname=dcunode3 ; expr=status_;
```

- ROCm version: 5.2.0