Add FusedConv Op to ROCm
Description: This adds the FusedConv op to ROCm and enables the corresponding unit tests for ROCm.
Motivation and Context
- Why is this change required? What problem does it solve?
- FusedConv is missing from the ROCm contrib_ops, and this operator is used by the ORT converter (see discussion in #11476).
pls resolve the conflicts first, thx
Updated @ytaous
/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline
Azure Pipelines successfully started running 8 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
LGTM - @iK1D @weixingzhang - can one of u pls also take a look? thx
The actual impl is not a real fused one, because MIOpen doesn't have an equivalent of cudnnConvolutionBiasActivationForward. So I am thinking we just don't enable ConvActivationFusion on ROCm, so we don't need this kernel for now. Once MIOpen has something similar to cudnnConvolutionBiasActivationForward, we can enable the optimizer again.
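For context, here is a minimal sketch of what the non-fused path amounts to on MIOpen: three separate launches where cuDNN's cudnnConvolutionBiasActivationForward needs only one. This is my own illustration, not the PR's code; the helper name `UnfusedConvBiasActivation` and its parameter list are assumptions, while the MIOpen calls themselves are the public API.

```cpp
// Sketch of the unfused Conv + Bias + Activation sequence on MIOpen. All
// descriptors and device buffers are assumed to be created by the caller;
// the function name and parameter list are illustrative only.
#include <miopen/miopen.h>

miopenStatus_t UnfusedConvBiasActivation(
    miopenHandle_t h,
    miopenTensorDescriptor_t x_desc, const void* x,
    miopenTensorDescriptor_t w_desc, const void* w,
    miopenConvolutionDescriptor_t conv_desc, miopenConvFwdAlgorithm_t algo,
    void* workspace, size_t workspace_bytes,
    miopenTensorDescriptor_t b_desc, const void* b,
    miopenActivationDescriptor_t act_desc,
    miopenTensorDescriptor_t y_desc, void* y) {
  const float one = 1.0f, zero = 0.0f;
  // 1) y = conv(x, w)
  miopenStatus_t s = miopenConvolutionForward(h, &one, x_desc, x, w_desc, w,
                                              conv_desc, algo, &zero, y_desc, y,
                                              workspace, workspace_bytes);
  if (s != miopenStatusSuccess) return s;
  // 2) y = y + b, broadcasting the 1xCx1x1 bias over N, H, W
  s = miopenOpTensor(h, miopenTensorOpAdd, &one, y_desc, y,
                     &one, b_desc, b, &zero, y_desc, y);
  if (s != miopenStatusSuccess) return s;
  // 3) y = activation(y), applied in place
  return miopenActivationForward(h, act_desc, &one, y_desc, y, &zero, y_desc, y);
}
```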
Agree, the current implementation does not do actual fusion; within FusedConv it still computes the convolution, bias, and activation separately. A cudnnConvolutionBiasActivationForward equivalent is missing from MIOpen for now, and it is unclear whether MIOpen will add one later. That said, MIOpen has a Fusion API that can express such a fusion and may be worth a try; see the sketch below.
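For reference, a minimal sketch of what that Fusion API path could look like for Conv+Bias+ReLU. The helper name `RunFusedConvBiasRelu` and the parameter list are my own assumptions; the fusion-plan calls are MIOpen's public API, but the PR's actual code will differ.

```cpp
// Hedged sketch of MIOpen's fusion API for Conv + Bias + ReLU: build a plan
// anchored on the input descriptor, append the three ops in order, compile
// once, then bind arguments and execute. Descriptor setup for x/w/b/y is
// assumed done by the caller; error checks on the CreateOp calls are omitted.
#include <miopen/miopen.h>

miopenStatus_t RunFusedConvBiasRelu(
    miopenHandle_t h,
    miopenTensorDescriptor_t x_desc, const void* x,
    miopenConvolutionDescriptor_t conv_desc,
    miopenTensorDescriptor_t w_desc, const void* w,
    miopenTensorDescriptor_t b_desc, const void* b,
    miopenTensorDescriptor_t y_desc, void* y) {
  miopenFusionPlanDescriptor_t plan;
  miopenFusionOpDescriptor_t conv_op, bias_op, act_op;
  const float one = 1.0f, zero = 0.0f;

  miopenStatus_t s = miopenCreateFusionPlan(&plan, miopenVerticalFusion, x_desc);
  if (s != miopenStatusSuccess) return s;
  miopenCreateOpConvForward(plan, &conv_op, conv_desc, w_desc);
  miopenCreateOpBiasForward(plan, &bias_op, b_desc);
  miopenCreateOpActivationForward(plan, &act_op, miopenActivationRELU);

  // Compilation can fail for unsupported shape/op combinations; the caller
  // is expected to fall back to the unfused sequence in that case.
  s = miopenCompileFusionPlan(h, plan);
  if (s == miopenStatusSuccess) {
    miopenOperatorArgs_t args;
    miopenCreateOperatorArgs(&args);
    miopenSetOpArgsConvForward(args, conv_op, &one, &zero, w);
    miopenSetOpArgsBiasForward(args, bias_op, &one, &zero, b);
    miopenSetOpArgsActivForward(args, act_op, &one, &zero,
                                /*activAlpha=*/0.0, /*activBeta=*/0.0,
                                /*activGamma=*/0.0);
    s = miopenExecuteFusionPlan(h, plan, x_desc, x, y_desc, y, args);
    miopenDestroyOperatorArgs(args);
  }
  miopenDestroyFusionPlan(plan);
  return s;
}
```

One design note: because miopenCompileFusionPlan can reject a combination at runtime, any production use needs an unfused fallback path anyway, which is exactly what the discussion below converges on.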
The real problem is not performance; it's whether a model can run on onnxruntime at all. More specifically, without FusedConv, the model from `torch_ort.ORTModel(torchvision.models.resnet50)` fails at inference with "Failed to find kernel for FusedConv". I'm looking into the Fusion API and trying to implement a real fused version of FusedConv.
Most likely the FusedConv is introduced by this line: https://github.com/microsoft/onnxruntime/blob/4f106d2b3b2f55f3bd74f0f6bcbbcd4c304b9231/onnxruntime/core/optimizer/graph_transformer_utils.cc#L237. So the quick test is to comment out this line and see if the model can run. If yes, then the quick way to fix this is to not enable it for ROCm (currently it is enabled for the CPU, CUDA, ROCm, ACL, and ArmNN EPs via cpu_cuda_rocm_acl_armnn_eps). Of course, for better perf, a real fused version of FusedConv is the best solution if possible.
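As a purely hypothetical illustration of that quick fix (the actual container type, variable names, and surrounding code in graph_transformer_utils.cc may differ):

```cpp
// Hypothetical sketch: register ConvActivationFusion without the ROCm EP so
// the optimizer never rewrites Conv+Activation into FusedConv nodes that the
// ROCm EP has no kernel for. Exact types and names in the repo may differ.
const std::unordered_set<std::string> cpu_cuda_acl_armnn_eps = {
    onnxruntime::kCpuExecutionProvider, onnxruntime::kCudaExecutionProvider,
    onnxruntime::kAclExecutionProvider, onnxruntime::kArmNNExecutionProvider};
transformers.emplace_back(std::make_unique<ConvActivationFusion>(cpu_cuda_acl_armnn_eps));
```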
I just pushed a revised version of FusedConv which uses the Fusion API when possible. Due to limitations of MIOpen, I kept the old code path as the fallback; a sketch of the dispatch follows. For example, the new code path cannot handle FusedConvTest.Conv2D_Bias_Z_Relu, which requires adding both Z and Bias; that is not supported as of ROCm 5.1 (not tested on 5.2 yet).
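The fallback dispatch could be shaped roughly like this fragment. It reuses the two hypothetical helpers from the sketches earlier in this thread and an assumed `has_z_input` flag; it is not the PR's actual control flow.

```cpp
// Hypothetical dispatch: prefer the Fusion API plan; fall back to the
// separate-call path when the plan is unusable, e.g. when an extra Z addend
// is present (the Conv2D_Bias_Z_Relu case, unsupported as of ROCm 5.1).
// All variables and both helpers come from the earlier sketches.
miopenStatus_t s = miopenStatusUnknownError;
if (!has_z_input) {  // assumed flag: the fused plan cannot take Z
  s = RunFusedConvBiasRelu(h, x_desc, x, conv_desc, w_desc, w,
                           b_desc, b, y_desc, y);
}
if (s != miopenStatusSuccess) {
  // Old code path: conv, bias add, activation as separate launches. Handling
  // Z would take one more miopenOpTensor add before the activation step.
  s = UnfusedConvBiasActivation(h, x_desc, x, w_desc, w, conv_desc, algo,
                                workspace, workspace_bytes, b_desc, b,
                                act_desc, y_desc, y);
}
```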
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline
/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline
Azure Pipelines successfully started running 9 pipeline(s).
Azure Pipelines successfully started running 8 pipeline(s).
I noticed a flaw in the previous Fusion API usage and pushed a new commit to address it (see the commit message for details). This change is intentionally kept as a separate commit so that we can revert it once support for older MIOpen versions is deprecated and removed in the future.
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline
/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline
Azure Pipelines successfully started running 9 pipeline(s).
Azure Pipelines successfully started running 8 pipeline(s).
FYI - in progress - https://github.com/microsoft/onnxruntime/pull/12410
@ytaous @iK1D @mindest Updated a new version with full features. The only remaining issue is that, due to an interface limitation of MIOpen, the spatial dimension of the convolution operator can only be guessed, so errors like `MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/convolution_api.cpp:205: requestedSpatialDim is larger than actual spatial dimension` will be printed while computing the hash.
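To make the "guessing" concrete, the probing pattern looks roughly like this. `GuessSpatialDim` is a hypothetical helper of mine; miopenGetConvolutionNdDescriptor is the real getter, and its failure on an oversized request is what produces the logged error above.

```cpp
// MIOpen has no direct query for a convolution descriptor's rank:
// miopenGetConvolutionNdDescriptor fails (and MIOpen logs the
// "requestedSpatialDim is larger than actual spatial dimension" error) when
// the requested rank exceeds the actual one, so probe downward until a call
// succeeds. max_dim = 3 is an assumed upper bound for this sketch.
#include <miopen/miopen.h>

int GuessSpatialDim(miopenConvolutionDescriptor_t conv_desc, int max_dim = 3) {
  for (int requested = max_dim; requested >= 1; --requested) {
    int spatial_dim = 0;
    int pads[3], strides[3], dilations[3];  // sized for max_dim <= 3
    miopenConvolutionMode_t mode;
    if (miopenGetConvolutionNdDescriptor(conv_desc, requested, &spatial_dim,
                                         pads, strides, dilations,
                                         &mode) == miopenStatusSuccess) {
      return spatial_dim;  // first successful probe gives the actual rank
    }
  }
  return -1;  // no spatial dimension could be recovered
}
```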
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline
/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline
Azure Pipelines successfully started running 9 pipeline(s).
Azure Pipelines successfully started running 8 pipeline(s).
Thanks for your contribution.
When I use onnxruntime (latest main branch, v1.1.4.0) with the ROCm provider to run YOLO-series networks, I find that the fused conv+LeakyRelu operator still does not work properly. The error message is related to MIOpen:

```
2023-02-02 03:04:49.384769991 [E:onnxruntime:Default, rocm_call.cc:122 RocmCall] MIOPEN failure 1: miopenStatusNotInitialized ; GPU=0 ; hostname=dcunode3 ; expr=status_;
2023-02-02 03:04:49.384854310 [E:onnxruntime:, sequential_executor.cc:369 Execute] Non-zero status code returned while running FusedConv node. Name:'Conv_4' Status Message: MIOPEN failure 1: miopenStatusNotInitialized ; GPU=0 ; hostname=dcunode3 ; expr=status_;
Exception in thread Thread-1:
...
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running FusedConv node. Name:'Conv_4' Status Message: MIOPEN failure 1: miopenStatusNotInitialized ; GPU=0 ; hostname=dcunode3 ; expr=status_;
```

- ROCm version: 5.2.0