Runtime error when using the DML EP
Describe the issue
Based on https://github.com/instant-high/deoldify-onnx, we tried to deploy the model using the DML EP. The fp32 ONNX model runs well. We then converted it to an fp16 ONNX model using float16.convert_float_to_float16. There is no problem with the CPU EP; however, we observe a runtime error when using the DML EP.
Error message:
(D:\conda_envs\deoldify) D:\deoldify-onnx>python video.py --source "video.mp4" --result "video_colorized.mp4"
2024-05-20 16:17:13.7289822 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_de9340899c8cfefde68f4d8c5936aa80>::operator ()] Exception during initialization: C:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)\onnxruntime_pybind11_state.pyd!00007FFE6F0124D1: (caller: 00007FFE6EFE0CF5) Exception(1) tid(3d0c) 80070057 The parameter is incorrect.
Traceback (most recent call last):
File "D:\deoldify-onnx\video.py", line 43, in
To reproduce
- git clone https://github.com/instant-high/deoldify-onnx.git
- Download the model from https://drive.google.com/drive/folders/1bU9Zj7zGVEujIzvDTb1b9cyWU3s__WQR?usp=sharing
- Modify color/deoldify.py and color/deoldify_fp16.py to make them use the DML EP
- Convert the model to fp16 using float16.convert_float_to_float16.
- Command: python image.py --source_image "image.jpg" or python video.py --source "video.mp4" --result "video_colorized.mp4"
The fp32 model runs well with both the DML EP and the CPU EP. After converting it to fp16, the model still runs well with the CPU EP, but we observe a runtime error when using the DML EP.
Urgency
No response
Platform
Windows
OS Version
22631.3593
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.3
ONNX Runtime API
Python
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
onnxruntime-directml 1.18.0
We are also seeing this with some of our fp16 models. Our models run fine on RTX 20xx, RTX 30xx, and RTX 40xx cards, but they seem to fail on all AMD cards and some GTX 10xx cards. As a reference, these models all ran fine under onnxruntime 1.13.1, but when we switched to 1.17.1 they started failing.
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
This is still a genuine bug; I have worked around it by doing this during session configuration.
status = sOrtAPI->AddSessionConfigEntry(inOptions, kOrtSessionOptionsConfigDisableDmlGraphFusion, "1");
But this causes severe performance degradation: three times slower for at least one of our models, even on good hardware (an NVIDIA RTX 3070 mobile GPU). This same bug, or similar issues, has been written up more than once, by more than one person, using more than one model.
#21205 #20742 #20575
This regression seems to have been introduced by this pull request: #13131, or maybe #18160.
I experience the same issue with DirectML (inferencing Whisper). Adding the following didn't help me:
status = sOrtAPI->AddSessionConfigEntry(inOptions, kOrtSessionOptionsConfigDisableDmlGraphFusion, "1");
I found the root cause of that issue: the DirectML build of onnxruntime depends on DirectML.dll and picks up an old copy from the system directory. The solution is to use a newer DirectML.dll:
- Download it from nuget.org/packages/Microsoft.AI.DirectML (the "Download package" button).
- Unzip the package and copy the DirectML.dll inside the bin/x64-win folder into your executable's folder.
@thewh1teagle, I think we are seeing a different problem. I create the DirectML device directly using LoadLibraryEx, making sure to load the DirectML.dll (1.13.1) associated with the onnxruntime (1.17.1), use GetProcAddress to get the function DMLCreateDevice1, and then call SessionOptionsAppendExecutionProvider_DML1 with the device I create.
@siegelaaron94 Yes, we got the same exception: "DmlGraphFusionHelper.cpp 80070057 The parameter is incorrect". After debugging, we found it was IDMLDevice::CreateOperator with the DML_OPERATOR_GEMM operator type: when the channel count of the ATensor shape is bigger than 65535, IDMLDevice::CreateOperator returns E_INVALIDARG (we tested with input shape {65536, 1, 256}).
Does anyone know whether the 65535 channel count is a documented limitation? Thanks.
I also get this error, but it is not consistent: it occurs on some runs of the same application and not on others.
I use the onnxruntime and DirectML DLLs produced by installing the pip package:
pip install onnxruntime-directml
Let me know if anyone was able to figure out the root cause and fix it.