
Get error while using Dml EP

Open klin2024 opened this issue 1 year ago • 1 comments

Describe the issue

Based on https://github.com/instant-high/deoldify-onnx, we tried to deploy the model using the DML EP. The fp32 ONNX model runs well. We then converted it to an fp16 ONNX model with float16.convert_float_to_float16. The fp16 model works fine with the CPU EP, but we get a runtime error when we use the DML EP.

Error Message:

(D:\conda_envs\deoldify) D:\deoldify-onnx>python video.py --source "video.mp4" --result "video_colorized.mp4"
2024-05-20 16:17:13.7289822 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_de9340899c8cfefde68f4d8c5936aa80>::operator ()] Exception during initialization: C:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)\onnxruntime_pybind11_state.pyd!00007FFE6F0124D1: (caller: 00007FFE6EFE0CF5) Exception(1) tid(3d0c) 80070057 The parameter is incorrect.

Traceback (most recent call last):
  File "D:\deoldify-onnx\video.py", line 43, in <module>
    colorizer = DEOLDIFY(model_path="color/DeoldifyVideo_dyn_fp16.onnx", device="dml")
  File "D:\deoldify-onnx\color\deoldify_fp16.py", line 16, in __init__
    self.session = onnxruntime.InferenceSession(model_path, sess_options=session_options, providers=providers)
  File "D:\conda_envs\deoldify\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\conda_envs\deoldify\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: C:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)\onnxruntime_pybind11_state.pyd!00007FFE6F0124D1: (caller: 00007FFE6EFE0CF5) Exception(1) tid(3d0c) 80070057 The parameter is incorrect.

To reproduce

  1. git clone https://github.com/instant-high/deoldify-onnx.git
  2. Download the model from https://drive.google.com/drive/folders/1bU9Zj7zGVEujIzvDTb1b9cyWU3s__WQR?usp=sharing
  3. Modify color/deoldify.py and color/deoldify_fp16.py to make them use the DML EP
  4. Convert the model to fp16 using float16.convert_float_to_float16.
  5. Command: python image.py --source_image "image.jpg" or python video.py --source "video.mp4" --result "video_colorized.mp4"

The fp32 model runs well with both the DML EP and the CPU EP. After converting it to fp16, the model still runs well with the CPU EP, but we get a runtime error with the DML EP.

Urgency

No response

Platform

Windows

OS Version

22631.3593

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.3

ONNX Runtime API

Python

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

onnxruntime-directml 1.18.0

klin2024 avatar May 20 '24 23:05 klin2024

We are also seeing this with some of our fp16 models. Our models run fine on RTX-20XX, RTX-30XX, and RTX-40XX, but they seem to fail on all AMD cards and some GTX-10XX cards. For reference, these models all ran fine under onnxruntime 1.13.1, but when we switched to 1.17.1 they started failing.

siegelaaron94 avatar May 21 '24 16:05 siegelaaron94

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions[bot] avatar Jun 21 '24 15:06 github-actions[bot]

This is still a genuine bug; I have worked around it by setting the following during session configuration.

status = sOrtAPI->AddSessionConfigEntry(inOptions, kOrtSessionOptionsConfigDisableDmlGraphFusion, "1");

But this causes severe performance degradation: at least one of our models runs three times slower, even on good hardware (an NVIDIA RTX 3070 mobile GPU). The same bug, or closely related issues, has been reported more than once, by more than one person, with more than one model.

#21205 #20742 #20575

This regression seems to have been introduced by this pull request: #13131, or maybe #18160.

siegelaaron94 avatar Jul 12 '24 19:07 siegelaaron94

status = sOrtAPI->AddSessionConfigEntry(inOptions, kOrtSessionOptionsConfigDisableDmlGraphFusion, "1");

I experience the same issue with DirectML (running inference with Whisper). Adding that config entry didn't help me.

thewh1teagle avatar Jul 21 '24 01:07 thewh1teagle

I found the root cause of my issue: the DirectML build of onnxruntime depends on DirectML.dll and was picking up an old DLL from the system directory. The solution is to download a newer DirectML.dll from nuget.org/packages/Microsoft.AI.DirectML: click "Download package", unzip it, and move the DirectML.dll from the bin/x64-win folder into your executable's folder.
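To check which DirectML.dll your process actually loaded, something like this works (a sketch using ctypes; Windows-only, returns None elsewhere or when the DLL isn't loaded yet):

```python
import ctypes
import sys

def loaded_directml_path():
    """Path of the DirectML.dll loaded into this process, or None."""
    if sys.platform != "win32":
        return None  # Windows-only check
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetModuleHandleW.restype = ctypes.c_void_p
    kernel32.GetModuleHandleW.argtypes = [ctypes.c_wchar_p]
    handle = kernel32.GetModuleHandleW("DirectML.dll")
    if not handle:
        return None  # not loaded yet; create a DML session first
    buf = ctypes.create_unicode_buffer(1024)
    kernel32.GetModuleFileNameW.argtypes = [
        ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_uint32,
    ]
    kernel32.GetModuleFileNameW(handle, buf, len(buf))
    return buf.value
```

If this prints a path under System32 instead of your executable's folder, the old system copy is the one being used.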

thewh1teagle avatar Jul 21 '24 14:07 thewh1teagle

@thewh1teagle, I think we are seeing a different problem. I create the DirectML device directly: I use LoadLibraryEx, making sure to load the DirectML.dll (1.13.1) associated with onnxruntime (1.17.1), use GetProcAddress to get the function DMLCreateDevice1, and then call SessionOptionsAppendExecutionProvider_DML1 with the device I created.

siegelaaron94 avatar Jul 29 '24 16:07 siegelaaron94

SessionOptionsAppendExecutionProvider_DML1

@siegelaaron94 Yes. We got the same exception: "DmlGraphFusionHelper.cpp 80070057 The parameter is incorrect". After debugging, we found that IDMLDevice::CreateOperator returns E_INVALIDARG for the DML_OPERATOR_GEMM operator type when the channel count of the ATensor shape is larger than 65535 (we tested with input shape {65536, 1, 256}).

Does anyone know whether this 65535 channel count is a documented limitation? Thanks.

hedecai avatar Oct 02 '24 08:10 hedecai

I also get this error, but not consistently: sometimes it occurs when running the same application and sometimes it doesn't.

I use the onnxruntime and DirectML DLLs that come with the pip package (pip install onnxruntime-directml).

Let me know if anyone was able to figure out the root cause and fix it.

spgoswami1 avatar Dec 17 '24 09:12 spgoswami1