[ORT GPU (DML)][WebNN] Edge Canary browser crashed when running two WebNN QDQ subgraph tests

Open BruceDai opened this issue 1 month ago • 7 comments

Describe the issue

The Edge Canary browser crashed when running two WebNN QDQ subgraph tests, quantized leaky relu and quantized softmax, with the ORT default GPU (DML) EP. The error was:

GpuProcessHost: The GPU process crashed! Exit code: STATUS_ACCESS_VIOLATION.

/cc @fdwr PTAL, thanks!

To reproduce

  1. Install Windows App SDK Stable 1.8.2 (1.8.251003001)
  2. Install the latest Edge Canary browser
  3. Launch the Edge Canary browser, navigate to about://flags, enable the "Enables WebNN API" and "ONNX Runtime backend for WebNN" flags, then relaunch the browser
  4. Navigate to either test URL below; the crash occurs
| Test | Test URL |
| --- | --- |
| quantized leaky relu | https://wpt.live/webnn/conformance_tests/qdq_subgraph.https.any.html?device=gpu&tc=quantized%20leaky%20relu |
| quantized softmax | https://wpt.live/webnn/conformance_tests/qdq_subgraph.https.any.html?device=gpu&tc=quantized%20softmax |
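
For anyone scripting the repro, the two test URLs can be rebuilt from the test names. This is only a minimal sketch: `build_test_url` is a hypothetical helper name, and the `device` and `tc` query parameters are taken from the URLs in this report.

```python
from urllib.parse import quote, urlencode

# Base page of the WPT QDQ subgraph conformance tests, as given in this report.
BASE = "https://wpt.live/webnn/conformance_tests/qdq_subgraph.https.any.html"


def build_test_url(test_name, device="gpu"):
    """Build the URL for one test case; spaces are encoded as %20."""
    query = urlencode({"device": device, "tc": test_name}, quote_via=quote)
    return f"{BASE}?{query}"


for name in ("quantized leaky relu", "quantized softmax"):
    print(build_test_url(name))
```

`quote_via=quote` is used so that spaces become `%20`, matching the URLs above, instead of the `+` that `urlencode` produces by default.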

Urgency

No response

ONNX Runtime Installation

Released Package Windows App SDK Stable 1.8.2 (1.8.251003001)

ONNX Runtime Version or Commit ID

1.23.25.928

Execution Provider

ORT default GPU EP

BruceDai avatar Nov 07 '25 13:11 BruceDai

I can reproduce the crash in both cases. Here is some debugging info:

[4 images: debugging screenshots]

mingmingtasd avatar Nov 10 '25 03:11 mingmingtasd

I built the dlls and pdbs by myself and got more detailed debugging info:

[2 images: debugging screenshots]

The root cause is that the DML EP can't find kernels for QLinearSoftmax and QLinearLeakyRelu.

/cc @fdwr

mingmingtasd avatar Nov 18 '25 06:11 mingmingtasd

> The root cause is that DML EP can't find kernels for QLinearSoftmax and QLinearLeakyRelu.

That's some non-ONNX contrib op that DirectML doesn't know about (see operator registration). So the real questions are:

  • Why is an ORT transformer (possibly here?) blindly transforming an operator into something the EP doesn't support?
  • Why doesn't ORT realize this and fall back to another EP like it typically does?

fdwr avatar Nov 19 '25 02:11 fdwr

@fdwr Good questions!

I also verified that passing --webnn-ort-graph-optimization-level=BASIC, which lowers ONNX Runtime's graph optimization level from the default ENABLE_ALL to BASIC, makes the two crashing tests pass on the DML EP.

Quick fix/workaround in Chromium: if this issue blocks us, we can consider applying the BASIC level for the DML EP in Chromium until ORT fixes the issue. @BruceDai @huningxin
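
For reference outside the browser, the same knob exists in ONNX Runtime's Python API as `SessionOptions.graph_optimization_level`. The sketch below is a hedged illustration, not the Chromium change: `ORT_LEVELS`, `is_lower_level`, and `create_session` are hypothetical helper names, and it is an assumption that the Chromium switch corresponds to this session-level setting.

```python
# ONNX Runtime's graph optimization level names, ordered from lowest to highest.
ORT_LEVELS = (
    "ORT_DISABLE_ALL",
    "ORT_ENABLE_BASIC",
    "ORT_ENABLE_EXTENDED",
    "ORT_ENABLE_ALL",
)


def is_lower_level(a, b):
    """Return True if level `a` applies fewer optimizations than level `b`."""
    return ORT_LEVELS.index(a) < ORT_LEVELS.index(b)


def create_session(model_path, level="ORT_ENABLE_BASIC", providers=None):
    """Create an InferenceSession with a reduced graph optimization level."""
    # Imported lazily so the helpers above work without onnxruntime installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    options.graph_optimization_level = getattr(ort.GraphOptimizationLevel, level)
    return ort.InferenceSession(model_path, sess_options=options, providers=providers)
```

Calling `create_session("model.onnx")` with a QDQ model (the file name is a placeholder) would then run with only the basic optimizations applied.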

mingmingtasd avatar Nov 19 '25 07:11 mingmingtasd

/cc @adrastogi

huningxin avatar Dec 10 '25 03:12 huningxin

Trying to piece things together, could this be coming from the QDQ transformer?

https://github.com/microsoft/onnxruntime/blob/a83fc4d58cb48eb68890dd689f94f28288cf2278/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc#L151

Both the CPU and DML EPs are registered for LeakyRelu and Softmax, which are not supported based on what @fdwr was saying, but they end up getting assigned to the DML EP, which leads to the crash at runtime since they don't exist.

adrastogi avatar Dec 10 '25 22:12 adrastogi

> Both the CPU and DML EPs are registered for LeakyRelu and Softmax, which are not supported

LeakyRelu and Softmax are registered by the DML EP. https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/dml/DmlExecutionProvider/src/Operators/OperatorRegistration.cpp#L1061-L1062. QLinearSoftmax and QLinearLeakyRelu are not though, and so those should not be assigned to the DML EP.

> which leads to the crash at runtime since they don't exist

Yep.
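
The expected assignment behavior discussed in this thread can be sketched as a toy model. This is not ORT's partitioning code: `KERNELS` and `assign_ep` are hypothetical names, the DML entries reflect only what is stated above, and the CPU entries for the QLinear ops are an assumption made so the fallback has somewhere to land.

```python
# Toy kernel registries; only ops named in this thread are listed.
# The CPU entries for the QLinear ops are an assumption for illustration.
KERNELS = {
    "DmlExecutionProvider": {"LeakyRelu", "Softmax"},
    "CPUExecutionProvider": {
        "LeakyRelu",
        "Softmax",
        "QLinearLeakyRelu",
        "QLinearSoftmax",
    },
}

# EPs in priority order: try DML first, then fall back to CPU.
PRIORITY = ["DmlExecutionProvider", "CPUExecutionProvider"]


def assign_ep(op_type, preferred_eps=PRIORITY, kernels=KERNELS):
    """Return the first EP, in priority order, with a kernel for op_type."""
    for ep in preferred_eps:
        if op_type in kernels.get(ep, set()):
            return ep
    raise LookupError(f"no registered kernel for {op_type}")


for op in ("Softmax", "QLinearSoftmax", "QLinearLeakyRelu"):
    print(op, "->", assign_ep(op))
```

In this model the fused QLinear ops land on the CPU EP because the DML registry is checked again after fusion, which is the fallback the questions above expect.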

fdwr avatar Dec 11 '25 00:12 fdwr