[ORT GPU (DML)][WebNN] Edge Canary browser crashed when running two WebNN QDQ subgraph tests
Describe the issue
Edge Canary crashed when running two WebNN QDQ subgraph tests, quantized leaky relu and quantized softmax, with the ORT default GPU (DML) EP. The error was:
GpuProcessHost: The GPU process crashed! Exit code: STATUS_ACCESS_VIOLATION.
/cc @fdwr PTAL, thanks!
To reproduce
- Install Windows App SDK Stable 1.8.2 (1.8.251003001)
- Install the latest Edge Canary browser
- Launch Edge Canary, navigate to about://flags, enable the "Enables WebNN API" and "ONNX Runtime backend for WebNN" flags, then relaunch the browser
- Navigate to one of the test URLs below; the crash occurs
| Test | Test URL |
|---|---|
| quantized leaky relu | https://wpt.live/webnn/conformance_tests/qdq_subgraph.https.any.html?device=gpu&tc=quantized%20leaky%20relu |
| quantized softmax | https://wpt.live/webnn/conformance_tests/qdq_subgraph.https.any.html?device=gpu&tc=quantized%20softmax |
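For anyone scripting the repro, the two URLs in the table can be rebuilt from the test names. This is a minimal sketch: the helper name `build_test_url` is made up for illustration, and the only inputs are the base URL and the `device` and `tc` query parameters visible in the table above.

```python
from urllib.parse import quote, urlencode

# Base URL copied from the table above.
BASE_URL = "https://wpt.live/webnn/conformance_tests/qdq_subgraph.https.any.html"


def build_test_url(test_name: str, device: str = "gpu") -> str:
    """Return the WPT URL that runs a single named QDQ subgraph test case."""
    # quote_via=quote encodes spaces as %20; urlencode's default would emit '+'.
    query = urlencode({"device": device, "tc": test_name}, quote_via=quote)
    return f"{BASE_URL}?{query}"


for name in ("quantized leaky relu", "quantized softmax"):
    print(build_test_url(name))
```

Changing `device` in this helper only changes the query string; which values the test page accepts is not covered in this report.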
Urgency
No response
ONNX Runtime Installation
Released Package Windows App SDK Stable 1.8.2 (1.8.251003001)
ONNX Runtime Version or Commit ID
1.23.25.928
Execution Provider
ORT default GPU EP
I can reproduce the crash for both cases. Here is some debugging info:
I built the DLLs and PDBs myself and got more detailed debugging info:
The root cause is that DML EP can't find kernels for QLinearSoftmax and QLinearLeakyRelu.
/cc @fdwr
> The root cause is that DML EP can't find kernels for QLinearSoftmax and QLinearLeakyRelu.

Those are non-ONNX contrib ops that DirectML doesn't know about (see operator registration). So the real questions are:
- Why is an ORT transformer (possibly here?) blindly transforming an operator into something the EP doesn't support?
- Why doesn't ORT realize this and fall back to another EP like it typically does?
@fdwr Good questions!
I also verified that passing --webnn-ort-graph-optimization-level=BASIC, which lowers ONNX Runtime's graph optimization level from the default ENABLE_ALL to BASIC, makes the two crashing tests pass on the DML EP.
Quick workaround in Chromium:
If this issue blocks us, we can consider applying the BASIC level for the DML EP as a workaround in Chromium until ORT fixes the issue. @BruceDai @huningxin
/cc @adrastogi
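To try the workaround locally, the switch can be appended to the browser command line. A small sketch follows; the flag name and the BASIC value come from the comment above, while the helper name `edge_launch_args` and the `msedge.exe` path are placeholders, not anything taken from Chromium or Edge documentation.

```python
from typing import List

# Switch name as quoted in this thread.
OPT_LEVEL_FLAG = "--webnn-ort-graph-optimization-level"


def edge_launch_args(browser_path: str, level: str = "BASIC") -> List[str]:
    """Build an argv list that starts the browser with the lowered ORT level."""
    return [browser_path, f"{OPT_LEVEL_FLAG}={level}"]


# "msedge.exe" is a placeholder; substitute the real Edge Canary executable path.
args = edge_launch_args("msedge.exe")
print(" ".join(args))
```

The resulting list can be handed to `subprocess.Popen(args)`. The WebNN flags from the repro steps still need to be enabled separately.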
Trying to piece things together, could this be coming from the QDQ transformer?
https://github.com/microsoft/onnxruntime/blob/a83fc4d58cb48eb68890dd689f94f28288cf2278/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc#L151
Both the CPU and DML EPs are registered for LeakyRelu and Softmax, which are not supported based on what @fdwr was saying, but end up getting assigned to the DML EP, which leads to the crash at runtime since they don't exist.
> Both the CPU and DML EPs are registered for LeakyRelu and Softmax, which are not supported

LeakyRelu and Softmax are registered by the DML EP: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/dml/DmlExecutionProvider/src/Operators/OperatorRegistration.cpp#L1061-L1062 . QLinearSoftmax and QLinearLeakyRelu are not, though, so those should not be assigned to the DML EP.

> which leads to the crash at runtime since they don't exist
Yep.
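To make the expected behaviour concrete, here is a toy model of priority-based EP assignment with fallback. It is not ORT's partitioning code: the `KERNELS` table and the `assign_nodes` helper are invented for illustration, and the op lists are simplified to what this thread mentions. It only shows the outcome the discussion says should happen, namely that the fused QLinear ops land on an EP that actually has kernels for them.

```python
from typing import Dict, List, Set

# Illustrative kernel tables only; the real registries are far larger.
KERNELS: Dict[str, Set[str]] = {
    "DmlExecutionProvider": {
        "QuantizeLinear", "DequantizeLinear", "LeakyRelu", "Softmax",
    },
    "CPUExecutionProvider": {
        "QuantizeLinear", "DequantizeLinear", "LeakyRelu", "Softmax",
        "QLinearLeakyRelu", "QLinearSoftmax",
    },
}


def assign_nodes(op_types: List[str], ep_priority: List[str]) -> Dict[str, str]:
    """Assign each op to the first EP in priority order that has a kernel for it."""
    assignment: Dict[str, str] = {}
    for op in op_types:
        for ep in ep_priority:
            if op in KERNELS[ep]:
                assignment[op] = ep
                break
        else:
            # No EP can run this op; fail at assignment time, not at runtime.
            raise LookupError(f"no registered kernel for {op}")
    return assignment


priority = ["DmlExecutionProvider", "CPUExecutionProvider"]

# Unfused QDQ pattern: every op has a DML kernel in this model.
print(assign_nodes(["DequantizeLinear", "LeakyRelu", "QuantizeLinear"], priority))

# Fused ops: the model has no DML kernel for them, so they fall back to CPU.
print(assign_nodes(["QLinearLeakyRelu", "QLinearSoftmax"], priority))
```

In this model the crash scenario corresponds to a fused op keeping a DML assignment without the kernel lookup being repeated after fusion. Whether that is what actually happens inside ORT is still the open question in this thread.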