[WebGPU] `Error: [WebGPU] Kernel "[Mul] /head/istft/Mul_1" failed. Error: Failed to generate kernel's output[0] with dims [1,3520,3520]. If you are running with pre-allocated output, please make sure the output type/dims are correct. Error: 81415528.`
### Describe the issue
Unable to run https://huggingface.co/onnx-community/WavTokenizer-large-speech-75token_decode on WebGPU
[E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running Mul node. Name:'/head/istft/Mul_1' Status Message: Failed to run JSEP kernel failed to inference ONNX model: Error: [WebGPU] Kernel "[Mul] /head/istft/Mul_1" failed. Error: Failed to generate kernel's output[0] with dims [1,3520,3520]. If you are running with pre-allocated output, please make sure the output type/dims are correct. Error: 81415528.
### To reproduce
https://jsfiddle.net/Lq725aou/3/
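In case the fiddle goes stale, the repro boils down to roughly the following sketch with onnxruntime-web (the model filename and the `input_ids` input name are assumptions; the actual demo runs the model through transformers.js):

```ts
import * as ort from 'onnxruntime-web';

// Sketch only: the real demo loads
// onnx-community/WavTokenizer-large-speech-75token_decode via transformers.js.
const session = await ort.InferenceSession.create('decoder_model.onnx', {
  executionProviders: ['webgpu'],
});

// Dummy codes of shape [batch_size, sequence_length] = [1, 8].
const codes = new ort.Tensor(
  'int64',
  BigInt64Array.from({ length: 8 }, () => 0n),
  [1, 8],
);

// Fails inside the /head/istft subgraph on the WebGPU EP (1.20.1).
const outputs = await session.run({ input_ids: codes });
console.log(outputs);
```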
### Urgency
Blocks WebGPU for this demo: https://github.com/huggingface/transformers.js-examples/pull/17
### ONNX Runtime Installation
Released Package
### ONNX Runtime Version or Commit ID
1.20.1
### Execution Provider
'webgpu' (WebGPU)
#22997 was submitted to fix the shader bug in Transpose. However, it's quite suspicious that the input of a Transpose node is a 1-D tensor; I'm not sure whether an error earlier in the pipeline caused this.
Unfortunately the error still persists: https://jsfiddle.net/gf7b3ck6/4/
Reopening the issue; this needs further investigation.
The JS EP may handle NHWC incorrectly in this case. If the demo sets `preferredLayout: 'NCHW'`, there is no error.
I need more time to investigate the root cause.
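For anyone who needs a stopgap in the meantime, the workaround above looks roughly like this (a sketch, assuming onnxruntime-web's WebGPU EP option object with `preferredLayout`):

```ts
import * as ort from 'onnxruntime-web';

// Workaround sketch: keep the NCHW layout so the NCHW -> NHWC layout
// transform (and the problematic Transpose insertion) never happens.
const session = await ort.InferenceSession.create('decoder_model.onnx', {
  executionProviders: [{ name: 'webgpu', preferredLayout: 'NCHW' }],
});
```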
The issue also exists on the CUDA EP when compiling with the option --cmake_extra_defines onnxruntime_USE_CUDA_NHWC_OPS=ON, and it throws a different error message:
Non-zero status code returned while running Transpose node. Name:'Transpose_token_154' Status Message: perm size: 3 does not match input rank: 1
But the root cause is the same for both the JS EP and the CUDA EP; I think the /head/istft/Squeeze_1 node in the model is incorrect.
Per the spec https://onnx.ai/onnx/operators/onnx__Squeeze.html, since the input shape of the /head/istft/Squeeze_1 node is 3-D ([1,1,ConvTranspose_423_o0__d2]), the output shape should be 1-D ([ConvTranspose_423_o0__d2]). But the declared output shape is still 3-D ([ConvTranspose_423_o0__d0,ConvTranspose_423_o0__d1,ConvTranspose_423_o0__d2]), which leads to errors during the layout transform (NCHW -> NHWC) and the Transpose optimization.
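To make the spec behavior concrete, here is a small standalone sketch of Squeeze output-shape inference (my own helper, not ORT code), showing why [1, 1, N] must squeeze to [N]:

```ts
// Sketch of ONNX Squeeze shape inference. With no axes, every dim equal
// to 1 is removed; with axes, only the listed dims (which must be 1) are.
function squeezeShape(input: number[], axes?: number[]): number[] {
  if (axes === undefined) {
    return input.filter((d) => d !== 1);
  }
  const normalized = new Set(axes.map((a) => (a < 0 ? a + input.length : a)));
  return input.filter((d, i) => {
    if (!normalized.has(i)) return true;
    if (d !== 1) throw new Error(`cannot squeeze axis ${i} with dim ${d}`);
    return false;
  });
}

console.log(squeezeShape([1, 1, 3520])); // [3520] -- what the spec requires
// The model instead declares a 3-D output shape for /head/istft/Squeeze_1,
// which is what later breaks the layout transform and Transpose optimization.
```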
Let us take an input of shape int64[batch_size, sequence_length] = [1, 8] as an example. When using NCHW ops and calling session.initialize() to parse the model, the node output shapes parsed from the model are all 3-D (-1 denotes an unresolved dimension).
When calling session.run() to run the model, the actual output of the /head/istft/Squeeze_1 node is 1-D.
But because every dim except the innermost is 1, the model can still run correctly.
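One way to see why it still runs: under multidirectional broadcasting, a [1, 1, N] operand and a [N] operand behave the same. A toy broadcast-shape helper (standard NumPy-style rules, my own sketch):

```ts
// Sketch of ONNX multidirectional (NumPy-style) broadcasting.
function broadcastShape(a: number[], b: number[]): number[] {
  const rank = Math.max(a.length, b.length);
  const out: number[] = [];
  for (let i = 0; i < rank; i++) {
    const da = a[a.length - 1 - i] ?? 1; // missing leading dims act as 1
    const db = b[b.length - 1 - i] ?? 1;
    if (da !== db && da !== 1 && db !== 1) {
      throw new Error(`cannot broadcast [${a}] with [${b}]`);
    }
    out.unshift(Math.max(da, db));
  }
  return out;
}

// The stale 3-D shape and the correct 1-D shape broadcast identically,
// so every elementwise consumer downstream computes the same result:
console.log(broadcastShape([1, 1, 3520], [1, 1, 1])); // [1, 1, 3520]
console.log(broadcastShape([3520], [1, 1, 1]));       // [1, 1, 3520]
```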
When using NHWC ops, session.initialize() parses the model, applies the layout transform, and then traverses all nodes to apply the Transpose optimization. Before the Div node is traversed, a Transpose node has been pushed after the ConvTranspose node.
After the Div node is traversed, the Transpose node passes through the Div, and a new Transpose node with perm [0, 2, 1] is added after the Where node.
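For context, this optimization relies on the identity that a Transpose commutes with elementwise ops, e.g. Transpose(Div(a, b)) == Div(Transpose(a), Transpose(b)); that is only sound when each input really has the rank the perm expects. A toy 2-D illustration:

```ts
// Toy 2-D check of the "push Transpose through an elementwise op" rewrite.
const transpose2d = (m: number[][]): number[][] =>
  m[0].map((_, j) => m.map((row) => row[j]));
const div2d = (x: number[][], y: number[][]): number[][] =>
  x.map((row, i) => row.map((v, j) => v / y[i][j]));

const a = [[8, 6], [4, 2]];
const b = [[2, 3], [4, 1]];

// Transpose(Div(a, b)) equals Div(Transpose(a), Transpose(b)).
console.log(
  JSON.stringify(transpose2d(div2d(a, b))) ===
    JSON.stringify(div2d(transpose2d(a), transpose2d(b))),
); // true
```

In the failing model, however, the branch feeding the pushed-down Transpose is actually 1-D at run time, so the inserted perm [0, 2, 1] no longer matches its input.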
When using NHWC ops and session.run() to run the model, the actual output of the /head/istft/Squeeze_1 node is again 1-D.
The CUDA EP reports a useful error message when running the new Transpose node.
The JS EP does not validate the input rank against the perm size of Transpose and silently does nothing, so no error surfaces until the Mul node runs with an incorrect input shape of [1, 3520, 3520] (the expected shape is [1, 3520, 1]).
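The missing guard is small; here is a sketch of the kind of rank-vs-perm validation the CUDA EP effectively performs (not the actual ORT code):

```ts
// Sketch of Transpose input validation, mirroring the CUDA EP's message.
function validateTransposePerm(inputRank: number, perm: number[]): void {
  if (perm.length !== inputRank) {
    throw new Error(
      `perm size: ${perm.length} does not match input rank: ${inputRank}`,
    );
  }
  const seen = new Set(perm);
  if (seen.size !== perm.length || perm.some((p) => p < 0 || p >= inputRank)) {
    throw new Error(`perm [${perm}] is not a permutation of 0..${inputRank - 1}`);
  }
}

// The inserted Transpose has perm [0, 2, 1], but its input (the
// /head/istft/Squeeze_1 output) is 1-D at run time, so this fails fast
// instead of letting a wrong shape flow into Mul:
try {
  validateTransposePerm(1, [0, 2, 1]);
} catch (e) {
  console.log((e as Error).message); // perm size: 3 does not match input rank: 1
}
```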
In summary, I think we must first fix the output dims of the model nodes from the /head/istft/Squeeze_1 node onward. In addition, onnxruntime core should ensure that the Transpose node does not pass through the Div node and add a new Transpose node after the Where node.
@jchen10 @hujiajie
Wow, great debugging @xhcao! I upgraded to the latest dev build @1.21.0-dev.20250114-228dd16893 (demo), but am now facing a different issue:
failed to inference ONNX model: Error: [WebGPU] Kernel "[Transpose] Transpose_token_194" failed. Error: perm size 3 does not match input rank 1.
@xenova From my investigation, the main issue is the /head/istft/Squeeze_1 node of the model, and you should change the model. In the model, its input shape is tensor: float32[1,1,ConvTranspose_423_o0__d2] and its declared output shape is float32[ConvTranspose_423_o0__d0,ConvTranspose_423_o0__d1,ConvTranspose_423_o0__d2]; per the spec https://onnx.ai/onnx/operators/onnx__Squeeze.html, the output shape should be tensor: float32[ConvTranspose_423_o0__d2].
The model runs correctly on WASM though, so I would imagine this is still an issue with WebGPU? Perhaps this could be fixed by https://github.com/microsoft/onnxruntime/pull/23488?
@xenova This model also fails on the CUDA EP when onnxruntime_USE_CUDA_NHWC_OPS=ON is enabled. I think it works on the WASM EP because that EP uses the NCHW layout. The default layout of the JS EP is NHWC; if you set the layout to NCHW, the model also works correctly. The reason it works is explained in the comments above.
bump
Tested this on the latest native WebGPU EP; the bug remains:
2025-10-10 19:40:34.256 node[24803:54892007] 2025-10-10 19:40:34.256661 [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running Transpose node. Name:'Transpose_token_279' Status Message: perm size: 3 does not match input rank: 1
An error occurred during model execution: "Error: Non-zero status code returned while running Transpose node. Name:'Transpose_token_279' Status Message: perm size: 3 does not match input rank: 1".
cc @guschmue
bump (still persists on latest webgpu ep)