[WebGPU] `Error: [WebGPU] Kernel "[Mul] /head/istft/Mul_1" failed. Error: Failed to generate kernel's output[0] with dims [1,3520,3520]. If you are running with pre-allocated output, please make sure the output type/dims are correct. Error: 81415528.`
### Describe the issue
Unable to run https://huggingface.co/onnx-community/WavTokenizer-large-speech-75token_decode on WebGPU
[E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running Mul node. Name:'/head/istft/Mul_1' Status Message: Failed to run JSEP kernel failed to inference ONNX model: Error: [WebGPU] Kernel "[Mul] /head/istft/Mul_1" failed. Error: Failed to generate kernel's output[0] with dims [1,3520,3520]. If you are running with pre-allocated output, please make sure the output type/dims are correct. Error: 81415528.
### To reproduce
https://jsfiddle.net/Lq725aou/3/
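In case the fiddle goes stale, the repro boils down to roughly the following sketch with onnxruntime-web (the model filename and the `input_ids` input name are assumptions; the actual demo runs the model through transformers.js):

```ts
import * as ort from 'onnxruntime-web';

// Sketch only: the real demo loads
// onnx-community/WavTokenizer-large-speech-75token_decode via transformers.js.
const session = await ort.InferenceSession.create('decoder_model.onnx', {
  executionProviders: ['webgpu'],
});

// Dummy codes of shape [batch_size, sequence_length] = [1, 8].
const codes = new ort.Tensor(
  'int64',
  BigInt64Array.from({ length: 8 }, () => 0n),
  [1, 8],
);

// Fails inside the /head/istft subgraph on the WebGPU EP (1.20.1).
const outputs = await session.run({ input_ids: codes });
console.log(outputs);
```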
### Urgency
Blocks WebGPU for this demo: https://github.com/huggingface/transformers.js-examples/pull/17
### ONNX Runtime Installation
Released Package
### ONNX Runtime Version or Commit ID
1.20.1
### Execution Provider
'webgpu' (WebGPU)
#22997 was submitted to fix the shader bug in Transpose. However, it's quite suspicious that the input of a Transpose node is a 1-D tensor; I'm not sure whether an error earlier in the pipeline caused this.
Unfortunately the error still persists: https://jsfiddle.net/gf7b3ck6/4/
Reopening the issue; this needs further investigation.
The JS EP may handle NHWC incorrectly in this case. If the demo sets `preferredLayout: 'NCHW'`, there is no error.
I need more time to investigate the root cause.
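For anyone who needs a stopgap in the meantime, the workaround above looks roughly like this (a sketch, assuming onnxruntime-web's WebGPU EP option object with `preferredLayout`):

```ts
import * as ort from 'onnxruntime-web';

// Workaround sketch: keep the NCHW layout so the NCHW -> NHWC layout
// transform (and the problematic Transpose insertion) never happens.
const session = await ort.InferenceSession.create('decoder_model.onnx', {
  executionProviders: [{ name: 'webgpu', preferredLayout: 'NCHW' }],
});
```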
The issue also exists on the CUDA EP when compiling with the option --cmake_extra_defines onnxruntime_USE_CUDA_NHWC_OPS=ON, and it throws a different error message:
Non-zero status code returned while running Transpose node. Name:'Transpose_token_154' Status Message: perm size: 3 does not match input rank: 1
But the root cause is the same for both the JS EP and the CUDA EP; I think the /head/istft/Squeeze_1 node in the model is incorrect.
Per the spec https://onnx.ai/onnx/operators/onnx__Squeeze.html, since the input shape of the /head/istft/Squeeze_1 node is 3-D ([1,1,ConvTranspose_423_o0__d2]), the output shape should be 1-D ([ConvTranspose_423_o0__d2]). But the declared output shape is still 3-D ([ConvTranspose_423_o0__d0,ConvTranspose_423_o0__d1,ConvTranspose_423_o0__d2]), which leads to errors during the layout transform (NCHW -> NHWC) and the Transpose optimization.
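To make the spec behavior concrete, here is a small standalone sketch of Squeeze output-shape inference (my own helper, not ORT code), showing why [1, 1, N] must squeeze to [N]:

```ts
// Sketch of ONNX Squeeze shape inference. With no axes, every dim equal
// to 1 is removed; with axes, only the listed dims (which must be 1) are.
function squeezeShape(input: number[], axes?: number[]): number[] {
  if (axes === undefined) {
    return input.filter((d) => d !== 1);
  }
  const normalized = new Set(axes.map((a) => (a < 0 ? a + input.length : a)));
  return input.filter((d, i) => {
    if (!normalized.has(i)) return true;
    if (d !== 1) throw new Error(`cannot squeeze axis ${i} with dim ${d}`);
    return false;
  });
}

console.log(squeezeShape([1, 1, 3520])); // [3520] -- what the spec requires
// The model instead declares a 3-D output shape for /head/istft/Squeeze_1,
// which is what later breaks the layout transform and Transpose optimization.
```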
Let us take an input of shape int64[batch_size, sequence_length] = [1, 8] as an example. When using NCHW ops and calling session.initialize() to parse the model, the node output shapes parsed from the model are all 3-D (-1 denotes an unresolved dimension).
When calling session.run() to run the model, the actual output of the /head/istft/Squeeze_1 node is 1-D.
But because every dim except the innermost is 1, the model can still run correctly.
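One way to see why it still runs: under multidirectional broadcasting, a [1, 1, N] operand and a [N] operand behave the same. A toy broadcast-shape helper (standard NumPy-style rules, my own sketch):

```ts
// Sketch of ONNX multidirectional (NumPy-style) broadcasting.
function broadcastShape(a: number[], b: number[]): number[] {
  const rank = Math.max(a.length, b.length);
  const out: number[] = [];
  for (let i = 0; i < rank; i++) {
    const da = a[a.length - 1 - i] ?? 1; // missing leading dims act as 1
    const db = b[b.length - 1 - i] ?? 1;
    if (da !== db && da !== 1 && db !== 1) {
      throw new Error(`cannot broadcast [${a}] with [${b}]`);
    }
    out.unshift(Math.max(da, db));
  }
  return out;
}

// The stale 3-D shape and the correct 1-D shape broadcast identically,
// so every elementwise consumer downstream computes the same result:
console.log(broadcastShape([1, 1, 3520], [1, 1, 1])); // [1, 1, 3520]
console.log(broadcastShape([3520], [1, 1, 1]));       // [1, 1, 3520]
```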
When using NHWC ops, session.initialize() parses the model, applies the layout transform, and then traverses all nodes to apply the Transpose optimization. Before the Div node is traversed, a Transpose node has been pushed after the ConvTranspose node.
After the Div node is traversed, the Transpose node passes through the Div, and a new Transpose node with perm [0, 2, 1] is added after the Where node.
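For context, this optimization relies on the identity that a Transpose commutes with elementwise ops, e.g. Transpose(Div(a, b)) == Div(Transpose(a), Transpose(b)); that is only sound when each input really has the rank the perm expects. A toy 2-D illustration:

```ts
// Toy 2-D check of the "push Transpose through an elementwise op" rewrite.
const transpose2d = (m: number[][]): number[][] =>
  m[0].map((_, j) => m.map((row) => row[j]));
const div2d = (x: number[][], y: number[][]): number[][] =>
  x.map((row, i) => row.map((v, j) => v / y[i][j]));

const a = [[8, 6], [4, 2]];
const b = [[2, 3], [4, 1]];

// Transpose(Div(a, b)) equals Div(Transpose(a), Transpose(b)).
console.log(
  JSON.stringify(transpose2d(div2d(a, b))) ===
    JSON.stringify(div2d(transpose2d(a), transpose2d(b))),
); // true
```

In the failing model, however, the branch feeding the pushed-down Transpose is actually 1-D at run time, so the inserted perm [0, 2, 1] no longer matches its input.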
When using NHWC ops and session.run() to run the model, the actual output of the /head/istft/Squeeze_1 node is again 1-D.
The CUDA EP reports a useful error message when running the new Transpose node.
The JS EP does not validate the input rank against the perm size of Transpose and silently does nothing, so no error surfaces until the Mul node runs with an incorrect input shape of [1, 3520, 3520] (the expected shape is [1, 3520, 1]).
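The missing guard is small; here is a sketch of the kind of rank-vs-perm validation the CUDA EP effectively performs (not the actual ORT code):

```ts
// Sketch of Transpose input validation, mirroring the CUDA EP's message.
function validateTransposePerm(inputRank: number, perm: number[]): void {
  if (perm.length !== inputRank) {
    throw new Error(
      `perm size: ${perm.length} does not match input rank: ${inputRank}`,
    );
  }
  const seen = new Set(perm);
  if (seen.size !== perm.length || perm.some((p) => p < 0 || p >= inputRank)) {
    throw new Error(`perm [${perm}] is not a permutation of 0..${inputRank - 1}`);
  }
}

// The inserted Transpose has perm [0, 2, 1], but its input (the
// /head/istft/Squeeze_1 output) is 1-D at run time, so this fails fast
// instead of letting a wrong shape flow into Mul:
try {
  validateTransposePerm(1, [0, 2, 1]);
} catch (e) {
  console.log((e as Error).message); // perm size: 3 does not match input rank: 1
}
```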
In summary, I think we must first fix the output dims of the model nodes from the /head/istft/Squeeze_1 node onward. In addition, onnxruntime core should ensure that the Transpose node does not pass through the Div node and add a new Transpose node after the Where node.
@jchen10 @hujiajie
Wow, great debugging @xhcao! I upgraded to the latest dev build @1.21.0-dev.20250114-228dd16893 (demo), but am now facing a different issue:
failed to inference ONNX model: Error: [WebGPU] Kernel "[Transpose] Transpose_token_194" failed. Error: perm size 3 does not match input rank 1.
@xenova From my investigation, the main issue is the /head/istft/Squeeze_1 node of the model, and you should change the model. In the model, its input shape is tensor: float32[1,1,ConvTranspose_423_o0__d2] and its declared output shape is float32[ConvTranspose_423_o0__d0,ConvTranspose_423_o0__d1,ConvTranspose_423_o0__d2]; per the spec https://onnx.ai/onnx/operators/onnx__Squeeze.html, the output shape should be tensor: float32[ConvTranspose_423_o0__d2].
The model runs correctly on WASM though, so I would imagine this is still an issue with WebGPU? Perhaps this could be fixed by https://github.com/microsoft/onnxruntime/pull/23488?
@xenova This model also fails on the CUDA EP when onnxruntime_USE_CUDA_NHWC_OPS=ON is enabled. I think it works on the WASM EP because that EP uses the NCHW layout. The default layout of the JS EP is NHWC; if you set the layout to NCHW, the model also works correctly. The reason it works is explained in the comments above.
bump
Tested this on the latest native WebGPU EP; the bug remains:
2025-10-10 19:40:34.256 node[24803:54892007] 2025-10-10 19:40:34.256661 [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running Transpose node. Name:'Transpose_token_279' Status Message: perm size: 3 does not match input rank: 1
An error occurred during model execution: "Error: Non-zero status code returned while running Transpose node. Name:'Transpose_token_279' Status Message: perm size: 3 does not match input rank: 1".
cc @guschmue
bump (still persists on latest webgpu ep)