
[WebGPU] `Error: [WebGPU] Kernel "[Mul] /head/istft/Mul_1" failed. Error: Failed to generate kernel's output[0] with dims [1,3520,3520]. If you are running with pre-allocated output, please make sure the output type/dims are correct. Error: 81415528.`

Open xenova opened this issue 1 year ago • 13 comments

Describe the issue

Unable to run https://huggingface.co/onnx-community/WavTokenizer-large-speech-75token_decode on WebGPU

[E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running Mul node. Name:'/head/istft/Mul_1' Status Message: Failed to run JSEP kernel failed to inference ONNX model: Error: [WebGPU] Kernel "[Mul] /head/istft/Mul_1" failed. Error: Failed to generate kernel's output[0] with dims [1,3520,3520]. If you are running with pre-allocated output, please make sure the output type/dims are correct. Error: 81415528.

(screenshot of the error attached)

To reproduce

https://jsfiddle.net/Lq725aou/3/

Urgency

Blocks WebGPU for this demo: https://github.com/huggingface/transformers.js-examples/pull/17

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.20.1

Execution Provider

'webgpu' (WebGPU)

xenova avatar Dec 03 '24 17:12 xenova

#22997 has been submitted to fix the shader bug in Transpose. However, it's quite suspicious that the input of a Transpose node is a 1D tensor; I'm not sure whether that is caused by an error earlier in the graph.

fs-eire avatar Dec 03 '24 22:12 fs-eire

Unfortunately the error still persists: https://jsfiddle.net/gf7b3ck6/4/

xenova avatar Dec 07 '24 22:12 xenova

Reopening the issue; this needs further investigation.

fs-eire avatar Dec 08 '24 23:12 fs-eire

The JS EP may handle NHWC incorrectly in this case. If the demo sets preferredLayout: 'NCHW', there is no error. I need more time to investigate the root cause.
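
For reference, a minimal sketch of applying that NCHW workaround when creating the session with onnxruntime-web (the model URL and the exact import path are placeholders and may differ per version):

```js
import * as ort from 'onnxruntime-web/webgpu';

// Workaround sketch: ask the WebGPU EP to keep the NCHW layout instead of
// its default NHWC layout, so the failing NCHW -> NHWC layout transform is
// not applied. 'decoder_model.onnx' is a placeholder URL.
const session = await ort.InferenceSession.create('decoder_model.onnx', {
  executionProviders: [{ name: 'webgpu', preferredLayout: 'NCHW' }],
});
```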

xhcao avatar Dec 20 '24 09:12 xhcao

The issue also exists on the CUDA EP when compiling with --cmake_extra_defines onnxruntime_USE_CUDA_NHWC_OPS=ON, though it throws a different error message:

Non-zero status code returned while running Transpose node. Name:'Transpose_token_154' Status Message: perm size: 3 does not match input rank: 1

But the root cause for the JS EP and the CUDA EP is the same: I think the /head/istft/Squeeze_1 node in the model is not correct.

According to the spec https://onnx.ai/onnx/operators/onnx__Squeeze.html, since the input shape of the /head/istft/Squeeze_1 node is 3-D ([1,1,ConvTranspose_423_o0__d2]), the output shape should be 1-D ([ConvTranspose_423_o0__d2]). But its declared output shape is still 3-D ([ConvTranspose_423_o0__d0,ConvTranspose_423_o0__d1,ConvTranspose_423_o0__d2]), which leads to errors during the layout transform (NCHW -> NHWC) and the Transpose optimization. (see attached screenshot)
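
To make the expected behavior concrete, here is a small illustrative helper (a sketch, not ONNX Runtime code) that applies the spec's rule for Squeeze when the optional axes input is omitted: every dimension of size 1 is removed.

```js
// Shape inference for Squeeze per the ONNX spec when `axes` is omitted:
// all dimensions equal to 1 are dropped.
function squeezeShape(inputShape) {
  return inputShape.filter((dim) => dim !== 1);
}

// For /head/istft/Squeeze_1 the input shape is [1, 1, ConvTranspose_423_o0__d2],
// e.g. [1, 1, 3520], so the correct output shape is 1-D, not the 3-D shape
// declared in the model.
console.log(squeezeShape([1, 1, 3520])); // [ 3520 ]
```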

Let us take an input of shape int64[batch_size,sequence_length] = [1, 8] as an example. When using NCHW ops and calling session.initialize() to parse the model, the output shapes of the nodes are shown below; the shapes parsed from the model are all 3-D (-1 means an unfixed value). (see attached screenshot)

When calling session.run() to run the model, the output shapes of the nodes are shown below; the output of the /head/istft/Squeeze_1 node becomes 1-D. (see attached screenshot) But since all dims except the innermost one are 1, the model can still run correctly.

When using NHWC ops, during session.initialize() parsing of the model and after the layout transform, we traverse all nodes to apply the Transpose optimization. The topology before traversing the Div node is shown below; a Transpose node has been pushed after the ConvTranspose node. (see attached screenshot)

After traversing the Div node, the Transpose node passes through the Div, and a new Transpose node is added after the Where node, with perm [0, 2, 1]. (see attached screenshot)

When using NHWC ops and session.run() to run the model, the output shapes of the nodes are shown below; again the shape becomes 1-D at the /head/istft/Squeeze_1 node. (see attached screenshot) The CUDA EP reports a useful error message when running the new Transpose node. The JS EP does not validate the Transpose input rank against the perm size and silently continues, so it only throws an error when running the Mul node with the wrong input shape [1, 3520, 3520]; the expected shape is [1, 3520, 1].
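
As an illustration of the check the CUDA EP performs here (and what the JS EP currently skips), a hypothetical validation helper:

```js
// Reject a Transpose whose permutation length does not match the input rank,
// mirroring the CUDA EP error message quoted above.
function validateTransposePerm(inputShape, perm) {
  if (perm.length !== inputShape.length) {
    throw new Error(
      `perm size: ${perm.length} does not match input rank: ${inputShape.length}`,
    );
  }
}

// The failing case in this issue: a 1-D input paired with the 3-element
// perm [0, 2, 1] that was added during the layout transform.
validateTransposePerm([3520], [0, 2, 1]); // throws
```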

In summary, I think we must first fix the output dims of the model nodes after /head/istft/Squeeze_1, and ONNX Runtime core should ensure that the Transpose node does not pass through the Div node and add a new Transpose node after the Where node.

xhcao avatar Dec 26 '24 06:12 xhcao

@jchen10 @hujiajie

xhcao avatar Dec 26 '24 07:12 xhcao

Wow, great debugging @xhcao! I upgraded to the latest dev build @1.21.0-dev.20250114-228dd16893 (demo), but am now facing a different issue:

failed to inference ONNX model: Error: [WebGPU] Kernel "[Transpose] Transpose_token_194" failed. Error: perm size 3 does not match input rank 1.

(screenshot of the error attached)

xenova avatar Jan 17 '25 01:01 xenova

@xenova From my investigation, the main issue is the /head/istft/Squeeze_1 node of the model, so you should change the model. In the model, the node's input shape is tensor: float32[1,1,ConvTranspose_423_o0__d2] and its declared output shape is float32[ConvTranspose_423_o0__d0,ConvTranspose_423_o0__d1,ConvTranspose_423_o0__d2]; according to the spec https://onnx.ai/onnx/operators/onnx__Squeeze.html, the output shape should be tensor: float32[ConvTranspose_423_o0__d2].

xhcao avatar Feb 07 '25 01:02 xhcao

The model runs correctly on WASM though, so I would imagine this is still an issue with WebGPU? Perhaps this could be fixed by https://github.com/microsoft/onnxruntime/pull/23488?

xenova avatar Feb 07 '25 22:02 xenova

@xenova This model also fails on the CUDA EP when onnxruntime_USE_CUDA_NHWC_OPS=ON is enabled. I think it works on the WASM EP because that EP uses the NCHW layout. The default layout of the JS EP is NHWC; if you set the layout to NCHW, the model also works correctly there. The reason it works with NCHW is explained in my comments above.

xhcao avatar Feb 08 '25 01:02 xhcao

bump

xenova avatar Aug 07 '25 16:08 xenova

Testing this on the latest native WebGPU EP, the bug remains:

2025-10-10 19:40:34.256 node[24803:54892007] 2025-10-10 19:40:34.256661 [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running Transpose node. Name:'Transpose_token_279' Status Message: perm size: 3 does not match input rank: 1
An error occurred during model execution: "Error: Non-zero status code returned while running Transpose node. Name:'Transpose_token_279' Status Message: perm size: 3 does not match input rank: 1".

cc @guschmue

xenova avatar Oct 10 '25 23:10 xenova

bump (still persists on latest webgpu ep)

xenova avatar Dec 11 '25 18:12 xenova