Whisper webgpu vs wasm performance
System Info
[email protected] chrome 127 macos
Environment/Platform
- [X] Website/web-app
- [ ] Browser extension
- [ ] Server-side (e.g., Node.js, Deno, Bun)
- [ ] Desktop app (e.g., Electron)
- [ ] Other (e.g., VSCode extension)
Description
Using [email protected]
I can see that both the whisper-speaker-diarization and whisper-webgpu demos use const dtype = {encoder_model:"fp32", decoder_model_merged:"q4"}, while whisper-base.en_timestamped offers far more variants, including fp16.
However, when using fp16:
- {encoder_model:"fp32", decoder_model_merged:"fp16"} throws error 115998696. I have no idea what it means, and full-text search doesn't help either.
- {encoder_model:"fp16", decoder_model_merged:"fp16"} throws error 28388096.
- {encoder_model:"fp16", decoder_model_merged:"fp32"} raises no runtime exceptions, but the output text is very broken.
I am puzzled about the performance of the webgpu/wasm configurations in general. On a Mac mini M2 I observe the following transcription times for a 60-second source; note that wasm is faster than webgpu:
- fp32+q4/webgpu -> 9.5sec
- fp32+q4/wasm -> 5.9sec
- fp32+fp32/webgpu -> 9.6sec
- fp32+fp32/wasm -> 4.9sec
- q8+q8/webgpu -> 27sec
- q8+q8/wasm -> 5.2sec
Reproduction
const pipe = await pipeline("automatic-speech-recognition", "onnx-community/whisper-base_timestamped",
{"dtype": {"encoder_model": "fp32","decoder_model_merged": "q4"}, "device": "webgpu"})
const result = await pipe(audio, {chunk_length_s: 30, stride_length_s: 5, return_timestamps: "word", language: "en"});
The thrown errors can be reproduced on my side, and I will look into this:
- {encoder_model:"fp16", decoder_model_merged:"fp16"}
  - wasm: Uncaught 272174096
  - webgpu: Uncaught 27884576
- {encoder_model:"fp32", decoder_model_merged:"fp16"}
  - wasm: Uncaught 101271848
  - webgpu: Uncaught 153346408
- {encoder_model:"fp16", decoder_model_merged:"fp32"}
  - wasm: I have a dream that one day this nation will rise up and live out the true meaning of its creed
  - webgpu: I am a the the the the the the the the the the th…e the the the the the the the the the the the the
The issue of webgpu performance being worse than wasm cannot be reproduced on my side.
I tried fp32+q4 on an Intel UHD 630/i7-9700, on which webgpu is almost twice as fast as wasm:
- webgpu: Time taken: 5882.016845703125 ms
- wasm: Time taken: 10715.281982421875 ms
Could you please double-check the performance with the code below?
import { read_audio, pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/[email protected]';
function getParameter(name) {
const urlParams = new URLSearchParams(window.location.search);
return urlParams.get(name);
}
const backend = getParameter('backend');
console.log(backend);
let audio = await read_audio('https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac', 16000);
const pipe = await pipeline("automatic-speech-recognition", "onnx-community/whisper-base_timestamped",
{ "dtype": { "encoder_model": "fp32", "decoder_model_merged": "q4" }, "device": backend })
console.time("Time taken");
const result = await pipe(audio, { chunk_length_s: 30, stride_length_s: 5, return_timestamps: "word", language: "en" });
console.timeEnd("Time taken");
console.log(result);
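(The backend is selected through the page's query string, i.e. by loading it with ?backend=webgpu or ?backend=wasm.)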
@axinging I tried your code on a Mac mini M2 with Chrome 128.0.6613.114 and got the following results:
- wasm: Time taken: 2440.335205078125 ms
- webgpu: Time taken: 3063.0830078125 ms

This also confirms that it is faster to use wasm. Any ideas why?
I think we need to confirm whether this really runs on webgpu. Can you start Chrome with flags like "--enable-dawn-features=allow_unsafe_apis,use_dxc,dump_shaders --enable-features=SharedArrayBuffer"?
Then check the console and see if there is any dumped WGSL code like the following:
// Dumped WGSL:
enable f16;
struct Uniforms { output_size:u32, a_shape:vec3<u32>, a_strides:vec3<u32>, output_shape:vec3<u32>, output_strides:vec3<u32> };
@group(0) @binding(2) var<uniform> uniforms: Uniforms;
fn i2o_a(indices: vec3<u32>) -> u32 {
return uniforms.a_strides[2] * (indices[2])+uniforms.a_strides[1] * (indices[1])+uniforms.a_strides[0] * (indices[0]);
If you can see "Dumped WGSL" in console. then it means this run in webgpu, otherwise not.
Using the code and flags provided, I can see many logs like this in the console:
// Dumped WGSL:
enable f16;
Interestingly, after updating to Chrome 128.0.6613.120:
- 3.0.0-alpha.6 webgpu: Time taken: 3330.947021484375 ms
- 3.0.0-alpha.6 wasm: Time taken: 3658.69287109375 ms
- 3.0.0-alpha.14 webgpu: Time taken: 1798.175048828125 ms
- 3.0.0-alpha.14 wasm: Time taken: 3434.06689453125 ms
This is still an issue with v3.0.2, macOS 15.1 (Mac mini M2), Chrome 130, with webgpu available. For a 60-second source (PCM), wasm finishes in 6.2 sec, webgpu in 8.2 sec.
index.src.js
import { env, pipeline } from "@huggingface/transformers";

if (self.document) {
  env.allowLocalModels = false;
  // sample.pcm: raw float32 PCM samples of the 60-second test clip
  const buffer = await (await fetch("sample.pcm")).arrayBuffer();
  const device = "webgpu"; // switch to "wasm" for comparison
  const t0 = performance.now();
  const audio = new Float32Array(buffer);
  const pipe = await pipeline("automatic-speech-recognition", "onnx-community/whisper-base",
    { dtype: { encoder_model: "fp32", decoder_model_merged: "q4" }, device });
  const result = await pipe(audio, { chunk_length_s: 30, stride_length_s: 5, return_timestamps: true, language: "en" });
  document.body.append(`${device} finished in ${performance.now() - t0}`);
}
index.html
<script src="index.js" type="module"></script>
build:
npm install --save-exact --save-dev esbuild
npm install @huggingface/[email protected]
./node_modules/.bin/esbuild index.src.js --bundle --format=esm --target=esnext --outfile=index.js
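For completeness, here is a small variation of the snippet above (same model, dtype, and pipeline options; just a sketch, not my original setup) that times both backends back-to-back in a single page load, so both numbers come from identical conditions:

import { env, pipeline } from "@huggingface/transformers";

env.allowLocalModels = false;
const buffer = await (await fetch("sample.pcm")).arrayBuffer();
const audio = new Float32Array(buffer);

for (const device of ["webgpu", "wasm"]) {
  const t0 = performance.now();
  const pipe = await pipeline("automatic-speech-recognition", "onnx-community/whisper-base",
    { dtype: { encoder_model: "fp32", decoder_model_merged: "q4" }, device });
  await pipe(audio, { chunk_length_s: 30, stride_length_s: 5, return_timestamps: true, language: "en" });
  // Note: the first iteration also pays for the model download; rerun with a warm cache for a fair comparison.
  document.body.append(`${device} finished in ${performance.now() - t0} ms; `);
}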
Any ideas why webgpu is slower than wasm?