Whisper webgpu vs wasm performance
System Info
[email protected] chrome 127 macos
Environment/Platform
- [X] Website/web-app
- [ ] Browser extension
- [ ] Server-side (e.g., Node.js, Deno, Bun)
- [ ] Desktop app (e.g., Electron)
- [ ] Other (e.g., VSCode extension)
Description
Using [email protected]
I can see that both the whisper-speaker-diarization and whisper-webgpu demos use const dtype = {encoder_model:"fp32", decoder_model_merged:"q4"}, while whisper-base.en_timestamped offers far more variants, including fp16.
However, when using fp16:
- {encoder_model:"fp32", decoder_model_merged:"fp16"} throws error 115998696. I have no idea what it means, and full-text search doesn't help either.
- {encoder_model:"fp16", decoder_model_merged:"fp16"} throws error 28388096.
- {encoder_model:"fp16", decoder_model_merged:"fp32"} raises no runtime exceptions, but the output text is very broken.
I am puzzled about the performance of the webgpu/wasm configurations in general. On a Mac mini M2 I observe the following transcription times for a 60-second source; note that wasm is faster than webgpu:
- fp32+q4/webgpu -> 9.5sec
- fp32+q4/wasm -> 5.9sec
- fp32+fp32/webgpu -> 9.6sec
- fp32+fp32/wasm -> 4.9sec
- q8+q8/webgpu -> 27sec
- q8+q8/wasm -> 5.2sec
Reproduction
const pipe = await pipeline("automatic-speech-recognition", "onnx-community/whisper-base_timestamped",
{"dtype": {"encoder_model": "fp32","decoder_model_merged": "q4"}, "device": "webgpu"})
const result = await pipe(audio, {chunk_length_s: 30, stride_length_s: 5, return_timestamps: "word", language: "en"});
The thrown errors can be reproduced on my side, and I will look into this:
- {encoder_model:"fp16", decoder_model_merged:"fp16"}
  - wasm: Uncaught 272174096
  - webgpu: Uncaught 27884576
- {encoder_model:"fp32", decoder_model_merged:"fp16"}
  - wasm: Uncaught 101271848
  - webgpu: Uncaught 153346408
- {encoder_model:"fp16", decoder_model_merged:"fp32"}
  - wasm: I have a dream that one day this nation will rise up and live out the true meaning of its creed
  - webgpu: I am a the the the the the the the the the the th…e the the the the the the the the the the the the
The issue of webgpu performance being worse than wasm cannot be reproduced on my side.
I tried fp32+q4 on an Intel UHD 630/i7-9700, on which webgpu is almost twice as fast as wasm:
- webgpu: Time taken: 5882.016845703125 ms
- wasm: Time taken: 10715.281982421875 ms
Could you please double-check the performance with the code below?
import { read_audio, pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/[email protected]';
function getParameter(name) {
const urlParams = new URLSearchParams(window.location.search);
return urlParams.get(name);
}
const backend = getParameter('backend');
console.log(backend);
let audio = await read_audio('https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac', 16000);
const pipe = await pipeline("automatic-speech-recognition", "onnx-community/whisper-base_timestamped",
{ "dtype": { "encoder_model": "fp32", "decoder_model_merged": "q4" }, "device": backend })
console.time("Time taken");
const result = await pipe(audio, { chunk_length_s: 30, stride_length_s: 5, return_timestamps: "word", language: "en" });
console.timeEnd("Time taken");
console.log(result);
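(The backend is selected through the page's query string, i.e. by loading it with ?backend=webgpu or ?backend=wasm.)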
@axinging I tried your code on a Mac mini M2 with Chrome 128.0.6613.114 and got the following results:
- wasm: Time taken: 2440.335205078125 ms
- webgpu: Time taken: 3063.0830078125 ms

This also confirms that it is faster to use wasm. Any ideas why?
I think we need to confirm whether this really runs on webgpu. Can you start Chrome with flags like "--enable-dawn-features=allow_unsafe_apis,use_dxc,dump_shaders --enable-features=SharedArrayBuffer"?
Then check the console and see if there is any dumped WGSL code like the following:
// Dumped WGSL:
enable f16;
struct Uniforms { output_size:u32, a_shape:vec3<u32>, a_strides:vec3<u32>, output_shape:vec3<u32>, output_strides:vec3<u32> };
@group(0) @binding(2) var<uniform> uniforms: Uniforms;
fn i2o_a(indices: vec3<u32>) -> u32 {
return uniforms.a_strides[2] * (indices[2])+uniforms.a_strides[1] * (indices[1])+uniforms.a_strides[0] * (indices[0]);
If you can see "Dumped WGSL" in console. then it means this run in webgpu, otherwise not.
Using the code and flags provided, I can see many logs like this in the console:
// Dumped WGSL:
enable f16;
Interestingly, after updating to Chrome 128.0.6613.120:
- 3.0.0-alpha.6 webgpu: Time taken: 3330.947021484375 ms
- 3.0.0-alpha.6 wasm: Time taken: 3658.69287109375 ms
- 3.0.0-alpha.14 webgpu: Time taken: 1798.175048828125 ms
- 3.0.0-alpha.14 wasm: Time taken: 3434.06689453125 ms
This is still an issue with v3.0.2, macOS 15.1 (Mac mini M2), Chrome 130, with webgpu available. For a 60-second source (PCM), wasm finishes in 6.2 sec, webgpu in 8.2 sec.
index.src.js
import { env, pipeline } from "@huggingface/transformers";

if (self.document) {
  env.allowLocalModels = false;
  // sample.pcm: raw float32 PCM samples of the 60-second test clip
  const buffer = await (await fetch("sample.pcm")).arrayBuffer();
  const device = "webgpu"; // switch to "wasm" for comparison
  const t0 = performance.now();
  const audio = new Float32Array(buffer);
  const pipe = await pipeline("automatic-speech-recognition", "onnx-community/whisper-base",
    { dtype: { encoder_model: "fp32", decoder_model_merged: "q4" }, device });
  const result = await pipe(audio, { chunk_length_s: 30, stride_length_s: 5, return_timestamps: true, language: "en" });
  document.body.append(`${device} finished in ${performance.now() - t0}`);
}
index.html
<script src="index.js" type="module"></script>
build:
npm install --save-exact --save-dev esbuild
npm install @huggingface/[email protected]
./node_modules/.bin/esbuild index.src.js --bundle --format=esm --target=esnext --outfile=index.js
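For completeness, here is a small variation of the snippet above (same model, dtype, and pipeline options; just a sketch, not my original setup) that times both backends back-to-back in a single page load, so both numbers come from identical conditions:

import { env, pipeline } from "@huggingface/transformers";

env.allowLocalModels = false;
const buffer = await (await fetch("sample.pcm")).arrayBuffer();
const audio = new Float32Array(buffer);

for (const device of ["webgpu", "wasm"]) {
  const t0 = performance.now();
  const pipe = await pipeline("automatic-speech-recognition", "onnx-community/whisper-base",
    { dtype: { encoder_model: "fp32", decoder_model_merged: "q4" }, device });
  await pipe(audio, { chunk_length_s: 30, stride_length_s: 5, return_timestamps: true, language: "en" });
  // Note: the first iteration also pays for the model download; rerun with a warm cache for a fair comparison.
  document.body.append(`${device} finished in ${performance.now() - t0} ms; `);
}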
Any ideas why webgpu is slower than wasm?