
Whisper on WebGPU?

Open sandorkonya opened this issue 1 year ago • 16 comments

Somewhat related to this thread.

Is it within scope to implement a WebGPU-accelerated version of Whisper?

Not sure if this helps, but there is a C port of Whisper with a CPU implementation, and as mentioned in this discussion, the main thing that needs to be offloaded to the GPU is the GGML_OP_MUL_MAT operator.

sandorkonya · Apr 25 '23 09:04

Is it within scope to implement a WebGPU-accelerated version of Whisper?

As I understand it, it's simply a matter of changing the execution provider to JSEP now. The C++ port uses the GGML format for the model, while this repo uses ONNX models alongside onnxruntime to run inference; the two implementations are different. And with WebGPU support for onnxruntime (check this PR: [js/web] WebGPU backend via JSEP #14579), which was merged today and whose official release build should come soon enough, I believe we don't have to worry about CUDA or DirectML endpoints; JSEP does the work for us. It's only a matter of updating the onnxruntime dependency and using JSEP as the execution provider.

@xenova correct me if I'm wrong.

DK013 · Apr 25 '23 11:04

Yep, that's correct! It should be as simple as changing the execution provider to webgpu (vs. wasm).

Hopefully they will make the release soon, but in the meantime, I'll do some testing by building the main branch locally.
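
In onnxruntime-web terms, that should just be the session options. A quick sketch, assuming a JSEP-enabled ort-web build ('model.onnx' is a placeholder path):

```js
import * as ort from 'onnxruntime-web';

// Execution providers are tried in order: prefer WebGPU,
// fall back to the wasm backend if it isn't available.
const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webgpu', 'wasm'],
});
```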

xenova · Apr 25 '23 12:04

@DK013 & @xenova thank you for the clarification!

I would like to find a way to utilize the GPUs on edge devices (Android mobile) for inference.

As far as I understand, WebGPU currently works on Windows and macOS (my assumption based on this blog post), so we have to wait until WebGPU targets Android devices too?

Or am I simply wrong and onnxruntime won't be the way for edge devices?

best regards

sandorkonya · Apr 25 '23 13:04

Yes, you are correct. WebGPU would need to be available in your browser, as onnxruntime just uses the API provided by the browser.

That said, you might not have to wait for very long. As stated in the blog post you linked: "This initial release of WebGPU is available on ChromeOS, macOS, and Windows. Support for other platforms is coming later this year." If you'd like to test while you develop (so you can be ready when it releases fully), you can use Chrome Canary. As demoed here, some users have already got WebGPU running on their Android devices with this browser (which is just an experimental version of Chrome).
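
If you want to guard that code path while you develop, here's a minimal feature check using the standard WebGPU API (nothing transformers.js-specific):

```js
// True only if the browser exposes WebGPU *and* can actually
// provide an adapter for the underlying GPU.
async function hasWebGPU() {
  if (!('gpu' in navigator)) return false;        // API not exposed at all
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;                        // null => no usable GPU
}
```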

xenova · Apr 25 '23 13:04

@xenova how can we use GPU power with Node.js?

I tried to build a local server with Node; everything works, but it's very slow on an AMD 5950X. I would like to use my RTX 4070 Ti to transcribe, but I couldn't find any documentation that talks about it.

drcodecamp · Sep 28 '23 07:09

@xenova, is there any news? Will we be able to use WebGPU with transformers.js any time soon?

Dolidodzik · Nov 22 '23 15:11

AFAIU onnxruntime's support for WebGPU is still pretty minimal/experimental, so it likely isn't able to run Whisper today.

Overview issue is here: https://github.com/microsoft/onnxruntime/issues/15796

There doesn't seem to be much up-to-date, detailed documentation about the current status publicly available, but as of May many operators had yet to be ported: https://github.com/microsoft/onnxruntime/issues/15952

gabrielgrant · Nov 28 '23 18:11

ort-web on WebGPU now has good op coverage and we can run most models that transformers.js supports. Whisper is fine; it is part of our test suite. The reason we have not been more public about it is that we still have a performance issue with generative decoders that go one token at a time (i.e. the Whisper decoder, the T5 decoder). We are debugging that; we don't know what the cause is, but we are sure it is not the shaders. All encoders and vision models should see good perf gains. Supported ops can be found here: https://github.com/microsoft/onnxruntime/blob/main/js/web/docs/webgpu-operators.md
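
To illustrate the shape of the problem, here is a rough sketch of greedy decoding (illustrative only, not the actual transformers.js internals; the helper names and token ids are hypothetical):

```js
// The decoder session runs once per generated token, so any per-run
// overhead (dispatch, cross-device copies) is paid on every token
// rather than once per audio clip.
let tokens = [decoderStartTokenId];                         // hypothetical id
while (tokens.length < maxNewTokens) {
  const feeds = buildDecoderFeeds(encoderOutput, tokens);   // hypothetical helper
  const outputs = await decoderSession.run(feeds);          // one round-trip per token
  const next = argmaxLastStep(outputs.logits);              // hypothetical helper
  tokens.push(next);
  if (next === eosTokenId) break;                           // hypothetical id
}
```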

guschmue · Dec 07 '23 22:12

thanks for the update @guschmue !

Is there a GH issue for the problem you're describing? Is it this? https://github.com/microsoft/onnxruntime/issues/17373

gabrielgrant · Dec 08 '23 00:12

That issue contains a couple of problems: missing ops resulted in cross-device copies, and missing io-bindings resulted in a lot of cross-device copies. I think we fixed most of those, but this decoder issue has been in there too, i.e. the io-bindings should have gained much more than they did. Nasty issue: plenty of GPU cycles available, kernel times look good, little cross-device copying, yet it's 2x slower than we want. Top of our list. I can file a separate issue.
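
For readers following along: io-bindings let tensors stay on the GPU between run() calls instead of round-tripping through JS memory on every step. A rough sketch of the ort-web knobs involved, as I understand them (check the current onnxruntime-web docs; the model path and tensor names are placeholders):

```js
import * as ort from 'onnxruntime-web';

// Ask ort-web to leave outputs on the GPU rather than copying
// them back into a CPU-side typed array after each run().
const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webgpu'],
  preferredOutputLocation: 'gpu-buffer',
});

const feeds = { input: new ort.Tensor('float32', new Float32Array(4), [1, 4]) }; // placeholder
const results = await session.run(feeds);

// results.output.gpuBuffer is a GPUBuffer that can be fed into the
// next run via ort.Tensor.fromGpuBuffer(...) without a host copy.
```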

guschmue · Dec 08 '23 01:12

https://github.com/microsoft/onnxruntime/issues/18754

guschmue · Dec 08 '23 01:12

What about Node.js? Will WebGPU/GPU acceleration be available on the server/desktop side without a browser?

gokaybiz · Dec 08 '23 14:12

@xenova I am curious to try. Do you have builds with WebGPU?

I've built onnxruntime with the JSEP option, but I'm not entirely sure which spots to change in transformers.js. Is it as simple as passing executionProviders to ort.InferenceSession.create?

tarekziade · Dec 17 '23 14:12

Additionally, another optimization should be done: the STFT (the short-time Fourier transform used to turn the audio into Whisper's input spectrogram).

DavidGOrtega · Jan 16 '24 10:01

For anyone coming here who hasn't seen it yet: there is WebGPU support now, thanks to Xenova's efforts described here.

Code in this branch: https://github.com/xenova/whisper-web/tree/experimental-webgpu
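
In newer transformers.js releases (v3), this surfaces as a device option on pipeline. A minimal sketch (the model id is just an example):

```js
import { pipeline } from '@huggingface/transformers';

// 'device: webgpu' selects the WebGPU execution provider instead
// of the default wasm backend.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en',        // example model id
  { device: 'webgpu' }
);

const { text } = await transcriber('audio.wav');  // URL or path to an audio file
console.log(text);
```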

nmstoker · Jun 09 '24 20:06

What about Node.js? Will WebGPU/GPU acceleration be available on the server/desktop side without a browser?

There is an experimental code path in Dawn that one could use to make onnxruntime work with WebGPU on Node.js, but we are not sure people would use that path, since onnxruntime-node already supports CUDA and DirectML, which are faster than WebGPU.
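
A sketch of that onnxruntime-node path (assuming the GPU-enabled onnxruntime-node binaries are installed; 'model.onnx' is a placeholder):

```js
import * as ort from 'onnxruntime-node';

// Execution providers are tried in order; 'cuda' requires the
// CUDA-enabled build ('dml' is the DirectML analogue on Windows).
const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['cuda', 'cpu'],
});
```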

guschmue · Jun 10 '24 16:06