
WebGPU support

Open sluijs opened this issue 2 years ago • 30 comments

Is WebGPU support on the roadmap as an alternative GPU-accelerated backend? This would be especially useful for inference on the web or for non-CUDA environments.

sluijs avatar Aug 08 '23 15:08 sluijs

WebGPU is certainly on our radar; we already have some wasm-based demos for llama2.c and whisper that you can try in a web browser. When using wasm, candle should leverage your CPU's SIMD instructions, but having WebGPU on top of this would bring it to a far better level.

LaurentMazare avatar Aug 08 '23 16:08 LaurentMazare

And if/when candle adds WebGPU support, I'll add it as a backend to Transformers.js! 🚀 Really exciting times! 🔥

xenova avatar Aug 08 '23 16:08 xenova

Hi! I'd be interested in working on this. I've spent some time thinking about a rough plan after reading through the code:

  1. Move candle-kernels to candle-cuda or candle-cuda-kernels (the name can be bikeshed'd in the PR)
  2. Make a candle-wgsl(-kernels) crate, with kernels implementing the ops needed. Can maybe re-use some implementations based on https://github.com/webonnx/wonnx
  3. Add a new backend implementation using wgpu-rs to execute the kernels
  4. Add tests, info on how to run things
  5. Maybe add flash attention kernels -- might be a lot of work so probably worth its own follow-up issue.

Some questions:

  • Is wrapping with wgpu-rs acceptable? I see that the cuda wrappers use cudarc, and wgpu-rs seems to be the closest equivalent for wgsl shaders for compute. It would also be an easy way to bridge things on native to Vulkan and Metal as well (would take more work though, e.g. integrating naga).
  • Do you want all kernels in one crate? In the above I suggest splitting them into their own backend-specific crates, but I guess that could be abstracted into the candle-kernels crate itself if you prefer.
  • I will probably focus on the web for now, because I think it is the most promising, but I also will probably follow up with testing Vulkan, as I think that will make inference much more portable.

emmatyping avatar Aug 09 '23 07:08 emmatyping

Sounds like a very reasonable plan.

I think we can start working on this without tying it too much to candle; other projects might be interested in having WebGPU support too. (That's why having cudarc is great: it can be used by other projects, not necessarily candle, and why we keep pushing changes upstream as much as possible, like the NCCL support.)

wgpu-rs: last time I tried Vulkan with compute shaders, the performance was abysmal. And it makes sense: it's not really designed for ML. In general I would go for the most performant solution from the start, not have backends just for the sake of it. AMD already has libraries intended for ML (https://www.amd.com/en/graphics/servers-solutions-rocm); we could link to that directly if it makes more sense.

For Metal I would have the same opinion: we should try to make the Metal support usable outside of this crate and be mere users of it.

For any new backend, it is very important to create a way for USERS to create their own kernel/op. It's impossible to keep up with all the innovation, imho, so the most important thing is to allow users of candle to use any op they want without having to wait for us to implement it.

Narsil avatar Aug 09 '23 08:08 Narsil

re wgpu-rs: I certainly agree that native backends are best; I only bring up Vulkan/Metal as bonuses. I was suggesting wgpu-rs because it is the major WebGPU library for Rust, and it looks like Burn uses it. So I think it is the best library for the job; I just wanted to check whether adding the dependency was acceptable. The alternative would be to write a bunch of bindings around the WebGPU APIs via web-sys.

For any new backend, it is very important to create a way for USERS to create their own kernel/op.

Certainly! I was mostly discussing the crate rename/split for the candle-provided kernels. For user-written kernels, would it not be best to simply add a wgpu_fwd method to the Op{N} traits that the user may implement? Are there other details I should be aware of?

emmatyping avatar Aug 09 '23 09:08 emmatyping

Certainly! I was mostly discussing the crate rename/split for the candle-provided kernels. For user-written kernels, would it not be best to simply add a wgpu_fwd method to the Op{N} traits that the user may implement? Are there other details I should be aware of?

Basically yes. Tensor is Send+Sync, therefore Op needs to be Send+Sync (because it's kept for gradients). That could end up being a limitation: https://github.com/huggingface/candle/blob/main/candle-examples/examples/llama_multiprocess/model.rs#L33-L38

I think it is the best library for the job

What other libraries or alternatives are there? Looking at this: https://www.reddit.com/r/rust/comments/159cbto/announcing_burnwgpu_new_deep_learning/jtf80xq/ I have the feeling it's not the right approach. We need only WebGPU, not those ten other things. In any case, this is not on our short-term roadmap.

Narsil avatar Aug 09 '23 11:08 Narsil

Basically yes. Tensor is Send+Sync

Good news there: WebAssembly doesn't have OS-style threads! The web-worker-based "threads" might require things to be Send/Sync, but I will have to look closer at that.

What other libraries or alternatives are there ?

Honestly, I didn't find any that seemed currently maintained or more than toys.

We need only webGPU, not those 10 other things.

Yeah, it's possible that wgpu isn't the right project, as it is pretty large, but those other features are optional, so I don't know how much it hurts to include it.

In any case this is not in our short term roadmap.

Fair enough!

emmatyping avatar Aug 09 '23 11:08 emmatyping

Is WebGPU supported now?

helloburke avatar Sep 18 '23 06:09 helloburke

Is there any plan for candle's WebAssembly build to support WebGPU?

guyoung avatar Oct 18 '23 04:10 guyoung

One general comment.

Move candle-kernels to candle-cuda or candle-cuda-kernels (the name can be bikeshed'd in the PR)

I feel like writing those compute shaders in GLSL might be a better option. I have done some rough testing of GPGPU performance, and Vulkan with GLSL seems able to keep up with CUDA, while wgpu with WGSL hits a bottleneck pretty early with the same optimization tricks. On top of that, WebGPU supports GLSL as well, so we could have not only a WebGPU backend but a Vulkan one too (I guess for folks who still want to run natively but don't have the luxury of an NVIDIA GPU, only an Intel/AMD one).

minghuaw avatar Nov 12 '23 08:11 minghuaw

@LaurentMazare are there any plans to start implementing a WebGPU backend? I see Ratchet has successfully implemented WebGPU inference and would love to see this in Candle soon as well. I would love to help with this implementation if possible too, but there's a lot of learning on my part to be done before I do so, so ideally would love to chat more since I'd have a lot of questions.

santiagomed avatar Jun 13 '24 02:06 santiagomed

https://github.com/cryscan/web-rwkv Here is an RWKV LLM inference implementation based on wgpu/Vulkan.

cgisky1980 avatar Jun 20 '24 01:06 cgisky1980

Maybe this fork will help if WebGPU support is requested: as part of a university project this semester, I developed a candle fork with a WGPU backend (https://github.com/KimHenrikOtte/candle/tree/wgpu_cleanup). I created implementations for most of the internal operations. Currently only sorting and a few operations in candle-nn are not implemented.

In the end, I managed to generate an image in the browser using a newly created wuerstchen wasm example. (I don't think this would be possible otherwise, as there is a 4 GB RAM limit and the model is too big to be held in WebAssembly memory.)

I used wgpu as the backend, so the implementations should also work natively with Vulkan, Metal or DirectX. All kernels are implemented by myself. I tried my best to make the shaders more or less performant, but for some shaders (e.g. matmul) it is quite hard to get the best performance in all scenarios.

Feature Support Table

| Feature | Support Status | Notes |
|---|---|---|
| **Data types** | | |
| f32 | ✅ Supported | |
| u32 | ✅ Supported | |
| u8 | ⚠️ Only output of Cmp | Only f32, i32 and u32 are available in a WebGPU shader |
| i64 | ⚠️ Supported (native only) | |
| f64 | ⚠️ Supported (native only) | |
| f16 | ⚠️ Only in quantized matrices | |
| bf16 | ❌ Not supported | |
| **Operations** | | All operations support non-contiguous arrays |
| Unary operations | ✅ Supported | |
| Binary operations | ✅ Supported | |
| MatMul | ✅ Supported | |
| Reduce operations | ✅ Supported | Sum, Min, Max (ArgMax/ArgMin work only if contiguous dimensions are reduced) |
| Conv2d | ✅ Supported | |
| Conv2dTranspose | ✅ Supported | |
| Conv1d | ✅ Supported | |
| Conv1dTranspose | ✅ Supported | |
| Index select | ✅ Supported | |
| Where_cond | ✅ Supported | |
| Pool2dMax | ✅ Supported | |
| Pool2dAvg | ✅ Supported | |
| Upsample | ✅ Supported | |
| Gather | ✅ Supported | |
| Scatter_add | ✅ Supported | |
| Index_add | ✅ Supported | |
| Quantized matrices | ✅ Supported | |
| ArgSort | ❌ Not implemented | |

KimHenrikOtte avatar Aug 18 '24 00:08 KimHenrikOtte

@LaurentMazare any chance that the proposed backend/webgpu kernels by @KimHenrikOtte can become integrated at some point?

sluijs avatar Aug 30 '24 21:08 sluijs

That would be a big (and, thanks to @KimHenrikOtte's work, maybe easy) win for candle, which already builds so easily on every platform and device (our app ships on android, vision, ios etc. and works well with candle) 👍

sinkingsugar avatar Aug 31 '24 03:08 sinkingsugar

@KimHenrikOtte I am interested in contributing to your project if you are willing to accept PRs to your repo. I ended up playing around with it to get CLIP working, with slight (draft) modifications here.

EdupugantiAkhil avatar Sep 01 '24 15:09 EdupugantiAkhil

@EdupugantiAkhil Sure, I would accept PRs to my repo. I could also add you as a collaborator if that would make things easier. I am currently investigating why performance is slower in the browser than natively.

It turned out that the "8x8 work per thread" matmul shader performs really badly (around 30 times slower than other shaders I tested; not quite sure why, maybe the compiled shader uses too many GPU registers).

KimHenrikOtte avatar Sep 01 '24 19:09 KimHenrikOtte

I did some more work on the branch. Currently the latest version is at: https://github.com/KimHenrikOtte/candle/tree/wgpu_cleanup Current state of the branch:

  • Wgpu backend added
  • Support for f32, u32, u8(Cmp and WhereCond), i64(native) and f64(native)
  • Support for all candle core operations, except asort
  • Support for softmax, rms-norm, layer-norm, sigmoid
  • New to_vecX_async, to_scalar_async and to_device_async functions added for use in the browser.
  • All candle-wasm-examples support the wgpu async implementation.
  • You can define your own WGSL shaders.
  • Converted all tests to WASM-compatible tests with a custom build.rs -> All tests pass
  • Added wgpu_basics.rs example to demonstrate how to use custom shaders.
  • Documented how to use the wgpu backend in the candle book.
  • All new dependencies are behind the wgpu feature.
  • Removed all debug/test files.
  • Fixed most clippy warnings.

Current issues:

  • There are 3 clippy warnings about a MutexGuard being held across an await point. I am not sure how to fix this, as the mutex needs to be accessed from both sync and async functions. I think as long as you are not awaiting multiple candle functions in parallel, this is not a problem.
  • Implementation/performance tested on Windows with NVIDIA GPU only.

KimHenrikOtte avatar Oct 29 '24 00:10 KimHenrikOtte

I was just thinking about this feature this morning, so I'm happy to hear it's being worked on. Being able to target WebGPU will unlock so many cool web demos 🤩

mrchantey avatar Oct 29 '24 21:10 mrchantey

I want to draw more attention to the WGPU fork: https://github.com/KimHenrikOtte/candle/tree/wgpu_cleanup

I do not have the resources to test all possible examples or hardware configurations, so I cannot guarantee that everything works perfectly across all scenarios. However, all changes are gated behind the wgpu feature flag, meaning the default functionality should remain unaffected for users who do not enable it. For those who enable the feature flag, this update provides an opportunity to experiment with and test the wgpu backend.

Do you think this branch is in a state where I could open a PR?

Performance Analysis

Although I couldn’t get CUDA running on my system for a proper comparison, the WGPU backend performs significantly better than the CPU implementation for larger models on my test hardware:

Test Hardware:

  • CPU: AMD Ryzen 7 5800X (8-Core Processor)
  • GPU: NVIDIA GeForce GTX 1080 Ti
  • RAM: 3200 MHz
| Model/Application | Native (WGPU) | Native (CPU) | Browser (WGPU) | Browser (CPU) |
|---|---|---|---|---|
| Llama2-c 15m | 367.04 tok/s (±10.08) | 327.56 tok/s (±21.27) | 117.49 tok/s | 85.72 tok/s |
| Llama2-c 42m | 230.13 tok/s (±1.41) | 64.35 tok/s (±1.68) | 109.76 tok/s | 47.80 tok/s |
| T5 | 135.62 tok/s (±10.54) | 88.38 tok/s (±6.41) | | |
| Clip | 724.92 ms (±182.59) | 636.55 ms (±218.33) | | |
| Metavoice | 45.66 s (±4.47 s) | 617.39 s | | |
| Stable Diffusion | 28.33 s (±0.30 s) | 338.66 s | 65 s | N/A |
| Wuerstchen | 135.37 s (±3.52 s) | >1740.00 s | 160 s | N/A |

updated at 8eb729f, with 10 runs for llama, t5, clip and 5 runs for metavoice, stable-diffusion and wuerstchen

Commands used to test the models:

```shell
cargo run --example llama2-c --release --features "candle-datasets,wgpu" -- inference --which-model stories15M.bin
cargo run --example llama2-c --release --features "candle-datasets,wgpu" -- inference --which-model stories42M.bin
cargo run --example t5 --release --features="wgpu" -- --model-id "t5-small" --prompt "translate to German: A beautiful candle is glowing at a christmas table. It is very cold outside and snowing." --decode
cargo run --example clip --release --features="wgpu" -- --images "examples/stable-diffusion/assets/stable-diffusion-xl.jpg,examples/yolo-v8/assets/bike.jpg" --sequences "a cycling race,a photo of two cats,a robot holding a candle"
cargo run --example metavoice --release --features=wgpu -- --prompt "This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model." --out-file "out_meta_wgpu.wav"
cargo run --example stable-diffusion --release --features=wgpu -- --prompt "Anthropomorphic cat dressed as a fire fighter" --sd-version v1-5 --final-image "out_sd_wgpu.png"
cargo run --example wuerstchen --release --features="wgpu" -- --prompt "realistic cat dressed as a fire fighter" --final-image "out_wu_wgpu.png"
```

CPU baseline runs (same models with `--cpu`):

```shell
cargo run --example llama2-c --release --features "candle-datasets,wgpu" -- --cpu inference --which-model stories15M.bin
cargo run --example llama2-c --release --features "candle-datasets,wgpu" -- --cpu inference --which-model stories42M.bin
cargo run --example t5 --release --features="wgpu" -- --model-id "t5-small" --prompt "translate to German: A beautiful candle is glowing at a christmas table. It is very cold outside and snowing." --decode --cpu
cargo run --example clip --release --features="wgpu" -- --images "examples/stable-diffusion/assets/stable-diffusion-xl.jpg,examples/yolo-v8/assets/bike.jpg" --sequences "a cycling race,a photo of two cats,a robot holding a candle" --cpu
cargo run --example metavoice --release --features=wgpu -- --prompt "This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model." --out-file "out_meta_cpu.wav" --cpu
cargo run --example stable-diffusion --release --features=wgpu -- --prompt "Anthropomorphic cat dressed as a fire fighter" --sd-version v1-5 --final-image "out_sd_cpu.png" --cpu
cargo run --example wuerstchen --release --features="wgpu" -- --prompt "realistic cat dressed as a fire fighter" --final-image "out_wu_cpu.png" --cpu
```

KimHenrikOtte avatar Jan 03 '25 02:01 KimHenrikOtte

I was able to test whisper-microphone on my A380 and it seems to be working. I did try stable diffusion, but predictably I wasn't able to generate anything, and when I tried passing --{width,height} 256 it managed to get all the way through until the last step, where it locked up. 192 and below "worked" insofar as they generated an image, but no prompt produced anything that resembled the prompt. Maybe f16 support in the future might help it be more usable? metavoice just caused my GPU to run at 16 MHz.

Quackdoc avatar Jan 03 '25 11:01 Quackdoc

I was able to test whisper-microphone on my A380 and it seems to be working. I did try stable diffusion, but predictably I wasn't able to generate anything, and when I tried passing --{width,height} 256 it managed to get all the way through until the last step, where it locked up. 192 and below "worked" insofar as they generated an image, but no prompt produced anything that resembled the prompt. Maybe f16 support in the future might help it be more usable? metavoice just caused my GPU to run at 16 MHz.

I suspect that Stable Diffusion's output quality degrades significantly at smaller image sizes. Even on the CPU, using dimensions of 256 seems to produce images that fail to align with the prompts. This seems to be a limitation of the model itself rather than an issue with the backend.

As for wgpu, it currently doesn’t support f16, as outlined in this GitHub issue.

KimHenrikOtte avatar Jan 03 '25 14:01 KimHenrikOtte

Do you think this branch is in a state where I could open a PR?

I think opening a PR would be a great next step; it will provide more clarity on the proposal and create a space where others can review and contribute, even if it's not yet merge-ready 😸

mrchantey avatar Jan 03 '25 23:01 mrchantey

@KimHenrikOtte wgpu f16 support has been merged!

Quackdoc avatar Mar 19 '25 16:03 Quackdoc

For the curious: https://github.com/gfx-rs/wgpu/pull/5701

Big thanks to @FL33TW00D @ErichDonGubler @cwfitzgerald and others who helped get it across the line <3

gabrielgrant avatar Mar 21 '25 18:03 gabrielgrant

I tried to use the wgpu_cleanup branch, but it tries to build cudarc, which doesn't work on an ARM MacBook... Edit: I was using https://github.com/philschmid/llama-candle-rs/blob/master/examples/llama.rs which enabled the flash-attn feature of candle-transformers, which is what pulls in CUDA.

theoparis avatar May 08 '25 04:05 theoparis