
[Draft] Enable CPU multithreading in WASM with Rayon

Open lucky-bai opened this issue 3 months ago • 0 comments

Motivation

Candle's WASM build currently runs on a single CPU thread, which makes it significantly slower than it could be. This PR provides a working demo of multithreaded WASM support in the phi model example by integrating wasm-bindgen-rayon to leverage the existing Rayon-based parallelism in the CPU backend.

Similar libraries, such as Transformers.js, already support multithreading on CPU, so this work should help bring Candle’s WASM performance closer to parity. See also this discussion on other attempts to run the Moshi 1B STT model in WASM faster than real time.

This is an experimental but functional demo: on my MacBook Pro, running the Phi-1.5 Q4_K model, throughput improved by about 3×, from ~5 tokens/sec to ~16 tokens/sec.

Risks and Limitations

  1. The wasm-bindgen-rayon dependency relies on several Rust features that are not yet available on the stable channel, so the build only works with a nightly Rust toolchain.
  2. It also requires the hosting server to send the Cross-Origin-Opener-Policy (COOP) and Cross-Origin-Embedder-Policy (COEP) headers, which browsers need before they expose the SharedArrayBuffer used for multithreading. Once those headers are set, external resources such as Tailwind loaded from a CDN would otherwise be blocked, so loading them requires workarounds.
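
To make the header requirement in item 2 concrete, here is a minimal, dependency-free sketch of an HTTP response carrying the two headers. It is not code from this PR: the helper name `isolated_response` and the constant `ISOLATION_HEADERS` are hypothetical, and the values `same-origin` and `require-corp` are the commonly used ones for cross-origin isolation, assumed here rather than taken from the demo's server setup.

```rust
/// Header pairs assumed to be needed for cross-origin isolation, which
/// browsers require before exposing SharedArrayBuffer to a page.
/// (Illustrative constant, not taken from the PR.)
const ISOLATION_HEADERS: [(&str, &str); 2] = [
    ("Cross-Origin-Opener-Policy", "same-origin"),
    ("Cross-Origin-Embedder-Policy", "require-corp"),
];

/// Hypothetical helper: builds a complete HTTP/1.1 200 response for `body`
/// that includes the isolation headers above.
fn isolated_response(content_type: &str, body: &str) -> String {
    let mut out = String::from("HTTP/1.1 200 OK\r\n");
    out.push_str(&format!("Content-Type: {}\r\n", content_type));
    // Content-Length counts bytes, which is what `str::len` returns.
    out.push_str(&format!("Content-Length: {}\r\n", body.len()));
    for (name, value) in ISOLATION_HEADERS {
        out.push_str(&format!("{}: {}\r\n", name, value));
    }
    // A blank line separates the headers from the body.
    out.push_str("\r\n");
    out.push_str(body);
    out
}
```

Calling `isolated_response("text/html", "<h1>ok</h1>")` yields a response string that a small static server could write to a socket. With `require-corp` in place, a cross-origin resource is only loaded if it is served with CORS or a suitable `Cross-Origin-Resource-Policy` header, which is likely why CDN-hosted assets need a workaround.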

Despite these limitations, adding multithreading to a WASM model is feasible with minimal code changes, and the performance gains are substantial, so I think it would be worth adding support officially under some kind of experimental feature flag.

lucky-bai, Aug 23 '25 17:08