candle

Published 20 hours ago •


example: add quantized qwen3 wasm with SIMD optimizations

Open DrJesseGlass opened this issue 1 month ago • 0 comments

Adds a new WASM example for running quantized Qwen3-0.6B models in the browser with WebAssembly SIMD optimizations.

Features

SIMD-optimized inference: Leverages WASM SIMD128 instructions for accelerated matrix operations
GGUF quantization support: Q8_0 (default) and Q4_K_M quantization formats
Interactive web interface: Real-time text generation with performance profiling
Auto-download server: Python server with CLI for automatic model downloads from HuggingFace
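To make the Q8_0 format mentioned above more concrete, here is a minimal sketch of block-wise 8-bit quantization in the style GGUF uses for Q8_0: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit values. This is an illustration, not the candle implementation. The names BlockQ8_0, quantize_block, and dequantize_block are chosen for this sketch, and the scale is kept as f32 to avoid extra dependencies, whereas the on-disk format stores it as f16 (34 bytes per block, about 8.5 bits per weight).

```rust
/// Number of weights per Q8_0 block.
const QK8_0: usize = 32;

/// Simplified Q8_0 block: one scale plus 32 signed 8-bit quants.
/// Assumption: f32 scale for this sketch; the file format uses f16.
struct BlockQ8_0 {
    scale: f32,
    quants: [i8; QK8_0],
}

/// Quantize 32 floats: scale = max|x| / 127, q = round(x / scale).
fn quantize_block(xs: &[f32; QK8_0]) -> BlockQ8_0 {
    let amax = xs.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = amax / 127.0;
    // An all-zero block gets scale 0; avoid dividing by it.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut quants = [0i8; QK8_0];
    for (q, x) in quants.iter_mut().zip(xs.iter()) {
        *q = (*x * inv).round() as i8;
    }
    BlockQ8_0 { scale, quants }
}

/// Recover approximate floats: x ≈ scale * q.
fn dequantize_block(block: &BlockQ8_0) -> [f32; QK8_0] {
    let mut out = [0.0f32; QK8_0];
    for (o, q) in out.iter_mut().zip(block.quants.iter()) {
        *o = block.scale * f32::from(*q);
    }
    out
}
```

The inner loops in a layout like this are simple element-wise multiply-accumulate operations over fixed-size blocks, which is the kind of work SIMD128 instructions are suited to.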

Performance

In testing, Q8_0 delivers higher throughput than Q4_K_M despite its larger model file:

Q8_0: 9.5 tok/s over 50 tokens (~645 MB)
Q4_K_M: 6.0 tok/s over 50 tokens (~380 MB)

Q8_0 is set as the default because it had the higher throughput in these tests.
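The tok/s figures above are a ratio of generated tokens to elapsed time. The sketch below shows that calculation; tokens_per_second is a hypothetical helper, not the profiler shipped with the example. It takes the elapsed time as a Duration rather than measuring it, on the assumption that a browser build would obtain timings from the JavaScript side (for example performance.now()), since std::time::Instant is not usable on the wasm32-unknown-unknown target.

```rust
use std::time::Duration;

/// Throughput in tokens per second for a finished generation run.
/// Returns 0.0 when no time has elapsed, to avoid dividing by zero.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        return 0.0;
    }
    tokens as f64 / secs
}
```

By this measure, 50 tokens at 9.5 tok/s corresponds to roughly 5.3 seconds of generation, and 50 tokens at 6.0 tok/s to roughly 8.3 seconds.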

Changes

Added clear_kv_cache() method to quantized_qwen3::ModelWeights (required for interactive generation but not yet available in candle-nn KVCache)
New example in candle-wasm-examples/quant-qwen3/
Includes profiler, memory tracking, and generation controls
Python server supports custom model paths and configurable ports
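To show why interactive generation needs a cache reset, here is a toy sketch of the idea behind clear_kv_cache(). It does not reproduce quantized_qwen3::ModelWeights: ToyModel, LayerKvCache, step, and cached_tokens are hypothetical names, and plain Vec<f32> buffers stand in for the key and value tensors. The point is only that each layer accumulates state across decode steps, so a new prompt must start from an empty cache.

```rust
/// Hypothetical per-layer cache: keys and values accumulated across decode steps.
#[derive(Default)]
struct LayerKvCache {
    keys: Vec<f32>,
    values: Vec<f32>,
}

/// Toy stand-in for a model that owns one cache per layer.
struct ToyModel {
    layers: Vec<LayerKvCache>,
}

impl ToyModel {
    fn new(num_layers: usize) -> Self {
        let layers = (0..num_layers).map(|_| LayerKvCache::default()).collect();
        Self { layers }
    }

    /// Stand-in for a decode step: appends one key/value entry per layer.
    fn step(&mut self, token: f32) {
        for layer in &mut self.layers {
            layer.keys.push(token);
            layer.values.push(token);
        }
    }

    /// Number of positions currently held in the cache.
    fn cached_tokens(&self) -> usize {
        self.layers.first().map_or(0, |l| l.keys.len())
    }

    /// Drop all cached state so the next prompt starts at position 0.
    fn clear_kv_cache(&mut self) {
        for layer in &mut self.layers {
            layer.keys.clear();
            layer.values.clear();
        }
    }
}
```

Without such a reset, a second prompt in the same session would attend to keys and values left over from the previous one.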

Usage

wasm-pack build --target web --release
./serve.py
# Open http://localhost:8080

Oct 31 '25 15:10 DrJesseGlass