example: add quantized qwen3 wasm with SIMD optimizations
Adds a new WASM example for running quantized Qwen3-0.6B models in the browser with WebAssembly SIMD optimizations.
Features
- SIMD-optimized inference: Leverages WASM SIMD128 instructions for accelerated matrix operations
- GGUF quantization support: Q8_0 (default) and Q4_K_M quantization formats
- Interactive web interface: Real-time text generation with performance profiling
- Auto-download server: Python server with CLI for automatic model downloads from HuggingFace
Performance
In testing, Q8_0 delivers higher throughput than Q4_K_M despite its larger size:
- Q8_0: 9.5 tok/s (~645MB) @50 tokens
- Q4_K_M: 6.0 tok/s (~380MB) @50 tokens
Q8_0 is set as the default because it was the faster of the two formats in these tests.
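To make the trade-off concrete, the reported figures can be turned into ratios. The following is a minimal sketch that only does arithmetic on the numbers listed above; the helper names (`ratio`, `generation_secs`, `q8_vs_q4`) are illustrative and not part of the example's code, and the timing estimate assumes a steady generation rate with no prompt-processing or model-loading overhead.

```rust
/// Ratio of two measurements, e.g. throughput of A over throughput of B.
fn ratio(a: f64, b: f64) -> f64 {
    a / b
}

/// Seconds needed to generate `tokens` tokens at a steady rate of `tok_per_sec`.
/// Assumption: the rate is constant and startup cost is ignored.
fn generation_secs(tokens: u32, tok_per_sec: f64) -> f64 {
    f64::from(tokens) / tok_per_sec
}

/// Returns (throughput ratio, size ratio) of Q8_0 relative to Q4_K_M,
/// using the figures reported in this description.
fn q8_vs_q4() -> (f64, f64) {
    let throughput = ratio(9.5, 6.0); // tok/s: Q8_0 over Q4_K_M
    let size = ratio(645.0, 380.0); // approximate MB: Q8_0 over Q4_K_M
    (throughput, size)
}
```

With these numbers, Q8_0 is roughly 1.58x faster while being roughly 1.70x larger, and a 50-token generation takes about 5.3 s with Q8_0 versus about 8.3 s with Q4_K_M.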
Changes
- Added clear_kv_cache() method to quantized_qwen3::ModelWeights (required for interactive generation but not yet available in candle-nn KVCache)
- New example in candle-wasm-examples/quant-qwen3/
- Includes profiler, memory tracking, and generation controls
- Python server supports custom model paths and configurable ports
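Why interactive generation needs a cache reset can be illustrated with a toy model. This is a sketch under stated assumptions, not the code added in this PR: `ToyLayerKvCache` and `ToyModelWeights` are hypothetical stand-ins that store plain `Vec<f32>` values, whereas the real `quantized_qwen3::ModelWeights` holds tensors. Only the method name `clear_kv_cache` is taken from the description above.

```rust
/// Hypothetical stand-in for one layer's key/value cache.
/// The real cache holds tensors; plain vectors keep the sketch self-contained.
#[derive(Default)]
struct ToyLayerKvCache {
    keys: Vec<f32>,
    values: Vec<f32>,
}

impl ToyLayerKvCache {
    /// Append the key/value entry for one new sequence position.
    fn append(&mut self, k: f32, v: f32) {
        self.keys.push(k);
        self.values.push(v);
    }

    /// Drop all cached positions so the next prompt starts from scratch.
    fn reset(&mut self) {
        self.keys.clear();
        self.values.clear();
    }

    /// Number of sequence positions currently cached.
    fn len(&self) -> usize {
        self.keys.len()
    }
}

/// Hypothetical stand-in for a model that owns one cache per layer.
struct ToyModelWeights {
    layers: Vec<ToyLayerKvCache>,
}

impl ToyModelWeights {
    fn new(num_layers: usize) -> Self {
        let layers = (0..num_layers).map(|_| ToyLayerKvCache::default()).collect();
        Self { layers }
    }

    /// Pretend to process one token: every layer caches one more position.
    fn forward_step(&mut self, token: u32) {
        for layer in &mut self.layers {
            layer.append(token as f32, token as f32);
        }
    }

    /// Reset every layer's cache, mirroring the role of `clear_kv_cache()`.
    fn clear_kv_cache(&mut self) {
        for layer in &mut self.layers {
            layer.reset();
        }
    }

    /// Positions cached so far (all layers hold the same count).
    fn cached_positions(&self) -> usize {
        self.layers.first().map_or(0, |layer| layer.len())
    }
}
```

In this toy version, calling `forward_step` twice leaves two cached positions per layer, and `clear_kv_cache` brings the count back to zero. Without that reset, a second prompt in the same browser session would be decoded on top of the previous conversation's cached state.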
Usage
wasm-pack build --target web --release
./serve.py
# Open http://localhost:8080