
example: add quantized qwen3 wasm with SIMD optimizations

Open DrJesseGlass opened this issue 1 month ago • 0 comments

Adds a new WASM example for running quantized Qwen3-0.6B models in the browser with WebAssembly SIMD optimizations.

Features

  • SIMD-optimized inference: Leverages WASM SIMD128 instructions for accelerated matrix operations
  • GGUF quantization support: Q8_0 (default) and Q4_K_M quantization formats
  • Interactive web interface: Real-time text generation with performance profiling
  • Auto-download server: Python server with CLI for automatic model downloads from HuggingFace
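
For readers unfamiliar with the Q8_0 format named above: in GGML/GGUF it stores weights in blocks of 32 signed 8-bit values that share one half-precision scale, so each block occupies 34 bytes (8.5 bits per weight). The following is a minimal Python sketch of dequantizing one such block. It only illustrates the layout and is not the SIMD kernel used in this example; the function and constant names are hypothetical.

```python
import struct

# Q8_0 block layout: one little-endian fp16 scale, then 32 signed 8-bit quants.
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 2 + Q8_0_BLOCK_VALUES  # 34 bytes per block
Q8_0_BITS_PER_WEIGHT = Q8_0_BLOCK_BYTES * 8 / Q8_0_BLOCK_VALUES  # 8.5


def dequantize_q8_0_block(block):
    """Return the 32 float weights encoded in a single 34-byte Q8_0 block.

    Hypothetical helper for illustration; each weight is scale * quant.
    """
    if len(block) != Q8_0_BLOCK_BYTES:
        raise ValueError(f"expected {Q8_0_BLOCK_BYTES} bytes, got {len(block)}")
    # "<e32b" = little-endian half-precision float followed by 32 int8 values.
    scale, *quants = struct.unpack("<e32b", block)
    return [scale * q for q in quants]


# Build a synthetic block: scale 0.5 and quants -16..15.
example_block = struct.pack("<e32b", 0.5, *range(-16, 16))
weights = dequantize_q8_0_block(example_block)
```

Because every weight needs only one multiply by a shared scale, this layout maps naturally onto wide integer and float vector operations, which may be part of why Q8_0 performs well under SIMD128.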

Performance

In testing, Q8_0 delivers higher throughput than Q4_K_M despite its larger file size:

  • Q8_0: 9.5 tok/s (~645 MB) at 50 tokens
  • Q4_K_M: 6.0 tok/s (~380 MB) at 50 tokens

Q8_0 is therefore set as the default.
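
To put the numbers above in perspective, here is a small arithmetic sketch. It assumes "50 tokens" refers to the length of the generation run and that throughput is constant over the run; both are assumptions, and the variable names are illustrative.

```python
# Reported figures from the benchmark above.
results = {
    "Q8_0": {"tok_per_s": 9.5, "size_mb": 645},
    "Q4_K_M": {"tok_per_s": 6.0, "size_mb": 380},
}
RUN_TOKENS = 50  # assumed length of the measured generation run

# Wall-clock time to generate the run at the reported rate.
seconds = {
    name: RUN_TOKENS / r["tok_per_s"] for name, r in results.items()
}

# Relative throughput and relative download size of Q8_0 versus Q4_K_M.
speedup = results["Q8_0"]["tok_per_s"] / results["Q4_K_M"]["tok_per_s"]
size_ratio = results["Q8_0"]["size_mb"] / results["Q4_K_M"]["size_mb"]

print(f"Q8_0:   {seconds['Q8_0']:.1f} s for {RUN_TOKENS} tokens")
print(f"Q4_K_M: {seconds['Q4_K_M']:.1f} s for {RUN_TOKENS} tokens")
print(f"Q8_0 is {speedup:.2f}x faster at {size_ratio:.2f}x the size")
```

Under these assumptions Q8_0 finishes a 50-token run in about 5.3 s versus about 8.3 s for Q4_K_M, at roughly 1.7x the download size.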

Changes

  • Added clear_kv_cache() method to quantized_qwen3::ModelWeights (required for interactive generation but not yet available in candle-nn KVCache)
  • New example in candle-wasm-examples/quant-qwen3/
  • Includes profiler, memory tracking, and generation controls
  • Python server supports custom model paths and configurable ports
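
As a rough illustration of the last bullet, a server with a configurable port and an optional custom model path could be structured as below. This is a sketch using only the Python standard library, not the contents of serve.py: the flag names `--port` and `--model`, the `WasmHandler` class, and the helper functions are all hypothetical, and the HuggingFace auto-download step is omitted. The default port of 8080 matches the URL in the Usage section.

```python
import argparse
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def build_parser():
    """CLI with hypothetical flag names; the real serve.py may differ."""
    parser = argparse.ArgumentParser(description="Serve the WASM example")
    parser.add_argument("--port", type=int, default=8080,
                        help="port to listen on")
    parser.add_argument("--model", type=Path, default=None,
                        help="path to a local GGUF file (skips any download)")
    return parser


class WasmHandler(SimpleHTTPRequestHandler):
    # Ensure .wasm files are served with the MIME type browsers expect
    # for streaming compilation.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".wasm": "application/wasm",
    }


def make_server(port, directory="."):
    """Create (but do not start) a threaded static-file server."""
    handler = functools.partial(WasmHandler, directory=str(directory))
    return ThreadingHTTPServer(("localhost", port), handler)


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.model is not None and not args.model.is_file():
        raise SystemExit(f"model file not found: {args.model}")
    server = make_server(args.port)
    print(f"Serving on http://localhost:{args.port}")
    server.serve_forever()
```

Calling `main(["--port", "9000"])` would start the sketch on port 9000; nothing runs at import time, so the parser and handler can be tested without opening a socket.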

Usage

wasm-pack build --target web --release
./serve.py
# Open http://localhost:8080

DrJesseGlass, Oct 31 '25 15:10