
[Kernels] Migrate sampling to WebGPU

Open akaashrp opened this issue 1 month ago • 0 comments

Performance comparison with v0.2.79: decode throughput for "canonical" flows, averaged across 20 runs, with the following settings:

  • No logit_bias
  • No logitProcessor
  • Frequency, presence, and repetition penalties applied
  • logprobs enabled
  • No top_logprobs

v0.2.79 performance: ~38.17 decode tokens/s

Post-PR performance: ~38.99 decode tokens/s (roughly a 2% improvement)

Notes:

  1. The improvement is small, likely because of kernel launch overhead: sampling requires three separate kernel calls (fsoftmaxWithTemperature, fargsortProbs, fSampleWithTopP).
  2. The GPU sampling path will likely scale better when sampling from multiple sequences simultaneously.
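
For readers unfamiliar with the pipeline, the three stages named in note 1 can be sketched on the CPU roughly as follows. This is only an illustrative reference under assumptions, not the WebGPU kernels from this PR: the function names `softmaxWithTemperature`, `argsortProbs`, and `sampleWithTopP` are hypothetical TypeScript stand-ins, and the random draw `u` is passed in explicitly so the result is deterministic.

```typescript
// CPU reference sketch of the three sampling stages (hypothetical helpers,
// not the actual WebGPU kernels).

// Stage 1: softmax over temperature-scaled logits.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const scaled = logits.map((x) => x / temperature);
  // Subtract the maximum before exponentiating for numerical stability.
  const maxVal = scaled.reduce((a, b) => Math.max(a, b), -Infinity);
  const exps = scaled.map((x) => Math.exp(x - maxVal));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Stage 2: token indices ordered by descending probability.
function argsortProbs(probs: number[]): number[] {
  return probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
}

// Stage 3: top-p (nucleus) sampling. `u` is a uniform draw in [0, 1).
function sampleWithTopP(
  probs: number[],
  sortedIdx: number[],
  topP: number,
  u: number,
): number {
  // Find the smallest prefix of sorted tokens whose mass reaches topP.
  let mass = 0;
  let count = 0;
  while (count < sortedIdx.length) {
    mass += probs[sortedIdx[count]];
    count++;
    if (mass >= topP) break;
  }
  // Sample inside that prefix, using its mass as the normalizer.
  let target = u * mass;
  for (let i = 0; i < count; i++) {
    target -= probs[sortedIdx[i]];
    if (target < 0) return sortedIdx[i];
  }
  return sortedIdx[count - 1]; // guard against floating-point round-off
}

// Chains the three stages for a single sequence.
function sampleToken(logits: number[], temperature: number, topP: number, u: number): number {
  const probs = softmaxWithTemperature(logits, temperature);
  const sortedIdx = argsortProbs(probs);
  return sampleWithTopP(probs, sortedIdx, topP, u);
}
```

Each stage here corresponds to one kernel call in the GPU path, which is why a single-sequence decode step pays three launch overheads. A call such as `sampleToken(logits, 1.0, 0.9, Math.random())` would draw one token id.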

akaashrp, Nov 02 '25 05:11