web-llm
[Kernels] Migrate sampling to WebGPU
Performance comparison with v0.2.79: decode throughput for "canonical" flows, averaged across 20 runs, with the following settings:
- No logit_bias
- No logitProcessor
- Frequency, presence, and repetition penalties applied
- logprobs enabled
- No top_logprobs
- v0.2.79 performance: ~38.17 decode tokens/s
- Post-PR performance: ~38.99 decode tokens/s
Notes:
- The improvement is small (about 2%), likely because of kernel launch overhead: sampling requires launching three kernels (fsoftmaxWithTemperature, fargsortProbs, fSampleWithTopP).
- The GPU sampling path will likely scale better when sampling from multiple sequences simultaneously.
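
For reviewers who want a mental model of the three stages, here is a minimal CPU reference sketch in TypeScript. It is not the code in this PR: the function names (`softmaxWithTemperature`, `argsortDescending`, `sampleTopP`, `sampleToken`) are hypothetical stand-ins for the three kernels named above, and the exact top-p cutoff and renormalization rules are assumptions. The uniform random draw `u` is passed in explicitly so the result is deterministic.

```typescript
// CPU reference sketch of a three-stage sampling pipeline.
// All names here are hypothetical; they only mirror the roles of the
// three GPU kernels and are not web-llm's API.

// Stage 1: temperature-scaled softmax (assumes temperature > 0).
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const scaled = logits.map((x) => x / temperature);
  const maxVal = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((x) => Math.exp(x - maxVal));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Stage 2: token indices sorted by descending probability.
function argsortDescending(probs: number[]): number[] {
  return probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
}

// Stage 3: top-p (nucleus) sampling with a uniform draw u in [0, 1).
function sampleTopP(probs: number[], sortedIdx: number[], topP: number, u: number): number {
  // Find the smallest prefix of sorted tokens whose cumulative mass reaches topP.
  let mass = 0;
  let nucleusSize = sortedIdx.length;
  for (let i = 0; i < sortedIdx.length; i++) {
    mass += probs[sortedIdx[i]];
    if (mass >= topP) {
      nucleusSize = i + 1;
      break;
    }
  }
  // Scale u by the nucleus mass instead of renormalizing every probability.
  const target = u * mass;
  let running = 0;
  for (let i = 0; i < nucleusSize; i++) {
    running += probs[sortedIdx[i]];
    if (target < running) {
      return sortedIdx[i];
    }
  }
  // Fallback for floating-point rounding at the upper edge.
  return sortedIdx[nucleusSize - 1];
}

// Runs the three stages back to back, as one decode step would.
function sampleToken(logits: number[], temperature: number, topP: number, u: number): number {
  const probs = softmaxWithTemperature(logits, temperature);
  const sortedIdx = argsortDescending(probs);
  return sampleTopP(probs, sortedIdx, topP, u);
}
```

In a real decode loop `u` would come from a random number generator, for example `sampleToken(logits, 0.7, 0.95, Math.random())`. Each stage depends on the previous stage's output, which is consistent with the note above that one sampling step costs three sequential kernel launches.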