[Kernels] Migrate sampling to WebGPU

Open akaashrp opened this issue 1 month ago • 0 comments

Performance comparison with v0.2.79 ("canonical" flows, averaged across 20 runs):

v0.2.79 performance: ~38.17 decode tokens/s
Post-PR performance: ~38.99 decode tokens/s

Notes:

The improvement is small (about 2%), likely because of kernel launch overhead: sampling now requires three separate kernel calls (fsoftmaxWithTemperature, fargsortProbs, fSampleWithTopP).
The GPU path will likely scale better when sampling from multiple sequences simultaneously.
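
For reference, the three sampling stages named above can be sketched on the CPU as follows. This is only an illustrative sketch, not the WebGPU kernels from this PR: the Python function names mirror the kernel names, but the bodies are an assumption about what such kernels conventionally compute (temperature-scaled softmax, descending argsort, nucleus sampling), and `sample_token` is a hypothetical wrapper. On the GPU, each stage would be its own dispatch, which is where the launch overhead comes from.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Stage 1: scale logits by 1/temperature and normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argsort_probs(probs):
    """Stage 2: token indices ordered by descending probability."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def sample_with_top_p(probs, order, top_p, u):
    """Stage 3: sample from the shortest sorted prefix whose mass reaches top_p.

    `u` is a uniform random number in [0, 1).
    """
    # Build the nucleus: shortest prefix with cumulative mass >= top_p.
    mass = 0.0
    nucleus = []
    for idx in order:
        nucleus.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break
    # Inverse-CDF sampling inside the nucleus, renormalized by its mass.
    threshold = u * mass
    running = 0.0
    for idx in nucleus:
        running += probs[idx]
        if running > threshold:
            return idx
    return nucleus[-1]  # guard against floating-point round-off


def sample_token(logits, temperature, top_p, rng=random):
    """Hypothetical wrapper: run the three stages back to back."""
    probs = softmax_with_temperature(logits, temperature)
    order = argsort_probs(probs)
    return sample_with_top_p(probs, order, top_p, rng.random())
```

For example, `sample_token([1.0, 5.0, 2.0], temperature=0.7, top_p=0.9)` returns a token index in `range(3)`. A batched version would run each stage once over all sequences, so the per-stage launch cost would be shared across the batch.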

Nov 02 '25 05:11 akaashrp