Where is `apply_penalty_inplace_kernel` used?
Hi,
I'm studying the performance and safety of WebGPU under different workloads, and as part of that I've been characterizing WebLLM (btw, this is awesome work and a great resource, thanks for building it!).
I had a question about one particular kernel, `apply_penalty_inplace_kernel`. I see that it's compiled when the model is loaded, but I don't see it being run during inference. Is there a setting I'm missing that causes this kernel to run, or are there certain models it runs for and not others?
Hopefully this is the right place to ask, and thanks in advance for any info!
Thanks for your interest in WebLLM!
This kernel is indeed not used in WebLLM yet: WebLLM currently applies penalties on the CPU rather than with a GPU kernel.
Moving this work to the GPU is something we want to optimize.
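To illustrate what "applying penalties on the CPU" typically involves, here is a minimal sketch of an in-place pass over the logits before sampling. It is not WebLLM's actual code: the function name `applyPenaltiesInPlace`, its parameters, and the token-count map are hypothetical, and the formulas assumed are the common OpenAI-style presence/frequency penalties plus a CTRL-style repetition penalty. WebLLM's real logic may differ in order or detail.

```typescript
// Hypothetical CPU-side penalty pass; NOT WebLLM's implementation.
// Assumes tokenCounts maps token id -> number of times it has appeared so far.
function applyPenaltiesInPlace(
  logits: Float32Array,
  tokenCounts: Map<number, number>,
  presencePenalty: number,
  frequencyPenalty: number,
  repetitionPenalty: number,
): void {
  tokenCounts.forEach((count, tokenId) => {
    // Only tokens that have actually appeared (and are in range) are penalized.
    if (count <= 0 || tokenId < 0 || tokenId >= logits.length) return;
    let logit = logits[tokenId];
    // CTRL-style repetition penalty: shrink positive logits,
    // push negative logits further down.
    if (repetitionPenalty !== 1.0) {
      logit = logit > 0 ? logit / repetitionPenalty : logit * repetitionPenalty;
    }
    // Frequency penalty scales with the count; presence penalty is a flat
    // offset applied once for any token that has appeared.
    logit -= count * frequencyPenalty + presencePenalty;
    logits[tokenId] = logit;
  });
}
```

Because this loop touches only the tokens that have already appeared, it is cheap per step; the main cost of doing it on the CPU is reading the logits back from the GPU before sampling, which is presumably what a GPU kernel like `apply_penalty_inplace_kernel` would avoid.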