candle
WIP: Precompile metal kernels into `.metallib` files
I'm running into an issue where the first call to `apply_repeat_penalty` takes a very long time (in excess of 6 seconds). The time seems to be spent in the `Tensor::to_vec1` call that moves the logits into a `Vec<f32>`. I would expect a simple copy like this to be very fast.
It was suggested on Discord that this slowness might be due to other asynchronous work happening in Metal, possibly the compilation of the kernels on first load.
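One way to check whether the cost is a one-time setup rather than the copy itself is to time the first and second calls separately. The sketch below is not candle code: `Device`, `expensive_setup`, and `copy_out` are hypothetical stand-ins, and the 50 ms sleep only simulates a lazy first-use cost such as kernel compilation.

```rust
use std::sync::OnceLock;
use std::time::{Duration, Instant};

// Hypothetical stand-in for a device that lazily prepares its kernels on first use.
struct Device {
    pipeline: OnceLock<u64>,
}

// Simulated one-time setup cost (for example, compiling kernels).
fn expensive_setup() -> u64 {
    std::thread::sleep(Duration::from_millis(50));
    42
}

impl Device {
    fn new() -> Self {
        Device { pipeline: OnceLock::new() }
    }

    // Hypothetical stand-in for copying a tensor into a Vec<f32>.
    // The setup runs only on the first call; later calls reuse the result.
    fn copy_out(&self, data: &[f32]) -> Vec<f32> {
        self.pipeline.get_or_init(expensive_setup);
        data.to_vec()
    }
}

// Time a single call so the first and later calls can be compared.
fn time_call(device: &Device, data: &[f32]) -> Duration {
    let start = Instant::now();
    let out = device.copy_out(data);
    let elapsed = start.elapsed();
    assert_eq!(out.len(), data.len());
    elapsed
}

fn main() {
    let device = Device::new();
    let data = vec![0.0f32; 32_000];
    let first = time_call(&device, &data);
    let second = time_call(&device, &data);
    println!("first call: {first:?}, second call: {second:?}");
}
```

If the real code shows the same pattern (a slow first call followed by fast ones), the time is likely going to lazy initialization rather than to the copy.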
This PR precompiles the kernels at build time instead of on every run. Unfortunately, it doesn't seem to solve my problem but it might be useful for other reasons.
Looking at #2322, I will likely need to reconfigure the build script to optionally produce iOS `.metallib` files.
@LaurentMazare, is this something you would potentially want to merge? If so, I can clean it up. Otherwise, we can close it.
Going to close this due to inactivity. If it's something we feel we need in the future, this should serve as a good starting point.
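For context, a build-time precompile step of this kind could look roughly like the sketch below. It is not the build script from this PR: the helper names, the `src/kernels.metal` path, and the output file names are assumptions for illustration. The two commands it builds follow Apple's documented command-line flow (`xcrun -sdk <sdk> metal -c` to produce an `.air` file, then `xcrun -sdk <sdk> metallib` to link it), and switching the SDK string from `macosx` to `iphoneos` is how an iOS library would be targeted.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

// Arguments for `xcrun` to compile one `.metal` source into an `.air` file.
fn compile_args(sdk: &str, src: &Path, air: &Path) -> Vec<String> {
    vec![
        "-sdk".into(),
        sdk.into(),
        "metal".into(),
        "-c".into(),
        src.display().to_string(),
        "-o".into(),
        air.display().to_string(),
    ]
}

// Arguments for `xcrun` to link `.air` files into a single `.metallib`.
fn link_args(sdk: &str, airs: &[PathBuf], lib: &Path) -> Vec<String> {
    let mut args: Vec<String> = vec!["-sdk".into(), sdk.into(), "metallib".into()];
    args.extend(airs.iter().map(|p| p.display().to_string()));
    args.push("-o".into());
    args.push(lib.display().to_string());
    args
}

// Run `xcrun` with the given arguments and report whether it succeeded.
fn run_xcrun(args: &[String]) -> std::io::Result<bool> {
    Ok(Command::new("xcrun").args(args).status()?.success())
}

fn main() {
    // Cargo sets OUT_DIR for build scripts; the fallback is only for standalone runs.
    let out_dir = PathBuf::from(std::env::var("OUT_DIR").unwrap_or_else(|_| "target".into()));
    let sdk = "macosx"; // assumption: use "iphoneos" when targeting iOS
    let src = Path::new("src/kernels.metal"); // hypothetical source path
    let air = out_dir.join("kernels.air");
    let lib = out_dir.join("kernels.metallib");

    let compile = compile_args(sdk, src, &air);
    let link = link_args(sdk, &[air.clone()], &lib);
    println!("xcrun {}", compile.join(" "));
    println!("xcrun {}", link.join(" "));

    // Only invoke the toolchain where it can exist and the source is present.
    if cfg!(target_os = "macos") && src.exists() {
        let ok = run_xcrun(&compile).unwrap_or(false) && run_xcrun(&link).unwrap_or(false);
        println!("precompile succeeded: {ok}");
    }
}
```

Keeping the argument construction separate from the process invocation makes the SDK choice easy to test and to switch per target.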