Feature Request: Add KV-quant FA kernel variants for head sizes other than 128
Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Currently llama.cpp has many FlashAttention (FA) kernel variants for head size 128, but only a few for head sizes 64 and 256. As a result, using a quantized KV cache (`-ctk` other than `f16`) with a model whose head size is not 128 causes attention to fall back to the CPU.
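To make the limitation concrete, here is a minimal, self-contained C++ sketch that just encodes the dispatch behavior described above (my reading of it, not the actual llama.cpp dispatch code; the function and type names are illustrative):

```cpp
#include <cstdio>

// Illustrative KV-cache element types (names are made up for this sketch).
enum kv_type { KV_F16, KV_Q8_0, KV_Q4_0 };

// Encodes the behavior described above: with an f16 KV cache, FA variants exist
// for the common head sizes; with a quantized KV cache, only head size 128 is
// covered, so other head sizes fall back to the CPU.
static bool cuda_fa_variant_available(int head_size, kv_type type_k, kv_type type_v) {
    if (type_k == KV_F16 && type_v == KV_F16) {
        return true;
    }
    return head_size == 128;
}

int main() {
    const int head_sizes[] = {64, 128, 256};
    for (int hs : head_sizes) {
        std::printf("head size %3d, q8_0 KV cache -> %s\n",
                    hs, cuda_fa_variant_available(hs, KV_Q8_0, KV_Q8_0) ? "GPU FA kernel" : "CPU fallback");
    }
}
```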
Motivation
Llama 3.2 1B and Gemma 3 12B use head sizes 64 and 256 respectively, and both are quite popular models for some applications.
Possible Implementation
More kernel template instantiations need to be added. Since I understand that these extra templates will increase compilation time dramatically, a build flag such as GGML_CUDA_FA_ALL_KVQ_HS could be added to enable them on demand (see the sketch below).
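A minimal, self-contained C++ sketch of the compile-time gating idea (the template, the instantiation list, and the GGML_CUDA_FA_ALL_KVQ_HS flag are all illustrative assumptions, not llama.cpp's actual code):

```cpp
#include <cstdio>

enum kv_type { KV_F16, KV_Q8_0, KV_Q4_0 };

// Stand-in for a CUDA FA kernel templated on head size and KV-cache type.
template <int head_size, kv_type type_k>
void flash_attn_stub() {
    std::printf("FA variant compiled: head_size=%d, type_k=%d\n", head_size, (int) type_k);
}

// Variants that are always compiled (head size 128, as today).
template void flash_attn_stub<128, KV_Q8_0>();
template void flash_attn_stub<128, KV_Q4_0>();

// Extra variants for other head sizes, compiled only when the proposed
// (hypothetical) flag is defined, e.g. via -DGGML_CUDA_FA_ALL_KVQ_HS.
#ifdef GGML_CUDA_FA_ALL_KVQ_HS
template void flash_attn_stub<64,  KV_Q8_0>();
template void flash_attn_stub<64,  KV_Q4_0>();
template void flash_attn_stub<256, KV_Q8_0>();
template void flash_attn_stub<256, KV_Q4_0>();
#endif

int main() {
    flash_attn_stub<128, KV_Q8_0>();
#ifdef GGML_CUDA_FA_ALL_KVQ_HS
    flash_attn_stub<64, KV_Q8_0>();
    flash_attn_stub<256, KV_Q4_0>();
#endif
}
```

Keeping the extra instantiations behind an opt-in flag preserves the default build time while letting users of head-size-64/256 models get GPU FA with a quantized KV cache.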
This is very much needed.
This issue was closed because it has been inactive for 14 days since being marked as stale.