
Feature Request: Add kv-quant fa kernel variants for head sizes other than 128

Open · pl752 opened this issue 8 months ago · 1 comment

Prerequisites

  • [x] I am running the latest code. Mention the version if possible as well.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Currently llama.cpp has many variants of the FA (flash attention) kernel for head size (hs) 128, but only a few for hs 64 and 256. This causes a fallback to the CPU when using -ctk != f16 with models whose head size is not 128.
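
The snippet below is an illustrative sketch only, not llama.cpp's actual dispatch code: `kv_type` and `fa_kernel_available` are hypothetical names used to show the kind of check behind the fallback described above, i.e. a quantized KV cache (e.g. `-ctk q8_0`) combined with hs 64 or 256 has no compiled GPU kernel.

```cpp
// Illustrative sketch only -- not the actual llama.cpp dispatch logic.
// fa_kernel_available and kv_type are hypothetical; they model why a
// quantized KV cache with head size 64 or 256 currently ends up on the CPU:
// no matching fused FA kernel was instantiated for that combination.
#include <cstdio>

enum class kv_type { F16, Q8_0, Q4_0 };  // subset of possible -ctk/-ctv types

static bool fa_kernel_available(int head_size, kv_type t) {
    if (t == kv_type::F16) {
        // f16 KV cache: kernels exist for the common head sizes
        return head_size == 64 || head_size == 128 || head_size == 256;
    }
    // quantized KV cache: assume only head size 128 variants were compiled
    return head_size == 128;
}

int main() {
    const int     head_size  = 64;             // e.g. Llama 3.2 1B
    const kv_type cache_type = kv_type::Q8_0;  // e.g. -ctk q8_0

    if (!fa_kernel_available(head_size, cache_type)) {
        std::printf("no GPU FA kernel for hs=%d with quantized KV cache -> CPU fallback\n",
                    head_size);
    }
    return 0;
}
```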

Motivation

Llama 3.2 1B and Gemma 3 12B use hs 64 and 256 respectively, and these are quite popular models for some applications.

Possible Implementation

More kernel templates need to be added. A flag such as GGML_CUDA_FA_ALL_KVQ_HS could also be added to opt into these templates, since I understand that compiling them will increase build times dramatically; see the sketch below.
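
As a rough sketch of the opt-in (the kernel name `flash_attn_ext_kvq` and its parameter list are made up for illustration; only `ggml_type` and `GGML_TYPE_*` come from ggml.h), the extra head-size instantiations would be compiled only when the proposed GGML_CUDA_FA_ALL_KVQ_HS flag is defined, so default build times stay unchanged:

```cpp
// Hypothetical sketch, not the real llama.cpp CUDA sources.
#include "ggml.h"  // for ggml_type / GGML_TYPE_*

template <int head_size, ggml_type type_K, ggml_type type_V>
__global__ void flash_attn_ext_kvq(const char * Q, const char * K, const char * V, float * dst) {
    // kernel body omitted in this sketch
}

// head size 128 variants stay unconditional, as today:
template __global__ void flash_attn_ext_kvq<128, GGML_TYPE_Q8_0, GGML_TYPE_Q8_0>(const char *, const char *, const char *, float *);

#ifdef GGML_CUDA_FA_ALL_KVQ_HS
// opt-in variants for the other head sizes mentioned above:
template __global__ void flash_attn_ext_kvq< 64, GGML_TYPE_Q8_0, GGML_TYPE_Q8_0>(const char *, const char *, const char *, float *);
template __global__ void flash_attn_ext_kvq<256, GGML_TYPE_Q8_0, GGML_TYPE_Q8_0>(const char *, const char *, const char *, float *);
#endif
```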

pl752 · Apr 17 '25

This is very much needed.

betweenus · Apr 25 '25

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Jun 09 '25