llama.cpp
llama : switch to floating-point token positions
Change llama_pos from int32_t to float
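In terms of the public API, a minimal sketch of what this amounts to (declarations are simplified here; only the position type itself is the actual change):

```cpp
#include <cstdint>

// llama.h (sketch) -- the position type switches from an integer to a float
// typedef int32_t llama_pos;   // before: positions are whole token indices
typedef float      llama_pos;   // after:  positions may take fractional values

// Everything that carries positions (batches, KV cells, RoPE inputs) keeps the
// same interface -- only the underlying type changes, e.g.:
struct llama_batch_sketch {
    int32_t     n_tokens;
    llama_pos * pos;     // one (now floating-point) position per token
};
```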
This change might seem unnecessary at first since we are used to thinking about token positions as integers, but technically nothing prevents them from being floats. Also, I have some ideas for KV cache compression / context extension tricks where float positions could turn out to be useful.
Still contemplating if we should merge this, so for now just a draft
+1 for this. I'm wondering if it could help simplify the code of group attention (self-extend)
Not sure if it will become simpler, but one of the things I want to investigate is applying floating-point division in llama_kv_cache_seq_div() instead of the current integer division. Intuitively, I expect this to improve the recall quality
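Roughly along these lines (a simplified sketch, not the actual implementation -- the cell layout and helper name are hypothetical):

```cpp
#include <vector>

// Hypothetical, simplified KV cell -- just enough to illustrate the division.
struct kv_cell_sketch {
    float pos;   // with llama_pos as float, positions can hold fractions
};

// Sketch of what llama_kv_cache_seq_div-style scaling could look like.
// With integer positions, pos /= d truncates (e.g. 7 / 4 -> 1); with float
// positions, 7 / 4 -> 1.75, so the relative distances between tokens inside
// a group are preserved more faithfully.
static void seq_div_sketch(std::vector<kv_cell_sketch> & cells, float p0, float p1, int d) {
    for (auto & cell : cells) {
        if (cell.pos >= p0 && cell.pos < p1) {
            cell.pos /= d;   // floating-point division, no truncation
        }
    }
}
```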
The other idea I want to explore is merging KV cells into one another by averaging both the positions and the KV values. I'm wondering if this can be applied to compress the KV cache data into fewer cells
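Something like this, as a rough sketch (the cell structure and the plain averaging are just illustrative assumptions, not a worked-out scheme):

```cpp
#include <vector>

// Hypothetical cell holding a position and its cached K/V data.
struct kv_cell_merge_sketch {
    float              pos;
    std::vector<float> k;   // cached key vector
    std::vector<float> v;   // cached value vector
};

// Merge cell b into cell a by averaging the positions and the K/V values,
// freeing up one cell -- the idea being that nearby tokens could share a
// single "average" entry to compress the cache. Fractional positions only
// make sense once llama_pos is a float.
static void merge_cells_sketch(kv_cell_merge_sketch & a, const kv_cell_merge_sketch & b) {
    a.pos = 0.5f*(a.pos + b.pos);
    for (size_t i = 0; i < a.k.size(); ++i) {
        a.k[i] = 0.5f*(a.k[i] + b.k[i]);
        a.v[i] = 0.5f*(a.v[i] + b.v[i]);
    }
}
```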