
llama : switch to floating-point token positions

Open · ggerganov opened this issue 1 year ago · 2 comments

Change llama_pos from int32_t to float

This change might seem unnecessary at first since we are used to thinking about token positions as integers, but technically nothing prevents them from being floats. Also, I have some ideas for KV cache compression / context extension tricks where float positions could turn out to be useful.
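In essence the change flips the `llama_pos` typedef. The sketch below also illustrates why a fractional position is mathematically well-defined: RoPE already evaluates sin/cos of `pos * theta` in floating point. The `rope_pair` helper here is only an illustration, not code from the repo:

```cpp
#include <cmath>
#include <cstdint>

// Proposed change (sketch):
// typedef int32_t llama_pos;   // current
typedef float llama_pos;        // proposed

// Minimal illustration of a single RoPE pair rotation at a (possibly
// fractional) position; theta_d is the per-dimension frequency.
static void rope_pair(float & x0, float & x1, llama_pos pos, float theta_d) {
    const float a = pos * theta_d;
    const float c = std::cos(a);
    const float s = std::sin(a);
    const float t0 = x0*c - x1*s;
    const float t1 = x0*s + x1*c;
    x0 = t0;
    x1 = t1;
}
```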

Still contemplating whether we should merge this, so for now it is just a draft.

ggerganov avatar Feb 23 '24 10:02 ggerganov

+1 for this. I'm wondering if it helps simplify the code for group attention (self-extend).

ngxson avatar Feb 23 '24 12:02 ngxson

Not sure if it will become simpler, but one of the things I want to investigate is applying floating-point division in llama_kv_cache_seq_div() instead of the current integer division. Intuitively, I expect this to improve recall quality.
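A minimal sketch of the idea (not the actual `llama_kv_cache_seq_div()` implementation): dividing cached positions by a factor `d`, once with the current integer truncation and once with floating-point positions, to show that float positions keep nearby tokens distinct instead of collapsing them onto the same value:

```cpp
#include <cstdio>

int main() {
    const int d       = 4;              // division factor, e.g. a self-extend group size
    const int pos_i[] = {4, 5, 6, 7};   // example cached positions

    for (int p : pos_i) {
        const int   q_int   = p / d;            // current: truncates, 4..7 all map to 1
        const float q_float = (float) p / d;    // proposed: 1.00, 1.25, 1.50, 1.75
        printf("pos %d -> int %d, float %.2f\n", p, q_int, q_float);
    }
    return 0;
}
```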

The other idea I want to explore is merging KV cells into one another by averaging both the positions and the KV values. I'm wondering if this can be applied to compress the KV cache data into fewer cells.
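A minimal sketch, under the assumption that a KV cell can be reduced to a position plus its K and V vectors (the real llama.cpp cache layout is more involved); the `kv_cell` struct and `merge_cells` helper are hypothetical. Two cells are merged by averaging both the positions and the K/V data, halving the number of cells, which is only possible if the position field can hold non-integer values:

```cpp
#include <vector>

struct kv_cell {
    float              pos;  // floating-point token position (the proposal)
    std::vector<float> k;    // key vector stored in this cell
    std::vector<float> v;    // value vector stored in this cell
};

// Merge two cells into one by averaging positions and K/V data.
static kv_cell merge_cells(const kv_cell & a, const kv_cell & b) {
    kv_cell out;
    out.pos = 0.5f*(a.pos + b.pos);          // averaged position is generally fractional
    out.k.resize(a.k.size());
    out.v.resize(a.v.size());
    for (size_t i = 0; i < a.k.size(); ++i) {
        out.k[i] = 0.5f*(a.k[i] + b.k[i]);   // average the keys
        out.v[i] = 0.5f*(a.v[i] + b.v[i]);   // average the values
    }
    return out;
}
```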

ggerganov avatar Feb 23 '24 13:02 ggerganov