Gaurav Garg
@ggerganov @slaren @agray3 I'm interested in reducing the CPU overhead associated with building the GGML graph and would like to follow up on this PR. In particular, I'd like to...
Thanks @ggerganov for the quick response. This is exactly what I was proposing above: "A potential solution is to introduce a specialized copy operator for the KV cache that fuses...
Thanks, this makes sense. Do we need a specialized function to handle the transposed v-cache, or will `ggml_set_rows` be enough? See this part of the code: https://github.com/ggml-org/llama.cpp/blob/7675c555a13c9f473249e59a54db35032ce8e0fc/src/llama-kv-cache-unified.cpp#L668-L673 Update: Never mind, I...