llama : refactor llama_kv_cache, llama_context and llm_build_context
Overview
This PR is an intermediate step towards a more generic implementation that will support different underlying implementations of llama_kv_cache, llama_context and the graph building logic (a.k.a. llm_build_context). The llama_kv_cache is also introduced in the public API as an object, but its actual functionality is yet to be defined in follow-up PRs.
No functional changes have been introduced so far - the code has mainly been reorganized to allow implementing new abstractions. The main changes in the implementation are:
- Avoid all explicit references to llama_kv_cache in llm_build_context. The goal is to be able to construct the computation graphs only through the abstract llama_context interface, which will hide the actual KV cache implementation and thus allow it to be overloaded based on the parameters of the specific use case. More generally, the llama_context hides not only the KV cache implementation, but all of the internal state (such as applied adapters, masks, etc., if any) with the exception of the model weights - these are still available to the llm_build_context in order to be able to construct the backbone graph of the various architectures.
- Avoid all explicit references to llama_kv_cache in llama_decode/llama_encode. These are abstracted through a new object, llama_batch_manager, which is produced by the current llama_context. Again, the goal is to not make explicit assumptions about the underlying KV cache implementation while processing the batches and to be able to delegate this logic to the llama_context. The llama_batch_manager is produced by the llama_context and will handle logic such as restoring the KV cache to a consistent state upon errors, splitting the input batch into micro-batches according to the internal processing logic, etc. (a rough sketch of such an interface follows below).
- Add initial serialization primitives to llama_kv_cache. In the future, these will be overloaded for the specific KV cache implementations through a common abstract interface.
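As a rough illustration of the direction for the batch manager (the names and signatures below are just a sketch, not the exact interface from this PR):

```cpp
// sketch only - illustrative names/signatures, not the exact code in this PR
#include "llama.h"

struct llama_batch_manager_i {
    virtual ~llama_batch_manager_i() = default;

    // validate the input batch and split it into micro-batches
    // according to the internal processing logic of the context
    virtual bool init(const llama_batch & batch) = 0;

    // advance to the next micro-batch; returns false when there are no more
    virtual bool next() = 0;

    // restore the KV cache (or other internal state) to a consistent
    // state if an error occurs while processing the batch
    virtual void restore() = 0;
};

// llama_decode()/llama_encode() would then only talk to this interface:
//   auto bman = ctx->prepare_batch(batch); // produced by the llama_context
//   while (bman->next()) { /* build + compute the graph for the micro-batch */ }
//   on error: bman->restore();
```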
The modifications so far are quite substantial and touch a large number of lines. Even though the code is in a very intermediate state, with many members still publicly exposed and without a proper object-oriented implementation in place, it should still be mergeable.
The general class hierarchy that I have in mind is like this:
graph TD;
llama_kv_cache_unified --> llama_kv_cache;
llama_kv_cache_standard --> llama_kv_cache;
llama_kv_cache_mamba --> llama_kv_cache;
... --> llama_kv_cache;
Here, llama_kv_cache_unified is basically the llama_kv_cache implementation that we currently have. In the future, we will add more implementations that would be appropriate for multi-user scenarios (e.g. llama_kv_cache_standard) or for Mamba architectures (llama_kv_cache_mamba).
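For illustration, the kind of base interface I have in mind looks roughly like this (method names are tentative):

```cpp
// rough sketch - the exact virtual interface will be defined in follow-up PRs
#include "llama.h"
#include <cstddef>
#include <cstdint>

struct llama_kv_cache {
    virtual ~llama_kv_cache() = default;

    // common cache operations, implemented differently by each variant
    virtual void      clear() = 0;
    virtual bool      seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) = 0;
    virtual llama_pos seq_pos_max(llama_seq_id seq_id) const = 0;

    // initial serialization primitives
    virtual size_t state_size () const = 0;
    virtual void   state_write(uint8_t * dst) const = 0;
};

struct llama_kv_cache_unified  : llama_kv_cache { /* the cache currently on master  */ };
struct llama_kv_cache_standard : llama_kv_cache { /* per-sequence, multi-user cache */ };
struct llama_kv_cache_mamba    : llama_kv_cache { /* recurrent state for Mamba      */ };
```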
graph TD;
llama_context --> llama_model;
llama_context --> llama_cparams;
llama_context --> llama_adapter;
llama_context --> etc..;
llama_context[<b>llama_context</b>];
llama_context_no_kv[<b>llama_context_no_kv</b><br><br>];
llama_context_unified[<b>llama_context_unified</b><br><br>llama_kv_cache_unified];
llama_context_standard[<b>llama_context_standard</b><br><br>llama_kv_cache_standard];
llama_context_mamba[<b>llama_context_mamba</b><br><br>llama_kv_cache_mamba];
llama_context_enc_dec[<b>llama_context_enc_dec</b><br><br>llama_kv_cache_standard];
llama_context_no_kv -.-> llama_context;
llama_context_unified -.-> llama_context;
llama_context_standard -.-> llama_context;
llama_context_mamba -.-> llama_context;
llama_context_enc_dec -.-> llama_context;
... -.-> llama_context;
The base llama_context class will implement common functionality such as low-level ggml buffer and backend management + adapters, without the notion of a KV cache. The derived classes will specialize the llama_context for different use-cases.
The llm_build_context would operate only through the llama_build_i interface and the batch processing will respectively only interact with the llama_batch_manager_i interface. The type of llama_context to construct in functions such as llama_init_from_model() would be determined based on the model and the specified context parameters. For example, the user would be able to create both llama_context_unified and llama_context_standard for a LLM_ARCH_QWEN2 model. Or a llama_context_no_kv for an encoding-only LLM_ARCH_BERT model. And so on.
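For illustration, the selection could look roughly like this (sketch only; the class names follow the diagram above):

```cpp
// sketch of how llama_init_from_model() could pick the context implementation
static llama_context * llama_context_create(const llama_model & model, const llama_context_params & params) {
    switch (model.arch) {
        case LLM_ARCH_BERT:
            // encoder-only model - no KV cache required
            return new llama_context_no_kv(model, params);
        case LLM_ARCH_MAMBA:
            return new llama_context_mamba(model, params);
        default:
            // decoder models: unified cache by default; a per-sequence
            // (standard) cache could be requested via the context params
            return new llama_context_unified(model, params);
    }
}
```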
API changes
The current changes only make the API naming more consistent with the existing convention. To migrate, simply replace the old API calls with the new ones (see the example after the list below).
- Deprecate the llama_kv_cache_... API
- Add the llama_kv_self_... API
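For example, migrating is a one-to-one rename (illustrative):

```cpp
// before (deprecated):
llama_kv_cache_clear (ctx);
llama_kv_cache_seq_rm(ctx, seq_id, p0, p1);

// after - same behavior, operating on the context's internal kv_self:
llama_kv_self_clear (ctx);
llama_kv_self_seq_rm(ctx, seq_id, p0, p1);
```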
In the future, the llama_kv_cache_... API will be changed to work with struct llama_kv_cache instead of struct llama_context and the functionality will be extended to support things like saving, copying, loading, etc.
Notes
- [x] Fix build_qwen2vl, inp_pos, lctx.n_pos_per_token hack
- [x] Worst case for n_outputs and n_outputs_enc in llm_build_context seems incorrect
- [x] Remove inp_s_seq - not used
- [x] Fix const bool is_sliding = il % sliding_window_pattern < (sliding_window_pattern - 1); struct ggml_tensor * KQ_mask_l = is_sliding ? KQ_mask_swa : KQ_mask;
- [x] Fix T5
- [x] Fix RWKV
- [ ] Fix batch.pos == NULL - llama_context::pos_max() is used incorrectly
- [x] Dedup the reserve code
- [x] Errors on unimplemented interface
- [x] Build multiple graphs per model (e.g. enc, dec, no-logits, etc.)
- [x] Implement causal input for cache-less llama_context
- [ ] Simplify encode()/decode()
- [x] Remove worst_case from the llama_graph_i API?
- [x] Wrap input tensors in structs
- [x] Add trace logs
PRs to resolve
- [x] https://github.com/ggerganov/llama.cpp/pull/11381 (e665b57)
- [ ] https://github.com/ggerganov/llama.cpp/pull/11446
- [x] https://github.com/ggerganov/llama.cpp/pull/11445
- [x] https://github.com/ggerganov/llama.cpp/pull/10573
- [ ] https://github.com/ggml-org/llama.cpp/pull/12108
New features
- [ ] https://github.com/ggerganov/llama.cpp/pull/11571
I am thinking about the following API change for this PR:
```c
// API on `master`
DEPRECATED(LLAMA_API void llama_kv_cache_clear(ctx));
DEPRECATED(LLAMA_API bool llama_kv_cache_seq_rm(ctx));
DEPRECATED(LLAMA_API void llama_kv_cache_seq_cp(ctx));
DEPRECATED(LLAMA_API void llama_kv_cache_seq_keep(ctx));
DEPRECATED(LLAMA_API void llama_kv_cache_seq_add(ctx));
DEPRECATED(LLAMA_API void llama_kv_cache_seq_div(ctx));
DEPRECATED(LLAMA_API llama_pos llama_kv_cache_seq_pos_max(ctx));
DEPRECATED(LLAMA_API void llama_kv_cache_defrag(ctx));
DEPRECATED(LLAMA_API bool llama_kv_cache_can_shift(ctx));
DEPRECATED(LLAMA_API void llama_kv_cache_update(ctx));
// works with `ctx.kv_self` - backwards compatible with `master`
LLAMA_API void llama_kv_self_clear(ctx);
LLAMA_API bool llama_kv_self_seq_rm(ctx);
LLAMA_API void llama_kv_self_seq_cp(ctx);
LLAMA_API void llama_kv_self_seq_keep(ctx);
LLAMA_API void llama_kv_self_seq_add(ctx);
LLAMA_API void llama_kv_self_seq_div(ctx);
LLAMA_API llama_pos llama_kv_self_seq_pos_max(ctx);
LLAMA_API void llama_kv_self_defrag(ctx);
LLAMA_API bool llama_kv_self_can_shift(ctx);
LLAMA_API void llama_kv_self_update(ctx);
// TODO: llama_kv_cache API
// can be implemented in a later PR
// new API to access the KV cache instance
struct llama_kv_cache;
LLAMA_API struct llama_kv_cache * llama_get_kv_self(ctx);
LLAMA_API void llama_set_kv_self(ctx, kv);
// allow to clone, free, save, load the kv cache
```
@slaren Coming back to your comment from earlier: https://github.com/ggerganov/llama.cpp/pull/11110#pullrequestreview-2543301497
- At some point we should abstract everything needed to model an architecture to a single class (such that each architecture is a subclass of this class)
- After that, llm_type should probably be removed entirely, and each architecture should have its own enum if needed, with a function to return the type as a string (which by default could be "")
In the OP I have outlined a possible approach to make the implementation more abstract. I have focused primarily on the abstraction of the KV cache and the llama context.
If I understand your suggestion correctly, the idea is to have the compute graph build functions for each of the arches (e.g. build_llama()) become part of llama_model (e.g. implement derived classes llama_model_llama, llama_model_qwen, etc.), which would effectively eliminate the need for llm_build_context. This way, the llama_context would be able to simply call model->build(), instead of relying on the graph to come from "outside". Do I understand the idea correctly?
I haven't really thought enough about this to make specific suggestions, but I think the goal should be to have an interface that can be used to define everything necessary to implement a model architecture. Ideally, to add support for a new architecture, it should only be necessary to define a new class and create a mapping between the architecture name in the GGUF file and this class. There may of course be more classes in the interface, but there should be a single entry point. So this should include more than just the graph build function, it should also include all the functions to load a model, create a context, and everything else that may be necessary to run a model. This interface would also need to be supported by other interfaces such as the KV cache abstraction, and the graph building helper functions that are currently in llm_build_context and the other llm_build_* functions.
To do this, I think it would be better to create an abstract interface that contains everything necessary to define a model architecture. I think that's likely to result in a cleaner and more maintainable codebase than using llama_model as a base class. Instead, llama_model (and other classes like llama_context) should use this interface to implement the functionality in llama.cpp. It may also be convenient to have one or more base classes that implement some of the common functionality that is shared between multiple model architectures, but it should not be strictly necessary to use these base classes.
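As a very rough sketch of the kind of single entry point this could be (all names here are hypothetical):

```cpp
// purely hypothetical - names invented for illustration
#include <string>

struct llm_arch_i {
    virtual ~llm_arch_i() = default;

    // everything needed to support an architecture sits behind this interface
    virtual void          load_hparams(llama_model_loader & ml, llama_hparams & hparams) = 0;
    virtual void          load_tensors(llama_model_loader & ml, llama_model   & model)   = 0;
    virtual ggml_cgraph * build_graph (llama_graph_i & gf, const llama_ubatch & ubatch)  = 0;

    // per-architecture type string instead of the global llm_type enum
    virtual std::string type_name() const { return ""; }
};

// adding a new architecture = one new class + one registry entry keyed by
// the architecture name stored in the GGUF file, e.g.:
//   registry["qwen2"] = [] { return std::make_unique<llm_arch_qwen2>(); };
```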
This is of course a very high level suggestion, it will take a lot of work to define all the details.
Thanks for the suggestions. I'll aim to create the abstract model interface and restructure the implementation so that the llm_build_context is no longer needed and all model-specific code is behind the new abstract interface. Will keep hacking on this PR for a while and try to bring it in a more complete state before merging.
I had some ideas about this, not sure if they are feasible, but...
- Would be nice if we could use caching at the level of individual tensors instead of "combined" tensor groups like the KV cache. It would give more flexibility to model implementers.
- I'd love to see generic tensor caching (the ability to save a computed tensor to memory and use it in subsequent graph computations).
- The key cache and value cache could then be a more specialized form of tensor caching along the context-length dimension. The current functionality that operates on the KV cache would gather all cached tensors of this type and perform the necessary operations on them (like prompt caching, context shifts, etc.).
- Recurrent model state caches could be another specialized form of tensor caching (this would allow implementing things like prompt caching at discrete context intervals?).
I see that you started working on things in the top-down direction, but it would be nice if the bottom of things was flexible enough to avoid per-model boilerplate cache code. Ideally it would be a generic set of bricks that would allow building caches for any architecture (even hybrid ones like MiniMax-01).
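Something like this, purely to illustrate the idea (invented names):

```cpp
// invented names - just to illustrate the "generic bricks" idea
#include "ggml.h"
#include <string>

struct llama_tensor_cache_i {
    virtual ~llama_tensor_cache_i() = default;

    // save a computed tensor so that subsequent graph computations can reuse it
    virtual void store(const std::string & name, const ggml_tensor * t) = 0;

    // create a view of the cached data inside a new graph's context
    virtual ggml_tensor * load(ggml_context * ctx, const std::string & name) = 0;
};

// - a K/V cache would be a specialization that appends along the
//   context-length dimension (prompt caching, context shifts, ...)
// - a recurrent-state cache would keep one tensor per sequence,
//   possibly snapshotted at discrete context intervals
```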
Sorry for my wishful thinking rant 😉, just wanted to provide some food for thought.
@fairydreaming This is useful feedback. I think this should all be possible in the end. Right now things might appear messy, but I will try to add the abstractions soon and hopefully things will become more clear. The recurrent models will have their own implementation of the KV cache. The tensor caching mechanism should be more general, I agree and will try to make progress in that direction too.
Sorry for the slow progress here - been lacking focus for a few days. Hopefully back on track soon.
Hi! I've just made the changes in the RWKV part according to what's already in this PR. The code is here: https://github.com/MollySophia/llama.cpp/tree/molly/llama-kv-cache I also made the graph-building parts of the existing RWKV models a bit more tidy.
@MollySophia Great! You can open a PR to this branch if you want.
This PR is getting close to completion. Here is an update of the new software architecture:
- llama_context is now a base class implementing cache-less (i.e. without a KV cache) inference. This means that the compute graphs operate on the current batch and there is no state being preserved across encode()/decode() calls. It also provides basic ggml-related functionality such as initializing backends and preparing output buffers for common things, such as logits, embeddings, etc.
- The llama_context now also implements a "graph building" interface llama_graph_i. The idea is that every model will utilize this interface to create its compute graphs. For example, where a model requires an attention block, it now calls llama_graph_i::build_attn() and delegates the logic to the specific instance of the llama_context. This is the main change that was needed to decouple the KV cache from the graph build logic, and it enables the implementation of new KV caches and model architectures (a simplified sketch follows after this list).
- The base llama_context is used for encoder-only models such as BERT since they do not require a KV cache. It can also work with any decoder model - it will just be slow because there is no cache.
- The llama_context_kv_self class inherits llama_context and adds a llama_kv_cache instance to support self-attention caching. This class is mainly useful for decoder models. In the future, it will support a "per-sequence" KV cache variation which will be utilized in multi-user use cases, because the regular unified KV cache is not optimal in such scenarios.
- The llama_context_recurrent class is used for recurrent models, such as RWKV and Mamba. It utilizes a llama_kv_cache_recurrent instance, which is currently implemented temporarily as a regular llama_kv_cache. In the future, we will implement a recurrent-specific cache that is suitable for these architectures. (cc @compilade)
- Before merging, I need to add an encoder-decoder context for models such as T5. I'm still working out the details, but I am thinking along the lines of composing 2 contexts - one for the encoder and one for the decoder. This should serve as the basic example for multi-modal support later on.
- The model definitions are now fully contained in llama-model.cpp. This includes hparams and data loading + graph builds. We are still utilizing a "build context" like before, but it is now simplified. In follow-up PRs, this will be improved by moving each model definition into a separate class.
- The encode()/decode() implementations of each llama_context currently have a lot of duplicated code. These should be improved in the future. The main purpose of these calls is to perform the micro-batching according to the attention implementation of the context, and improving this will likely require changes to the llama_ubatch, llama_sbatch and llama_batch objects. So to avoid increasing the scope of this PR even further, these will be reworked later on, likely within #11875.
- We can now also implement an MLA-specific llama_context_kv_self_mla for R1 models. It can have a customized llama_kv_cache_mla implementation + extra context state and a custom attention implementation. (cc @fairydreaming)
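A simplified sketch of that delegation (the actual build_attn() signature in the PR takes more arguments):

```cpp
// simplified - the real llama_graph_i::build_attn() has more parameters
struct llama_graph_i {
    virtual ~llama_graph_i() = default;

    // called from the model graph code wherever an attention block is needed;
    // how (and whether) K and V are cached is decided by the concrete context
    virtual ggml_tensor * build_attn(
            ggml_context * ctx0,
            ggml_cgraph  * gf,
            ggml_tensor  * q_cur,
            ggml_tensor  * k_cur,
            ggml_tensor  * v_cur,
            float          kq_scale,
            int            il) = 0;
};

// llama_context           : cache-less attention over the current batch only
// llama_context_kv_self   : stores/reads K, V through a llama_kv_cache
// llama_context_recurrent : recurrent state (RWKV, Mamba) instead of K/V history
```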
Pinging @MollySophia and @compilade if you could run some tests with this branch to check if the RWKV and Mamba models work correctly.
Any suggestions for improving the code are welcome. Hoping to have this ready for review in the next few days.
I've been quite on/off recently, but hopefully I can have a deeper look into this during the weekend.
The llama_context now also implements a "graph building" interface llama_graph_i. The idea is that every model will utilize this interface to create its compute graphs. For example, where a model requires an attention block, it now calls llama_graph_i::build_attn() and delegates the logic to the specific instance of the llama_context. This is the main change that was needed to decouple the KV cache from the graph build logic and enables the implementation of new KV caches and model architectures.
@ggerganov I see that there is an implicit assumption in the build_attn() method that the full Q, K and V vectors exist. But the main idea of #11446 is to avoid calculation of the full Q, K and V vectors. How do you plan to handle this case? Shall I add a separate build_attn_mla() method across the codebase, starting from the llama_graph_i interface and going through llama_context up to the new llama_context_kv_self_mla?
Also, since the model definition is now fully contained inside llama-model.cpp, I'm wondering if we can maybe implement some sort of "graph check" in the future. The idea is to export the graph for a given model as a "snapshot", then have a CI check to make sure it is not inadvertently changed. Ideally, this could be done without loading the model weights or creating a llama_context / KV cache.
Currently, we can debug the cgraph using ggml_graph_dump_dot, but this requires loading the model weights, which will be impossible for very large models (>70B).
I'm not sure if the idea is worth exploring, but I can create a dedicated issue to discuss more if needed.
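For example (just a sketch, assuming the graph can be built without allocating the weights), the snapshot could be as simple as dumping each node's op, name and shape and diffing the output in CI:

```cpp
// sketch: write a graph "fingerprint" that CI could diff against a committed snapshot
#include "ggml.h"
#include <cstdio>

static void dump_graph_snapshot(ggml_cgraph * gf, FILE * out) {
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        const ggml_tensor * node = ggml_graph_node(gf, i);
        fprintf(out, "%4d %-16s %-32s [%lld, %lld, %lld, %lld]\n",
                i, ggml_op_name(node->op), node->name,
                (long long) node->ne[0], (long long) node->ne[1],
                (long long) node->ne[2], (long long) node->ne[3]);
    }
}
```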
@ggerganov I see that there is an implicit assumption in the build_attn() method that the full Q, K and V vectors exist. But the main idea of #11446 is to avoid calculation of the full Q, K and V vectors. How do you plan to handle this case? Shall I add a separate build_attn_mla() method across the codebase, starting from the llama_graph_i interface and going through llama_context up to the new llama_context_kv_self_mla?
@fairydreaming Yes, you should add a new llama_graph_i::build_attn_mla() in case the current signature of llama_graph_i::build_attn() is not suitable for the MLA use case. The default implementation in llama-graph.cpp would be to return an error of not being implemented. You then don't need to make any changes to llama_context - you only need to implement it in llama_context_kv_self_mla.
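i.e. roughly (sketch):

```cpp
// sketch: the default in llama-graph.cpp errors out;
// only llama_context_kv_self_mla overrides it with the actual MLA graph
ggml_tensor * llama_graph_i::build_attn_mla(/* kv_compressed, k_pe, q_pe, q_nope, ... */) {
    GGML_ABORT("build_attn_mla(): not implemented by this llama_context");
}
```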
Looks good overall. Some points I'm thinking about for my vision PR:
- Having a derived class llama_vision_context : llama_context as you said
- Input image tokens will be obtained via llama_batch_ext; they will be passed to llama_vision_context::input_set which can work with pixel values instead of text tokens
- The output tensor will be saved to llama_context::embd_tensor ==> need to add this to the base class
Overall, yes. The details are not yet completely clear to me - I think once the T5 encoder-decoder use case is implemented we will have a clearer picture and a starting point for multi-modal support.
What I am trying to do is to be able to compose the llama_contexts together. For example, if we look at the Whisper model (because I am familiar with it), it is essentially an Encoder + Decoder. So it should be possible to implement it using the same llama_context_enc_dec that we would use for the T5 models. At the same time, we should also have the option to create the encoder/decoder contexts individually. For example, I would be able to say: create a llama_context_enc from this Whisper model. And this context will be used only for encoding audio to embeddings.
Extending this analogy, a vision model is likely to fit in the same llama_context_enc_dec implementation because it is again an Encoder + Decoder stitched together. The specific input type is not relevant to the context's purpose/implementation.
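Roughly along these lines (just a sketch of the composition, member names invented):

```cpp
// sketch - composing two contexts for encoder-decoder models (T5, Whisper, ...)
struct llama_context_enc_dec {
    llama_context         ctx_enc; // cache-less encoder, produces embeddings
    llama_context_kv_self ctx_dec; // decoder with self-attention KV cache,
                                   // cross-attending over the encoder output

    // encode() runs ctx_enc and stores the embeddings;
    // decode() feeds them to ctx_dec as cross-attention input
};

// the encoder could also be created standalone, e.g. to only encode
// audio to embeddings with a Whisper-style model
```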
@ggerganov I think there's still one thing missing. There should be an abstract kv cache interface, llama_kv_cache_i or something like this that caches would implement (and llama_context::get_kv_self() would return this type). I see that you initially planned that llama_kv_cache type would serve as a base type for caches, but as far as I can see this is not implemented (the current llama_kv_cache still contains code specific to caching K/V vectors inside and there's no llama_kv_cache_unified).
Pinging @MollySophia and @compilade if you could run some tests with this branch to check if the RWKV and Mamba models work correctly.
Any suggestions for improving the code are welcome. Hoping to have this ready for review in the next few days.
Hi! Sorry for my late response. I'll do some tests this weekend or next week - I have been busy these days :p
@ggerganov I think there's still one thing missing. There should be an abstract kv cache interface, llama_kv_cache_i or something like this, that caches would implement (and llama_context::get_kv_self() would return this type). I see that you initially planned that the llama_kv_cache type would serve as a base type for caches, but as far as I can see this is not implemented (the current llama_kv_cache still contains code specific to caching K/V vectors inside and there's no llama_kv_cache_unified).
@ggerganov A gentle nudge in case you missed my last comment. I need this to resume work on #11446.
@fairydreaming Yes, there should be an abstract interface for the KV cache. Will add this today.
@ggerganov I got DeepSeek R1 working with a custom MLA cache and context type (I still have to test cache save/restore). A few thoughts that came to my mind while working on this:
- In llama.h llama_context_params there are still parameters specific to only one type of KV cache: type_k and type_v. We need something more general to handle different kinds of caches (maybe a union of structs, one for each kind of context/cache?).
- I see that creation of the context in llama_init_from_model() is currently based on the model->arch value. I think it would be a good idea to allow overriding the type of created context in the command line params (see the sketch below). To do this we would need a context type enum in llama.h and a context type field in struct llama_context_params and in llama_cparams. This could be useful both for debugging purposes (running models without a KV cache for whatever reason) and for selecting the attention implementation for the DEEPSEEK2 architecture (naive vs MLA - now they directly map to two different llama context types and could both be supported without creating another architecture).
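Something along these lines, perhaps (hypothetical names):

```cpp
// hypothetical addition to llama.h - names are illustrative
enum llama_context_type {
    LLAMA_CONTEXT_TYPE_AUTO = 0, // pick based on model->arch (current behavior)
    LLAMA_CONTEXT_TYPE_NO_KV,
    LLAMA_CONTEXT_TYPE_KV_SELF,
    LLAMA_CONTEXT_TYPE_KV_SELF_MLA,
    LLAMA_CONTEXT_TYPE_RECURRENT,
};

struct llama_context_params {
    // ... existing fields ...
    enum llama_context_type context_type; // overrides the arch-based default when != AUTO
};
```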
Yes, this is all planned. I am currently working on a relatively big change to make the inputs decoupled from the llama_context and after that will fix these points.
This could be useful ... and for selection of attention implementation for DEEPSEEK2 architecture (naive vs MLA - now they directly map to two different llama context types and could both be supported without creating another architecture)
@fairydreaming Curious, how did you achieve this, since the naive and MLA attention have different signatures, correct? So when you build the graph, you need to stick to one of the contexts - the one that implements the specific interface.
This could be useful ... and for selection of attention implementation for DEEPSEEK2 architecture (naive vs MLA - now they directly map to two different llama context types and could both be supported without creating another architecture)
@fairydreaming Curious, how did you achieve this, since the naive and MLA attention have different signatures, correct? So when you build the graph, you need to stick to one of the contexts - the one that implements the specific interface.
@ggerganov I don't have this implemented, it's just an idea. But I don't see a reason why this wouldn't work. I mean it's just a matter of creating the given context object in llama_init_from_model() and adding one if in build_deepseek2() when building the attention part of the graph, something like this:
```cpp
...
if (llama_cparams.context_type == LLAMA_CONTEXT_KV_SELF) {
    // naive attention impl
    // calculate full Q, K, V vectors
    build_attn(... q, k, v, ...);
} else if (llama_cparams.context_type == LLAMA_CONTEXT_KV_SELF_MLA) {
    // optimized MLA attention impl
    // calculate only kv_compressed, k_pe, q_pe, q_nope
    build_attn_mla(... kv_compressed, k_pe, q_pe, q_nope, ...);
}
...
```
Superseded by https://github.com/ggml-org/llama.cpp/pull/12181
I tried to build the latest version on Termux, with and without OpenBLAS, and in both cases I get this error - can you help? ./bin/llama-cli CANNOT LINK EXECUTABLE "./bin/llama-cli": cannot locate symbol "llama_kv_self_seq_rm" referenced by "/data/data/com.termux/files/home/llama.cpp/bin/llama-cli"...