llama.cpp
Reducing the time needed to reload a piece of text into the model by caching the state
Hey!
Is it possible to add a way of dumping the current state into a file, so it can then be reloaded later? This would avoid the time needed to reload a long prompt over and over again.
Thanks, Niansa
#174 also asked this, or do you have something else in mind?
Thank you for using llama.cpp and thank you for sharing your feature request! You'll be excited to hear that what you're requesting is my top priority right now. I'm using #91 as the best place to discuss this, since the solution will entail using mmap(). Everyone is welcome to participate in helping us find the best solution. I believe mmap() will reduce startup latency to effectively zero, for everyone, and it'll work on nearly every platform on earth, including Windows, which has a nearly equivalent API.
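To make the mmap() idea concrete, here is a minimal POSIX sketch, not llama.cpp's actual loader: map the weights file read-only and let the kernel page it in on demand, so "loading" becomes a pointer assignment rather than a copy. The file layout implied here is a placeholder.

```cpp
// Minimal mmap() sketch (POSIX). Hypothetical file layout; the real
// llama.cpp loader and tensor format differ.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

const void * map_weights(const char * path, size_t * size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return nullptr; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return nullptr; }

    // Read-only private mapping: pages are faulted in lazily, so startup
    // cost is (almost) independent of model size.
    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after close
    if (addr == MAP_FAILED) { perror("mmap"); return nullptr; }

    *size_out = (size_t) st.st_size;
    return addr; // tensor data pointers can alias directly into this region
}
```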
I think this is a different issue: that one is about changing how the model is loaded, while this one is about reducing the time needed to reload a piece of text into the model by caching the state.
As you wish. Re-opening.
> #174 also asked this, or do you have something else in mind?
Basically yes, except that interactive user input and generated results should be saved too. That way you can save, stop, and later continue where the model (and you) left off, even on another PC.
I can't find it now, but @ggerganov said save/restore of the K and V tensors would preserve the state, IIRC. https://github.com/ggerganov/llama.cpp/blob/721311070e31464ac12bef9a4444093eb3eaebf7/main.cpp#L79-L82
@bitRAKE Yes, those are the transformer's hidden state; preserving them is sufficient. Now the question is how to edit them properly. I'm also interested in removing the first n elements to deal with the context memory filling up.
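For concreteness, a rough sketch of what dumping and restoring those tensors could look like, assuming direct access to the model's memory_k / memory_v tensors from the lines linked above. The names dump_state and load_state are made up for illustration; a real implementation would also need a versioned format and, for resuming a chat, the token history.

```cpp
// Illustrative only: assumes access to the model's memory_k / memory_v
// ggml tensors (see the linked main.cpp lines) and the current n_past.
#include <cstdio>
#include "ggml.h"

bool dump_state(const char * path, const ggml_tensor * memory_k,
                const ggml_tensor * memory_v, int n_past) {
    FILE * f = fopen(path, "wb");
    if (!f) return false;
    // n_past tells a later session how many cached positions are valid
    fwrite(&n_past, sizeof(n_past), 1, f);
    fwrite(memory_k->data, 1, ggml_nbytes(memory_k), f);
    fwrite(memory_v->data, 1, ggml_nbytes(memory_v), f);
    fclose(f);
    return true;
}

bool load_state(const char * path, ggml_tensor * memory_k,
                ggml_tensor * memory_v, int * n_past) {
    FILE * f = fopen(path, "rb");
    if (!f) return false;
    bool ok = fread(n_past, sizeof(*n_past), 1, f) == 1
           && fread(memory_k->data, 1, ggml_nbytes(memory_k), f) == ggml_nbytes(memory_k)
           && fread(memory_v->data, 1, ggml_nbytes(memory_v), f) == ggml_nbytes(memory_v);
    fclose(f);
    return ok;
}
```

Note that truncating the first n positions is trickier than a memmove: with RoPE the cached keys encode absolute positions, so naively shifting the cache leaves the remaining entries rotated for the wrong positions.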
This issue is a duplicate of #64, isn't it? Since llama-rs did essentially the same thing, first in https://github.com/rustformers/llama-rs/pull/14, then with a slightly different interface in https://github.com/rustformers/llama-rs/issues/38, this is definitely feasible and would be really useful.
May I suggest closing this issue and continuing the discussion in #64?
One use case that would benefit greatly from session (KV) caching is story generation: evaluate an initial prompt once, then branch into the most promising of the generated continuations without paying the prompt cost again.
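As a usage sketch, reusing the hypothetical dump_state / load_state helpers above: evaluate the shared prompt once, snapshot the state, and restore it before exploring each alternative continuation. eval_tokens and sample_continuation are placeholders standing in for the real prompt-evaluation and sampling code in main.cpp.

```cpp
// Hypothetical branching loop built on the dump_state / load_state
// helpers sketched above.
#include <vector>
#include "ggml.h"

void eval_tokens(const std::vector<int> & tokens, int * n_past); // placeholder
void sample_continuation(int n_past);                            // placeholder

void explore_branches(ggml_tensor * memory_k, ggml_tensor * memory_v,
                      const std::vector<int> & prompt_tokens) {
    int n_past = 0;
    eval_tokens(prompt_tokens, &n_past);            // pay the prompt cost once
    dump_state("story.state", memory_k, memory_v, n_past);

    for (int branch = 0; branch < 4; ++branch) {
        // rewind to the cached prompt instead of re-evaluating it
        load_state("story.state", memory_k, memory_v, &n_past);
        sample_continuation(n_past);                // generate one alternative
    }
}
```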
Yes, it is the same.