
Reducing the time needed to reload a piece of text into the model by caching the state

Open niansa opened this issue 1 year ago • 7 comments

Hey!

Is it possible to add a way of dumping the current state into a file, so it can then be reloaded later? This would avoid the time needed to reload a long prompt over and over again.
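To make the idea concrete, here is a minimal sketch of what I have in mind: serialize the number of evaluated tokens plus the raw K/V cache contents to a binary file, and read them back later. The `SavedState` type and function names are purely illustrative, not llama.cpp API.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical container for the model's evaluated state.
struct SavedState {
    int32_t n_past = 0;          // how many tokens have been evaluated
    std::vector<float> kv_data;  // raw contents of the K and V tensors
};

// Write n_past, the element count, and the raw cache data to a file.
bool save_state(const char *path, const SavedState &st) {
    FILE *f = std::fopen(path, "wb");
    if (!f) return false;
    const uint64_t n = st.kv_data.size();
    bool ok = std::fwrite(&st.n_past, sizeof(st.n_past), 1, f) == 1 &&
              std::fwrite(&n, sizeof(n), 1, f) == 1 &&
              std::fwrite(st.kv_data.data(), sizeof(float), n, f) == n;
    std::fclose(f);
    return ok;
}

// Read the state back; returns false on any short read.
bool load_state(const char *path, SavedState &st) {
    FILE *f = std::fopen(path, "rb");
    if (!f) return false;
    uint64_t n = 0;
    bool ok = std::fread(&st.n_past, sizeof(st.n_past), 1, f) == 1 &&
              std::fread(&n, sizeof(n), 1, f) == 1;
    if (ok) {
        st.kv_data.resize(n);
        ok = std::fread(st.kv_data.data(), sizeof(float), n, f) == n;
    }
    std::fclose(f);
    return ok;
}
```

With something like this, reloading a long prompt would be a single file read instead of a full re-evaluation.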

Thanks Niansa

niansa avatar Mar 16 '23 09:03 niansa

#174 also asked this, or do you have something else in mind?

bitRAKE avatar Mar 16 '23 10:03 bitRAKE

Thank you for using llama.cpp and thank you for sharing your feature request! You'll be excited to hear that what you're requesting is my top priority right now. I'm using #91 as the best place to discuss this, since the solution will entail using mmap(). Everyone is welcome to participate in helping us find the best solution. I believe mmap() will reduce startup latency to effectively zero, for everyone, and it'll work on nearly every platform on earth, including Windows, which has a nearly equivalent API.

jart avatar Mar 16 '23 11:03 jart

I think this is a different issue — that one is about changing how the model is loaded, this one is about reducing the time needed to reload a piece of text into the model by caching the state.

j-f1 avatar Mar 16 '23 12:03 j-f1

As you wish. Re-opening.

jart avatar Mar 16 '23 16:03 jart

> #174 also asked this, or do you have something else in mind?

Basically yes, except that interactive user input and generated results should be saved too. So you could save, stop, and later continue right where you (and the model) left off, even on another PC.

niansa avatar Mar 16 '23 17:03 niansa

I can't find it now, but @ggerganov said that saving/restoring the K & V tensors would preserve the state, IIRC. https://github.com/ggerganov/llama.cpp/blob/721311070e31464ac12bef9a4444093eb3eaebf7/main.cpp#L79-L82
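For a sense of scale, here is a back-of-the-envelope calculation of how much data those tensors hold, assuming (my assumption, based on the linked code) that the cache stores one `n_embd`-wide row per layer per past token, in two tensors (K and V):

```cpp
#include <cstdint>

// Total bytes held by the K and V cache tensors: two tensors, each with
// n_layer * n_ctx * n_embd elements of bytes_per_elem bytes.
uint64_t kv_cache_bytes(uint64_t n_layer, uint64_t n_ctx, uint64_t n_embd,
                        uint64_t bytes_per_elem) {
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem;
}
```

For 7B (n_layer = 32, n_embd = 4096) at f32 with a 512-token context this comes to 512 MiB, so an on-disk snapshot is large but entirely feasible.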

bitRAKE avatar Mar 16 '23 18:03 bitRAKE

@bitRAKE Yes, those are the transformer's hidden state; preserving them is sufficient. Now the question is how to edit them properly. I'm also interested in removing the first n elements to deal with the context memory filling up.
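Something like the following is what I mean by removing the first n elements, assuming (this layout is my guess, not confirmed) a flat per-layer cache of `[n_layer][n_ctx][n_embd]` floats. One caveat I'm aware of: with rotary position embeddings the remaining K entries still encode their old absolute positions, so plainly shifting them is not equivalent to re-evaluating the shortened context.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Drop the first n_drop cached tokens from each layer by shifting the
// remaining rows to the front. Layout assumed: [n_layer][n_ctx][n_embd].
void kv_drop_front(std::vector<float> &cache, size_t n_layer, size_t n_ctx,
                   size_t n_embd, size_t n_past, size_t n_drop) {
    const size_t row = n_embd;            // one cached token within a layer
    const size_t layer_sz = n_ctx * row;  // full per-layer slab
    for (size_t il = 0; il < n_layer; ++il) {
        float *base = cache.data() + il * layer_sz;
        // move tokens [n_drop, n_past) down to [0, n_past - n_drop)
        std::copy(base + n_drop * row, base + n_past * row, base);
    }
}
```

After this the caller would also decrement n_past by n_drop; the position-encoding problem above is the part that still needs a proper answer.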

jarcen avatar Mar 16 '23 18:03 jarcen

This issue is a duplicate of #64, isn't it? Since llama-rs did essentially the same thing (first in https://github.com/rustformers/llama-rs/pull/14, then with a slightly different interface in https://github.com/rustformers/llama-rs/issues/38), this is definitely feasible and would be really useful.

May I suggest closing this issue and continuing the discussion in #64?

One use case that would benefit greatly from session (KV) caching is story generation: evaluate an initial prompt once, then explore the most promising generated continuations without paying the prompt cost again each time.
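The branching pattern I have in mind looks roughly like this. `State` and `eval()` are toy stand-ins, not llama.cpp API; the point is that restoring a snapshot is just a cheap copy of the cached state:

```cpp
#include <string>
#include <vector>

// Toy stand-in for the model state: cached K/V data plus a token count.
struct State {
    int n_past = 0;
    std::vector<float> kv;
};

// Pretend evaluation: appends one cache row per character and advances
// n_past. A real eval would run the transformer instead.
void eval(State &st, const std::string &text) {
    for (char c : text) {
        st.kv.push_back(static_cast<float>(c));
        st.n_past++;
    }
}

// Fork one branch per continuation from a shared prompt snapshot; only
// the continuation itself is (pretend-)evaluated in each branch.
std::vector<State> branch(const State &prompt_state,
                          const std::vector<std::string> &continuations) {
    std::vector<State> out;
    for (const auto &cont : continuations) {
        State s = prompt_state;  // restore the snapshot: a plain copy
        eval(s, cont);
        out.push_back(s);
    }
    return out;
}
```

With long prompts, this turns exploring k alternatives from k full evaluations into one evaluation plus k cheap continuations.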

sgoll avatar Mar 30 '23 15:03 sgoll

Yes, it is the same

ggerganov avatar Mar 30 '23 17:03 ggerganov