
Implement caching for evaluated prompts

Open abetlen opened this issue 1 year ago • 27 comments

The goal of this feature is to reduce latency for repeated calls to the chat_completion api by saving the kv_cache keyed by the prompt tokens.

The basic version of this is to simply save the kv_state after the prompt is generated.

Additionally we should investigate if it's possible save and restore the kv_state after the completion has been generated as well.
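
A minimal sketch of the idea (hypothetical names, not the eventual implementation; the real llama.cpp state API is discussed later in this thread):

from typing import Dict, Optional, Sequence, Tuple

class PromptStateCache:
    # Maps a tuple of prompt tokens to an opaque saved kv_state blob.
    def __init__(self) -> None:
        self._states: Dict[Tuple[int, ...], bytes] = {}

    def save(self, prompt_tokens: Sequence[int], kv_state: bytes) -> None:
        self._states[tuple(prompt_tokens)] = kv_state

    def lookup(self, prompt_tokens: Sequence[int]) -> Optional[bytes]:
        # Exact-match lookup; a smarter version could match the longest shared prefix.
        return self._states.get(tuple(prompt_tokens))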

abetlen avatar Apr 08 '23 08:04 abetlen

I don't understand why we don't just use interactive mode. Almost all of the users coming from llama.cpp are used to an interface where they send a message and get a quick response, because there is no clearing of state between messages from the user, meaning there's also no need to reload the state. As I understood it, the KV cache was a way to store the prompt state, because the prompt is used multiple times during the course of a conversation, and caching it helps responsiveness in long conversations. Given the way it's used in the base llama.cpp executable, and the fact that in the current implementation of interactive mode storing the entire conversation state wouldn't improve performance (it would only allow continuation of previous conversations in a different session), I don't know that this is something they're going to add in the immediate future.

For me, being able to get completions from the bot with full context of the ongoing conversation is my main use case. So there's pretty much no situation where I would want the current conversation context cleared or reset.

And I thought this was similar for the OpenAI implementation, where you send the current message but don't need to send the full message history. Any kind of recomputation or loading of model state decreases performance and makes it slower than the base llama.cpp implementation, imo.

After all, I think if people are using chat mode, from a user perspective, they want a continuous and performant chat. Even if that means running models and independent contexts simultaneously, which reduces scalability in the short term without the ability to load and save the states.

MillionthOdin16 avatar Apr 11 '23 02:04 MillionthOdin16

@MillionthOdin16 are you talking about the OpenAI server or just using the Llama class? For the actual OpenAI API each request is entirely independent of all other requests (e.g. you always send the full history to the /chat/completions endpoint), so you do need to reset the model each time. This is why I'm looking into the KV state solution, so we can just reload the state if we've seen e.g. the first n-1 messages of an n-message chat get sent over.

If you're just looking for interactive mode, though, I believe that's been implemented in this example, if you just want to use it in a program and don't care about the API: https://github.com/abetlen/llama-cpp-python/blob/main/examples/low_level_api/low_level_api_chat_cpp.py

abetlen avatar Apr 11 '23 03:04 abetlen

I'm talking about the OpenAI server. My point is that, from the user perspective, the most important factor for chat completions is response speed. Unfortunately, llama.cpp takes longer to process the initial prompt the longer it is. So for the chat completions endpoint this creates an issue: the longer the conversation gets, the longer it takes to get a response. We have this issue because llama.cpp handles prompt processing before generating a response differently from the usual GPU implementations.

So I'm saying that the most efficient solution in this instance might be to not clear the context, keeping the session going and saving that processing time for each subsequent completion. It diverges from how OpenAI implements it, but it's the only option we have right now, and chat completions isn't usable without it because it's too slow.

MillionthOdin16 avatar Apr 11 '23 03:04 MillionthOdin16

I'm basically advocating for a temporary hack to prevent the context from being cleared during chat completions so that we get a significant performance boost, until either we get a proper state saving capability, or the prompt processing time issue is resolved.

The issue is frustrating because we're so close to having an API that is performant and chat capable, but there are just a couple of things holding it back, and I'm advocating for a temporary hack to allow good performance until we can implement it properly.

MillionthOdin16 avatar Apr 11 '23 04:04 MillionthOdin16

Unfortunately, llama.cpp takes longer to process the initial prompt the longer it is

@MillionthOdin16 Is that still meaningfully so since the recent performance regressions appear to have been fixed?

  • Fix proposed: https://github.com/ggerganov/llama.cpp/issues/603#issuecomment-1497569526
    • https://github.com/ggerganov/llama.cpp/pull/775
  • Fix seemingly confirmed: https://github.com/ggerganov/llama.cpp/issues/603#issuecomment-1498024558
  • Potentially also resolved by the above: https://github.com/ggerganov/llama.cpp/issues/677#issuecomment-1502594077
  • Potentially also resolved by the above: https://github.com/ggerganov/llama.cpp/issues/735#issuecomment-1502595371

0xdevalias avatar Apr 11 '23 04:04 0xdevalias

Unfortunately, llama.cpp takes longer to process the initial prompt the longer it is

@MillionthOdin16 Is that still meaningfully so since the recent performance regressions appear to have been fixed?

So that's not what I mean in this case. I created issue 603 on llama.cpp, and now that we have that performance boost, it would be awesome to get as much of a boost in the API over here as we can. I meant the issue/undetermined cause here: ggerganov/llama.cpp#719

I've seen people more familiar with LLMs mention some oddities about the initial processing in the past, but I haven't seen a straightforward explanation. As I understand it, llama.cpp processes the initial prompt differently before generating tokens, and it's much slower than the transformers implementation (CPU vs. GPU aside).

So I was just saying that if we get a performance boost from that avenue, as well as the ability to store a conversation's state, a proper implementation will be much faster than what we have at the moment, and we wouldn't need this workaround. Hope that makes sense.

Right now there are so many small, separate issues going on in the main repo that it's hard to keep track of them all, haha.

MillionthOdin16 avatar Apr 11 '23 06:04 MillionthOdin16

+1 to this. Many people are requesting this feature here: https://github.com/oobabooga/text-generation-webui/issues/866

It would be nice to have a use_cache=True flag (or something similar) in Llama.generate.

oobabooga avatar Apr 11 '23 22:04 oobabooga

@oobabooga I'm still working on the caching API, but for now I've added a reset option for Llama.generate which defaults to True; if you want to continue from the end of a previous generation, just call with reset=False.
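
A rough usage sketch (the model path and sampling parameters are placeholders, and the exact generate signature may vary between versions):

import llama_cpp

llm = llama_cpp.Llama(model_path="./models/ggml-model-q4_0.bin")  # placeholder path

# First call: the full prompt is evaluated.
tokens = llm.tokenize(b"My favorite season is")
for i, token in enumerate(llm.generate(tokens, top_k=40, top_p=0.95, temp=0.8, repeat_penalty=1.1)):
    if i >= 32:
        break

# Second call: pass only the appended text and reset=False to keep the old context.
# (Depending on the version, tokenize may prepend an extra BOS token here.)
more = llm.tokenize(b" I like to walk outside")
for i, token in enumerate(llm.generate(more, top_k=40, top_p=0.95, temp=0.8, repeat_penalty=1.1, reset=False)):
    if i >= 32:
        break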

abetlen avatar Apr 13 '23 04:04 abetlen

@abetlen is it safe to use reset=False at all times, or will that cause incorrect generations if you completely replace the prompt with a new one?

oobabooga avatar Apr 14 '23 00:04 oobabooga

@oobabooga What reset=False basically does is maintain the current model context. So once you feed in a prompt with reset=False, the prompt remains inside llama.cpp and is kept in memory for faster generation of new output.

E.g. You feed in the prompt "My favorite season is", the model replies "spring because everything is growing. I like to walk". Generation stops. You feed in to generate "outside in the spring weather" (not the full prompt!) and to the model the full prompt is now "My favorite season is spring because everything is growing. I like to walk outside in the spring weather".

I tested this in your own webui :grin: by setting reset=False and doing generations in notebook mode; if I clear the textbox, the AI still continues chatting as if the original text were still there. The text is also generated right away with no delay, as it doesn't need to digest the prompt again!

This is getting a bit off-topic, but to implement this in the webui I think the easiest way would be to save the prompt right before sending it to generate. Then, when the user calls generate again, you can compare the saved prompt with the user's prompt to see if they are merely appending to the existing prompt or editing it. If it is an append, call with reset=False and send only the appended text over; otherwise send everything over and force a reset.
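
A minimal sketch of that append-detection idea (a hypothetical helper, not actual webui code):

last_prompt = ""

def generate_with_context_reuse(llm, prompt, **sampling_kwargs):
    # Compare against the previously sent prompt: if the new prompt merely
    # appends to it, only evaluate the appended part and keep the old context.
    global last_prompt
    if last_prompt and prompt.startswith(last_prompt):
        tokens = llm.tokenize(prompt[len(last_prompt):].encode("utf-8"))
        reset = False
    else:
        tokens = llm.tokenize(prompt.encode("utf-8"))
        reset = True
    last_prompt = prompt
    return llm.generate(tokens, reset=reset, **sampling_kwargs)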

ghost avatar Apr 14 '23 02:04 ghost

for now I've added a reset option for Llama.generate which defaults to True; if you want to continue from the end of a previous generation, just call with reset=False.

Will this be exposed through the REST API at some point?

gjmulder avatar Apr 14 '23 04:04 gjmulder

@oobabooga @eiery @gjmulder this is now pushed to main; I'm just looking for someone to test generate and __call__ before publishing the PyPI release. The code is a bit of a mess right now, but the interface should remain the same.

The process to set the cache from code is:

llama = llama_cpp.Llama(...)
llama.set_cache(llama_cpp.LlamaCache())

then you can call generate and __call__ as usual, and if the prompt contains the previously generated tokens or the previously returned string, the cache will just continue from after those tokens/bytes.
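
For example (placeholder model path; a sketch of the intended usage rather than a guaranteed recipe), a second call whose prompt starts with the previous prompt plus the previously returned text should reuse the cached state instead of re-evaluating it:

import llama_cpp

llama = llama_cpp.Llama(model_path="./models/ggml-model-q4_0.bin")
llama.set_cache(llama_cpp.LlamaCache())

prompt = "Q: Name the planets in the solar system. A:"
first = llama(prompt, max_tokens=32)
answer = first["choices"][0]["text"]

# The new prompt extends the old prompt + returned text, so the cached state applies.
second = llama(prompt + answer + " Q: Which of them is the largest? A:", max_tokens=32)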

If you're using the REST server it's enough to set the CACHE=1 environment variable.

If it works like you guys expect I'll publish to PyPI tonight or tomorrow.

abetlen avatar Apr 15 '23 16:04 abetlen

$ CACHE=1 python3 -m llama_cpp.server ?

gjmulder avatar Apr 15 '23 17:04 gjmulder

@abetlen I have made a test where I generated 80 tokens, and then generated another 80 tokens on top of the result without modifying the prompt. These were the results:

With self.model.set_cache(LlamaCache):

Output generated in 51.90 seconds (1.54 tokens/s, 80 tokens, context 761, seed 559310896)
generate cache hit
Output generated in 61.77 seconds (1.30 tokens/s, 80 tokens, context 841, seed 1141193019)

Without set_cache:

Output generated in 51.90 seconds (1.54 tokens/s, 80 tokens, context 761, seed 1808425735)
Output generated in 55.96 seconds (1.43 tokens/s, 80 tokens, context 841, seed 1321984670)

I can see that there is a new generate cache hit message, but I don't seem to get any performance improvement. Not sure if I am doing this correctly.

oobabooga avatar Apr 15 '23 17:04 oobabooga

$ CACHE=1 python3 -m llama_cpp.server ?

@gjmulder Correct

@oobabooga woops, for generate I implemented the check but didn't actually remove the old tokens from the list of tokens to eval. Should be fixed now.

abetlen avatar Apr 15 '23 21:04 abetlen

@abetlen Caching is working well for me in your latest release :confetti_ball: .

I'm running it using a modified oobabooga UI with self.model.set_cache(LlamaCache) set, and generation starts instantly with no ingestion delay if the previous text is not altered. If the text is altered, we get a cache miss and regenerate fully with no issues. The performance increase with caching is huge, as seen below.

Loading llama-13B-ggml...
llama.cpp weights detected: models/llama-13B-ggml/ggml-model-q4_0.bin

llama.cpp: loading model from models/llama-13B-ggml/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
Loading the extension "gallery"... Ok.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 33.95 seconds (0.38 tokens/s, 13 tokens, context 39, seed 558843024)
generate cache hit
Output generated in 14.88 seconds (1.34 tokens/s, 20 tokens, context 61, seed 1039449246)
generate cache hit
Output generated in 12.17 seconds (1.31 tokens/s, 16 tokens, context 88, seed 523733239)
Output generated in 68.35 seconds (1.08 tokens/s, 74 tokens, context 121, seed 912952673)
generate cache hit
Output generated in 31.92 seconds (1.82 tokens/s, 58 tokens, context 210, seed 1327347234)
Output generated in 66.78 seconds (0.25 tokens/s, 17 tokens, context 349, seed 1946798230)
generate cache hit
Output generated in 24.49 seconds (1.31 tokens/s, 32 tokens, context 379, seed 429283322)
generate cache hit
Output generated in 9.80 seconds (1.12 tokens/s, 11 tokens, context 420, seed 559845450)
Output generated in 77.58 seconds (0.10 tokens/s, 8 tokens, context 472, seed 1239183125)
generate cache hit
Output generated in 17.79 seconds (1.52 tokens/s, 27 tokens, context 492, seed 2013844718)
generate cache hit
Output generated in 7.60 seconds (1.32 tokens/s, 10 tokens, context 527, seed 609475087)
Output generated in 103.58 seconds (0.19 tokens/s, 20 tokens, context 564, seed 1553215150)

ghost avatar Apr 15 '23 23:04 ghost

@eiery very glad to hear!

Hopefully, the llama_state api is figured out in the base library soon and then we're really talking, then we can just restore to the longest matching saved state in an LRU cache or something.
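
A rough sketch of that idea, assuming hypothetical opaque saved-state objects (the real llama_state API did not exist yet at this point):

from collections import OrderedDict

class PrefixStateLRU:
    # LRU cache of saved model states keyed by token prefixes (hypothetical sketch).
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.states = OrderedDict()  # tuple(tokens) -> saved state

    def put(self, tokens, state):
        key = tuple(tokens)
        self.states[key] = state
        self.states.move_to_end(key)
        if len(self.states) > self.capacity:
            self.states.popitem(last=False)  # evict the least recently used state

    def longest_prefix(self, tokens):
        # Return (prefix_length, state) for the longest saved prefix of `tokens`.
        best_len, best_state = 0, None
        for key, state in self.states.items():
            if len(key) <= len(tokens) and tuple(tokens[:len(key)]) == key and len(key) > best_len:
                best_len, best_state = len(key), state
        if best_state is not None:
            self.states.move_to_end(tuple(tokens[:best_len]))
        return best_len, best_state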

abetlen avatar Apr 15 '23 23:04 abetlen

@gjmulder or anyone else able to test the server? It's been working on my end but want an independent confirmation.

abetlen avatar Apr 16 '23 00:04 abetlen

@eiery very glad to hear!

Hopefully, the llama_state api is figured out in the base library soon and then we're really talking, then we can just restore to the longest matching saved state in an LRU cache or something.

Having such a cache would indeed be helpful, especially if you do frequent editing. You could also afford to generate multiple replies with different parameters and let the user choose which one they like best.

Honestly if that's implemented performance should be excellent until you hit the 2048 token limit and need to rotate the buffer/do tricks like summarization. I guess caching of the initial prompt will help if it's a long one but ingesting over a thousand tokens for every generation will tack on a couple minutes every time. Luckily there are smart people at llama.cpp working on that...

ghost avatar Apr 16 '23 00:04 ghost

@oobabooga @eiery Okay, I've pushed the 0.1.34 release to PyPI and the wheels should be building right now. This includes the new cache API. I'll keep this issue open to track proper cache support, and close #68.

abetlen avatar Apr 16 '23 02:04 abetlen

I have made a new test with llama-cpp-python==0.1.34 and I confirm that the second generation starts immediately when the cache is enabled. Very nice! I'm using it here https://github.com/oobabooga/text-generation-webui/commit/d2ea925fa5a0b83e607e67681f944d461a23ad24

oobabooga avatar Apr 16 '23 03:04 oobabooga

I grabbed this. Confirmed speeds are up when hitting cache. Good times. Getting ~1t/s on a 5950X with a 30b model compared to ~0.2t/s before. No errors so far.

I will say that I'd somewhat expect clicking the continue button to always hit the cache, but that has not been the case. I'm not sure if it's a context-order issue (the context isn't being updated until the next send, rather than at the end of the generation) or a more naive comparison method (comparing the entire context buffer to the most recent context, where any mismatch forces a full regen), but I would expect a cache hit when clicking continue in the webui, assuming no edits to the existing context. That could be non-trivial, but kobold.cpp's smartcontext implementation has helped there. It's a different use case (maintaining world/character data at the head of the context stack), obviously, but a chunk-based cache comparison could be valuable.

I will say, I don't know enough about how/whether context compounds, so maybe keeping later chunks unchanged would be a problem if you regenerate the first context without regenerating everything after it.

digiwombat avatar Apr 16 '23 06:04 digiwombat

Can I confirm that the cache is only for the /v1/chat/completions endpoint and not the /v1/completions endpoint?

I gave up on using the chat completions endpoint as it seemed to not understand roles and users when using Alpaca models. I'm now using alpaca-lora-30B with the completion endpoint, which is producing better responses. :man_shrugging:

gjmulder avatar Apr 16 '23 15:04 gjmulder

@gjmulder this should actually work for both APIs, is it not working for /v1/completions?

abetlen avatar Apr 17 '23 13:04 abetlen

@abetlen I might be being stupid here... how do I tell for certain that it is enabled?

I'd need to generate the same text from the same prompt with and without caching.

gjmulder avatar Apr 17 '23 14:04 gjmulder

@gjmulder For the completion endpoint you would just need to pass in the prompt + returned text as the prompt the next time you call the API.
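
For example (assuming the server is running locally on its default port, and using the OpenAI-style completion payload fields), the second request below reuses the old prompt + returned text as its prefix so the cached tokens don't need to be re-evaluated:

import requests

url = "http://localhost:8000/v1/completions"

first = requests.post(url, json={
    "prompt": "Q: What is the capital of France? A:",
    "max_tokens": 16,
}).json()
text = first["choices"][0]["text"]

# Second call: old prompt + returned text as the new prefix, plus the new question.
second = requests.post(url, json={
    "prompt": "Q: What is the capital of France? A:" + text + " Q: And of Spain? A:",
    "max_tokens": 16,
}).json()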

abetlen avatar Apr 17 '23 17:04 abetlen

Currently, the open-source models we work with aren't good enough to provide clean output with no need for correction. Is there an option to keep the cache not just for the last generated prompt, but also (or instead) for the prompt one message before? This would let the user edit the last response and regenerate messages, in exchange for a minor latency increase.

I've seen the idea of running two llama.cpp instances in parallel, where one is just used to store "state" in case it's needed, and they exchange states between each other following the user's actions.

Priestru avatar Apr 18 '23 06:04 Priestru

https://github.com/ggerganov/llama.cpp/pull/1105 This may be relevant

snxraven avatar Apr 21 '23 19:04 snxraven

The above was merged, so we should be able to set the cache as needed.

snxraven avatar Apr 22 '23 09:04 snxraven

@snxraven merged in the low-level API here too. Currently working on an implementation for LlamaCache.

abetlen avatar Apr 23 '23 14:04 abetlen