Implement caching for evaluated prompts
The goal of this feature is to reduce latency for repeated calls to the `chat_completion` API by saving the `kv_cache`, keyed by the prompt tokens.
The basic version of this is to simply save the `kv_state` after the prompt is evaluated.
Additionally, we should investigate whether it's possible to save and restore the `kv_state` after the completion has been generated as well.
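As a rough illustration of the basic version only (a sketch; the cache class below is plain Python and the saved state objects are opaque placeholders, not existing llama.cpp or llama-cpp-python APIs):

```python
# Sketch of a prompt-keyed kv-state cache. The "state" values are opaque
# placeholders for whatever kv serialization the bindings end up exposing.
from typing import Any, Dict, Optional, Sequence, Tuple

class PromptStateCache:
    def __init__(self) -> None:
        # Maps the exact prompt token sequence to a saved kv state.
        self._states: Dict[Tuple[int, ...], Any] = {}

    def save(self, prompt_tokens: Sequence[int], kv_state: Any) -> None:
        """Store the kv state captured right after the prompt was evaluated."""
        self._states[tuple(prompt_tokens)] = kv_state

    def lookup(self, prompt_tokens: Sequence[int]) -> Optional[Any]:
        """Return the saved state for this exact prompt, or None on a miss."""
        return self._states.get(tuple(prompt_tokens))
```

On a cache hit the model would restore the saved state and only evaluate whatever tokens follow the cached prompt; on a miss it evaluates everything and saves the resulting state.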
I don't understand why we don't just use interactive mode. Almost all the users coming from llama.cpp are used to an interface where they send a message and get a quick response, because state isn't cleared between messages, which also means there's no need to reload state. As I understood it, the KV cache was a way to store the prompt state, since the prompt is reused multiple times over the course of a conversation and caching it improves responsiveness in long conversations. Given how it's used in the base llama.cpp executable, and the fact that in the current implementation of interactive mode storing the entire conversation state wouldn't improve performance (it would only allow continuing a previous conversation in a different session), I don't know that this is something they're going to add upstream in the immediate future.
For me, being able to get completions from the bot with full context of the ongoing conversation is my main use case, so there's pretty much no situation where I would want the current conversation context cleared or reset.
I thought this was similar for the OpenAI implementation, where you send the current message but don't need to send the full message history. Any recomputation or reloading of model state hurts performance and makes this slower than the base llama.cpp implementation, imo.
After all, if people are using chat mode, then from a user perspective they want a continuous and performant chat, even if that means running models and independent contexts simultaneously, which reduces scalability in the short term without the ability to load and save states.
@MillionthOdin16 are you talking about the OpenAI server or just using the Llama class? For the actual OpenAI API each request is entirely independent of all other requests (e.g. you always send the full history to the /chat/completions endpoint), so you do need to reset the model each time. This is why I'm looking into the KV state solution, so we can just reload the state if we've seen e.g. the first n-1 messages of an n-message chat get sent over.
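For illustration, a minimal sketch of what a stateless chat request looks like (the local host/port and payload details are assumptions; adjust to your setup):

```python
# Each /v1/chat/completions call is stateless: the client resends the whole
# conversation every time. Host and port are assumptions.
import requests

history = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there, how can I help?"},
    {"role": "user", "content": "Summarize our chat so far."},  # only this message is new
]

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"messages": history},
)
print(resp.json()["choices"][0]["message"]["content"])
```

A server-side KV-state cache would let the first n-1 messages be restored instead of re-evaluated on every call.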
If you're just looking for interactive mode, I believe that's been implemented in this example, which you can use if you just want it in a program and don't care about the API: https://github.com/abetlen/llama-cpp-python/blob/main/examples/low_level_api/low_level_api_chat_cpp.py
I'm talking about the OpenAI server. My point is that, from the user's perspective, the most important factor for chat completions is response speed. Unfortunately, llama.cpp takes longer to process an initial prompt the longer it is, so for the chat completions endpoint this creates an issue: the longer the conversation, the longer it takes to get a response. The reason we have this issue is that the C++ implementation of LLaMA differs from the usual GPU implementations in how it processes the prompt before generating a response.
So I'm saying that the most efficient solution right now might be to not clear the context, keeping the session going to save that processing time on each subsequent completion. It diverges from how OpenAI implements it, but it's the only option we have right now, and chat completions isn't usable without it because it's too slow.
I'm basically advocating for a temporary hack that prevents the context from being cleared during chat completions, so that we get a significant performance boost until we either get a proper state-saving capability or the prompt processing time issue is resolved.
The issue is frustrating because we're so close to having an API that is performant and chat-capable, but a couple of things are holding it back, and I'm advocating for a temporary hack to allow good performance until we can implement this properly.
> Unfortunately, llama.cpp takes longer to process an initial prompt the longer it is
@MillionthOdin16 Is that still meaningfully so since the recent performance regressions appear to have been fixed?
- Fix proposed: https://github.com/ggerganov/llama.cpp/issues/603#issuecomment-1497569526
- https://github.com/ggerganov/llama.cpp/pull/775
- Fix seemingly confirmed: https://github.com/ggerganov/llama.cpp/issues/603#issuecomment-1498024558
- Potentially also resolved by the above: https://github.com/ggerganov/llama.cpp/issues/677#issuecomment-1502594077
- Potentially also resolved by the above: https://github.com/ggerganov/llama.cpp/issues/735#issuecomment-1502595371
> @MillionthOdin16 Is that still meaningfully so since the recent performance regressions appear to have been fixed?
So that's not what I mean in this case. I created issue 603 on llama.cpp, and now that we have that performance boost, it would be awesome to get as much of a boost in the API over here as we can. I meant the issue with the still-undetermined cause here: ggerganov/llama.cpp#719
I've seen people more familiar with LLMs mention some oddities about the initial processing in the past, but haven't seen a straightforward explanation. As I understand it, llama.cpp differs in how it processes the initial prompt before generating tokens, and it's much slower than the transformers implementation (CPU vs. GPU aside).
So I was just saying that if we get a performance boost from that avenue, as well as the ability to store a conversation's state, a proper implementation will be much faster than it is at the moment, and we wouldn't need this workaround. Hope that makes sense.
Right now there are so many small, different issues going on in the main repo that it's hard to keep track of them all, haha.
+1 to this. Many people are requesting this feature here: https://github.com/oobabooga/text-generation-webui/issues/866
It would be nice to have a `use_cache=True` flag (or something similar) in `Llama.generate`.
@oobabooga I'm still working on the caching API, but for now I've added a `reset` option for `Llama.generate` which defaults to `True`; if you want to continue from the end of a previous generation, just call with `reset=False`.
@abetlen is it safe to use `reset=False` at all times, or will that cause incorrect generations if you completely replace the prompt with a new one?
@oobabooga What `reset=False` basically does is maintain the current model context. So once you feed in a prompt with `reset=False`, the prompt remains inside llama.cpp and is kept in memory for faster generation of new output.
E.g. you feed in the prompt "My favorite season is", the model replies "spring because everything is growing. I like to walk", and generation stops. You then feed in "outside in the spring weather" (not the full prompt!), and to the model the full prompt is now "My favorite season is spring because everything is growing. I like to walk outside in the spring weather".
I tested this in your own webui :grin: by setting `reset=False` and doing generations in notebook mode; if I clear the textbox, the AI still continues chatting as if the original text were still there. The text is also generated right away with no delay, as it doesn't need to digest the prompt again!
This is getting a bit off-topic, but to implement this in the webui I think the easiest way would be to save the prompt right before sending it to generate. Then, when the user calls generate again, you can compare the saved prompt with the user's prompt to see if they are merely appending to the existing prompt or editing it. If it is an append, call with `reset=False` and send only the appended text over; otherwise send everything over and force a reset.
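A minimal sketch of that comparison (hypothetical wrapper code, not the actual webui change; the exact `tokenize`/`generate`/`detokenize` signatures and tokenization details may differ between versions, and sampling parameters such as `top_k`, `top_p`, `temp`, and `repeat_penalty` are simply passed through):

```python
# Hypothetical append-detection wrapper around Llama.generate.
class ChatSession:
    def __init__(self, llama):
        self.llama = llama
        self.context_text = ""  # everything the model currently has in its context

    def generate_text(self, prompt, max_new_tokens=128, **sampling_kwargs):
        if self.context_text and prompt.startswith(self.context_text):
            # Pure append: keep the existing context and only feed the new tail.
            tail = prompt[len(self.context_text):]
            tokens = self.llama.tokenize(tail.encode("utf-8")) if tail else []
            gen = self.llama.generate(tokens, reset=False, **sampling_kwargs)
        else:
            # Prompt was edited: re-ingest everything and reset the context.
            tokens = self.llama.tokenize(prompt.encode("utf-8"))
            gen = self.llama.generate(tokens, reset=True, **sampling_kwargs)

        # generate yields tokens indefinitely; stop after max_new_tokens.
        out_tokens = []
        for tok in gen:
            out_tokens.append(tok)
            if len(out_tokens) >= max_new_tokens:
                break
        new_text = self.llama.detokenize(out_tokens).decode("utf-8", errors="ignore")
        # The context now holds the prompt plus what was just generated.
        self.context_text = prompt + new_text
        return new_text
```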
> for now I've added a `reset` option for `Llama.generate` which defaults to `True`; if you want to continue from the end of a previous generation, just call with `reset=False`.
Will this be exposed through the REST API at some point?
@oobabooga @eiery @gjmulder this is now pushed to main; just looking for someone to test `generate` and `__call__` before publishing the PyPI release. The code is a bit of a mess right now, but the interface should remain the same.
The process to set the cache from code is:

    llama = llama_cpp.Llama(...)
    llama.set_cache(llama_cpp.LlamaCache)
Then you can call `generate` and `__call__` as usual, and if the prompt contains the previously generated tokens or the previously returned string, the cache will just continue from after those tokens/bytes.
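For instance (a usage sketch; the model path is a placeholder and the exact cache behavior may change before release):

```python
import llama_cpp

llama = llama_cpp.Llama(model_path="./models/ggml-model-q4_0.bin")  # placeholder path
llama.set_cache(llama_cpp.LlamaCache)

first = llama("My favorite season is", max_tokens=32)
text = first["choices"][0]["text"]

# Resending the old prompt plus the returned text should hit the cache, so
# only the newly appended part has to be evaluated.
second = llama("My favorite season is" + text + " My least favorite is", max_tokens=32)
```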
If you're using the REST server, it's enough to set the `CACHE=1` environment variable.
If it works like you guys expect I'll publish to PyPI tonight or tomorrow.
`$ CACHE=1 python3 -m llama_cpp.server`?
@abetlen I have made a test where I generated 80 tokens, and then generated another 80 tokens on top of the result without modifying the prompt. These were the results:
With `self.model.set_cache(LlamaCache)`:

    Output generated in 51.90 seconds (1.54 tokens/s, 80 tokens, context 761, seed 559310896)
    generate cache hit
    Output generated in 61.77 seconds (1.30 tokens/s, 80 tokens, context 841, seed 1141193019)

Without `set_cache`:

    Output generated in 51.90 seconds (1.54 tokens/s, 80 tokens, context 761, seed 1808425735)
    Output generated in 55.96 seconds (1.43 tokens/s, 80 tokens, context 841, seed 1321984670)

I can see that there is a new `generate cache hit` message, but I don't seem to get any performance improvement. Not sure if I am doing this correctly.
> `$ CACHE=1 python3 -m llama_cpp.server`?

@gjmulder Correct
@oobabooga whoops, for `generate` I implemented the check but didn't actually remove the old tokens from the list of tokens to eval. Should be fixed now.
@abetlen Caching is working well for me in your latest release :confetti_ball: .
I'm running it using a modified oobabooga UI with `self.model.set_cache(LlamaCache)` set, and generation starts instantly with no ingestion delay if the previous text is not altered. If the text is altered we get a cache miss and regenerate fully with no issues. The performance increase with caching is huge, as seen below.
Loading llama-13B-ggml...
llama.cpp weights detected: models/llama-13B-ggml/ggml-model-q4_0.bin
llama.cpp: loading model from models/llama-13B-ggml/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 73.73 KB
llama_model_load_internal: mem required = 9807.47 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Loading the extension "gallery"... Ok.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Output generated in 33.95 seconds (0.38 tokens/s, 13 tokens, context 39, seed 558843024)
generate cache hit
Output generated in 14.88 seconds (1.34 tokens/s, 20 tokens, context 61, seed 1039449246)
generate cache hit
Output generated in 12.17 seconds (1.31 tokens/s, 16 tokens, context 88, seed 523733239)
Output generated in 68.35 seconds (1.08 tokens/s, 74 tokens, context 121, seed 912952673)
generate cache hit
Output generated in 31.92 seconds (1.82 tokens/s, 58 tokens, context 210, seed 1327347234)
Output generated in 66.78 seconds (0.25 tokens/s, 17 tokens, context 349, seed 1946798230)
generate cache hit
Output generated in 24.49 seconds (1.31 tokens/s, 32 tokens, context 379, seed 429283322)
generate cache hit
Output generated in 9.80 seconds (1.12 tokens/s, 11 tokens, context 420, seed 559845450)
Output generated in 77.58 seconds (0.10 tokens/s, 8 tokens, context 472, seed 1239183125)
generate cache hit
Output generated in 17.79 seconds (1.52 tokens/s, 27 tokens, context 492, seed 2013844718)
generate cache hit
Output generated in 7.60 seconds (1.32 tokens/s, 10 tokens, context 527, seed 609475087)
Output generated in 103.58 seconds (0.19 tokens/s, 20 tokens, context 564, seed 1553215150)
@eiery very glad to hear!
Hopefully the llama_state API gets figured out in the base library soon, and then we're really talking: we could just restore to the longest matching saved state in an LRU cache or something.
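Something along these lines could do the lookup once states can actually be saved and restored (a hypothetical sketch; the cached state objects are opaque placeholders):

```python
# Hypothetical LRU of saved states keyed by token prefixes, with
# longest-matching-prefix lookup.
from collections import OrderedDict

class StateLRU:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self._states = OrderedDict()  # token-prefix tuple -> saved state

    def put(self, tokens, state):
        key = tuple(tokens)
        self._states[key] = state
        self._states.move_to_end(key)
        while len(self._states) > self.capacity:
            self._states.popitem(last=False)  # evict least recently used

    def longest_prefix(self, tokens):
        """Return (prefix_length, state) for the longest cached prefix, or (0, None)."""
        best = None
        for key in self._states:
            if len(key) <= len(tokens) and tuple(tokens[:len(key)]) == key:
                if best is None or len(key) > len(best):
                    best = key
        if best is None:
            return 0, None
        self._states.move_to_end(best)  # refresh recency on a hit
        return len(best), self._states[best]
```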
@gjmulder or anyone else able to test the server? It's been working on my end, but I want an independent confirmation.
> @eiery very glad to hear!
>
> Hopefully the llama_state API gets figured out in the base library soon, and then we're really talking: we could just restore to the longest matching saved state in an LRU cache or something.
Having such a cache would be helpful indeed, especially if you do frequent editing. You could also afford to generate multiple replies with different parameters and let the user choose the one they like best.
Honestly, if that's implemented, performance should be excellent until you hit the 2048-token limit and need to rotate the buffer or do tricks like summarization. I guess caching the initial prompt will help if it's a long one, but ingesting over a thousand tokens for every generation will tack on a couple of minutes every time. Luckily there are smart people at llama.cpp working on that...
@oobabooga @eiery Okay, I've pushed the 0.1.34 release to PyPI and the wheels should be building right now. This includes the new cache API. I'll keep this issue open to track proper cache support, and close #68.
I have made a new test with `llama-cpp-python==0.1.34` and I confirm that the second generation starts immediately when the cache is enabled. Very nice! I'm using it here: https://github.com/oobabooga/text-generation-webui/commit/d2ea925fa5a0b83e607e67681f944d461a23ad24
I grabbed this. Confirmed speeds are up when hitting the cache. Good times. Getting ~1 t/s on a 5950X with a 30B model compared to ~0.2 t/s before. No errors so far.
I will say that I'd somewhat expect clicking the continue button to always hit the cache, but that has not been the case. I'm not sure if it's a context-order issue (the context not being updated until the next send, rather than at the end of the generation) or a more naive comparison method (comparing the entire context buffer to the most recent context and letting any mismatch force a full regen), but I would expect a cache hit when clicking continue in the webui, assuming no edits to existing context. That could be non-trivial, but kobold.cpp's smartcontext implementation has helped there. It's a different use case (maintaining world/character data at the head of the context stack), obviously, but a chunk-based cache comparison could be valuable.
I will say, I don't know enough about how or whether context compounds, so maybe keeping later chunks unchanged would be a problem if you regenerate the first context without regenerating everything after it.
Can I confirm that the cache is only for the `/v1/chat/completions` endpoint and not the `/v1/completions` endpoint?
I gave up on using the chat completions endpoint as it seemed to not understand roles and users when using Alpaca models. I'm now using `alpaca-lora-30B` with the completion endpoint, which is producing better responses. :man_shrugging:
@gjmulder this should actually work for both APIs; is it not working for `/v1/completions`?
@abetlen I might be being stupid here... how do I tell for certain that it is enabled?
I'd need to generate the same text from the same prompt with and without caching.
@gjmulder For the completion endpoint you would just need to pass in the `prompt + returned text` as the prompt the next time you call the API.
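Concretely, something like this sketch (host, port, and prompts are placeholder assumptions):

```python
# Reuse the server-side cache via /v1/completions by resending the previous
# prompt plus its returned text. Host/port are assumptions.
import requests

url = "http://localhost:8000/v1/completions"

prompt = "Q: Name three uses for a KV cache.\nA:"
r1 = requests.post(url, json={"prompt": prompt, "max_tokens": 64})
answer = r1.json()["choices"][0]["text"]

# Second request: old prompt + old output + new text, so the cached prefix
# does not need to be re-evaluated.
followup = prompt + answer + "\nQ: Which of those matters most here?\nA:"
r2 = requests.post(url, json={"prompt": followup, "max_tokens": 64})
print(r2.json()["choices"][0]["text"])
```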
Currently, the open-source models we work with are not good enough to produce output so clean that it never needs correction. Is there an option to keep the cache not just for the last generated prompt, but also (or instead) for the prompt from one message before? This would let the user edit the last response and regenerate messages in exchange for a minor latency increase.
I've seen the idea of running two llama.cpp instances in parallel, where one is just used to store "state" in case it's needed, and they exchange states between each other following the user's actions.
This may be relevant: https://github.com/ggerganov/llama.cpp/pull/1105
The above was merged, so we should be able to set the cache as needed.
@snxraven merged in the low-level API here too. Currently working on an implementation for `LlamaCache`.