Implementation of prompt caching
This was simpler than I expected, so time to sneak in a little PR :)
This implements the same feature described in https://github.com/ggerganov/llama.cpp/issues/64
Basically, this adds an API to access the memory tensors in the model, and adds a couple functions to save them and load them from disk.
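To make the idea concrete, here is a rough sketch of what saving and loading the memory tensors could look like. This is only an illustration: the type and function names below (MemoryBuffers, save_to_disk, load_from_disk) are placeholders, not the actual API added in this PR, and the real code works directly with the ggml memory tensors rather than plain byte vectors.

```rust
// Hypothetical sketch, not the API in this PR.
use std::fs::File;
use std::io::{self, Read, Write};

/// Raw contents of the model's k/v memory tensors.
pub struct MemoryBuffers {
    pub memory_k: Vec<u8>,
    pub memory_v: Vec<u8>,
}

impl MemoryBuffers {
    /// Dump the memory tensors to a file so a processed prompt can be restored later.
    pub fn save_to_disk(&self, path: &str) -> io::Result<()> {
        let mut file = File::create(path)?;
        // Length-prefix each tensor so the two buffers can be read back unambiguously.
        file.write_all(&(self.memory_k.len() as u64).to_le_bytes())?;
        file.write_all(&self.memory_k)?;
        file.write_all(&(self.memory_v.len() as u64).to_le_bytes())?;
        file.write_all(&self.memory_v)?;
        Ok(())
    }

    /// Read previously cached memory tensors back from disk.
    pub fn load_from_disk(path: &str) -> io::Result<Self> {
        let mut file = File::open(path)?;
        let read_chunk = |file: &mut File| -> io::Result<Vec<u8>> {
            let mut len = [0u8; 8];
            file.read_exact(&mut len)?;
            let mut buf = vec![0u8; u64::from_le_bytes(len) as usize];
            file.read_exact(&mut buf)?;
            Ok(buf)
        };
        Ok(Self {
            memory_k: read_chunk(&mut file)?,
            memory_v: read_chunk(&mut file)?,
        })
    }
}
```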
I also added --cache-prompt and --restore-prompt flags for llama-cli:
- When running with `--cache-prompt`, no inference is run, and the program dumps the memory contents into a file after feeding it the prompt.
- When running with `--restore-prompt`, the contents of the memory are read from disk at the given path, and the prompt you feed the system is concatenated right after the cached prompt.
Note this PR builds on top of #10, so the diff will not look correct until that is merged. You have to filter to only look at the commits from this PR when you go to the "Files changed" tab.
I have tested the changes, but since this is a non-trivial change, I'd like to make sure I didn't break anything before merging. If anyone is interested, please pull the branch and test it before we merge. Here are some minimal instructions.
- Run the model with an incomplete prompt; you can use the text below as an example:
```sh
RUSTFLAGS='-C target-feature=+avx2,+fma,+f16c' cargo run --release -- -m /data/Llama/LLaMA/7B/ggml-model-q4_0.bin -f <path_to_prompt_text_file> --cache-prompt <path_to_cache_file>
```
```
The following text is a transcript for a conversation between a human user and The Assistant. The Assistant is a smart, reliable, caring, confident chatbot which is based on advanced AI technology and is capable of replying to the user's messages in a truthful and understanding way.
The transcript consists of an exchange of messages. User messages begin with the word USER, while The Assistant's messages start by using the word ASSISTANT.
=== BEGIN TRANSCRIPT ===
USER: Explain what a game engine is like a 5 year old
ASSISTANT:
```
- Run the command again, this time passing the `--restore-prompt` flag and a smaller prompt:
```sh
RUSTFLAGS='-C target-feature=+avx2,+fma,+f16c' cargo run --release -- -m /data/Llama/LLaMA/7B/ggml-model-q4_0.bin -p "A game engine" --restore-prompt <path_to_cache_file>
```
The observed behavior should be that the second time, the system starts predicting just where it left off, and you only pay for the time it takes to parse the extra prompt :)
Oh, before I forget. I also tried using the snap crate for compression. Some quick results:
- Compression ratios for the prompt I'm sharing above look good: a cached prompt is around 500MB, but only ~250MB when compressed. However, this compression ratio may just be an artifact of having a ton of zeroes at the end of the (unfilled) memory.
- Loading time is negligible for uncompressed data, but was quite noticeable (~2 seconds) when compressed.
So, all things considered, I decided to not bother with compression right now. We can always add it later.
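For reference, the experiment was along these lines: a minimal sketch using the snap crate's streaming FrameEncoder/FrameDecoder, not the exact code that produced the numbers above, and snapshot_bytes is just a placeholder for the serialized memory.

```rust
// Minimal sketch of the snap experiment; not the code that was benchmarked.
use std::fs::File;
use std::io::{self, Read, Write};

/// Write the raw snapshot bytes through snap's streaming frame encoder.
fn write_compressed(path: &str, snapshot_bytes: &[u8]) -> io::Result<()> {
    let mut encoder = snap::write::FrameEncoder::new(File::create(path)?);
    encoder.write_all(snapshot_bytes)?;
    encoder.flush()?; // make sure the last buffered frame hits the file
    Ok(())
}

/// Read the snapshot back, decompressing on the fly.
fn read_compressed(path: &str) -> io::Result<Vec<u8>> {
    let mut decoder = snap::read::FrameDecoder::new(File::open(path)?);
    let mut bytes = Vec::new();
    decoder.read_to_end(&mut bytes)?;
    Ok(bytes)
}
```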
Looks good to me. Would this also enable continuing an existing generation that ran out before reaching its end-of-text token?
Not directly, but it would be a step in the right direction. Basically, we would need to manipulate the memory to drop the first tokens, so that we free up space to continue generation.
I'm thinking, adding an API for manipulating the tokens in the memory would lead to very interesting use cases for experimentation 🤔
With such an API you could, e.g., keep the beginning of the conversation, insert some (...) tokens in the middle, then keep the part at the end. That makes some room to continue the generation while the model still kinda keeps the context of the conversation. We need to try this!
Alright, this was some major cleanup session! :smile:
As discussed, I've broken down the old infer_with_prompt function into two functions: feed_prompt and infer_next_token. This helps untangle the mess the original function had become. A Model now offers a "poll"-based API: when you want to produce a token, or feed some more prompt, you just call the corresponding function. It is now up to the caller to stop on "end of text" or do anything else entirely. :smile:
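To give an idea of the shape of the new API, a caller now drives inference roughly like the sketch below. The signatures are simplified: the real functions take more parameters (vocabulary, RNG, sampling parameters, ...), and END_OF_TEXT and print_token are hypothetical stand-ins for the real end-of-text check and token-to-text conversion.

```rust
// Illustrative sketch only: simplified signatures, hypothetical helpers.
fn run(model: &mut Model, prompt: &str, max_tokens: usize) {
    // Phase 1: feed the prompt. No sampling happens here; the model just
    // processes the prompt and updates its internal state
    // (n_past, last_n_tokens, the memory tensors, ...).
    model.feed_prompt(prompt);

    // Phase 2: poll for new tokens, one per call. Stopping on end-of-text,
    // after a fixed budget, or never at all is now the caller's decision.
    for _ in 0..max_tokens {
        let token = model.infer_next_token();
        if token == END_OF_TEXT {
            break; // placeholder for whatever the real end-of-text check is
        }
        print_token(token); // placeholder: convert the token id to text and print it
    }
}
```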
As part of this refactor, I've also had to move some of the things that were previously declared inside infer_with_prompt into the Model struct. Functions now take fewer arguments and rely more on updating internal state; things like last_n_tokens or n_past are now stored as part of the model.
Finally, I've refactored the Memory structs from the previous iteration of this PR into InferenceSnapshot, which stores more things than just the memory (for example, the last_n_tokens and the logits vector). This ensures inference will resume at the exact same point where it left off when restoring a snapshot.
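Roughly, the snapshot now bundles everything needed to resume; the sketch below is approximate rather than the exact definition in the PR, and the field names/types are my shorthand.

```rust
// Approximate shape only; the actual struct may differ in names, types,
// and in how the memory tensors are serialized.
pub struct InferenceSnapshot {
    /// Raw bytes of the model's k/v memory tensors.
    pub memory_k: Vec<u8>,
    pub memory_v: Vec<u8>,
    /// Number of tokens already evaluated, so generation resumes at the same position.
    pub n_past: usize,
    /// Recent token ids, so the repetition penalty behaves identically after restore.
    pub last_n_tokens: Vec<i32>,
    /// Logits from the last evaluation, so the next sampled token matches what an
    /// uninterrupted run would have produced.
    pub last_logits: Vec<f32>,
}
```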
great job!