Implementation of prompt caching

Open · setzer22 opened this issue on Mar 15 '23 · 2 comments

This was simpler than I expected, so time to sneak in a little PR :)

This implements the same feature described in https://github.com/ggerganov/llama.cpp/issues/64

Basically, this adds an API to access the memory tensors in the model, plus a couple of functions to save them to disk and load them back.
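
To make the shape of that API concrete, here is a minimal sketch of a save/load pair over raw memory bytes. The function names and the on-disk layout are illustrative assumptions, not the PR's actual code:

    use std::fs::File;
    use std::io::{self, Read, Write};

    // Illustrative only: the real PR exposes the memory tensors through the model API;
    // this just shows one way the raw bytes could be written to and read from disk.
    fn save_memory(path: &str, memory_k: &[u8], memory_v: &[u8]) -> io::Result<()> {
        let mut file = File::create(path)?;
        // Store the key tensor's length first so the two tensors can be split again on load.
        file.write_all(&(memory_k.len() as u64).to_le_bytes())?;
        file.write_all(memory_k)?;
        file.write_all(memory_v)?;
        Ok(())
    }

    fn load_memory(path: &str) -> io::Result<(Vec<u8>, Vec<u8>)> {
        let mut file = File::open(path)?;
        let mut len_buf = [0u8; 8];
        file.read_exact(&mut len_buf)?;
        let k_len = u64::from_le_bytes(len_buf) as usize;
        let mut memory_k = vec![0u8; k_len];
        file.read_exact(&mut memory_k)?;
        let mut memory_v = Vec::new();
        file.read_to_end(&mut memory_v)?;
        Ok((memory_k, memory_v))
    }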

I also added --cache-prompt and --restore-prompt flags for llama-cli:

  • When running with --cache-prompt, no inference is run: the program feeds the model the prompt and then dumps the memory contents into a file.
  • When running with --restore-prompt, the memory contents are read back from the file at the given path, and the prompt you pass is appended right after the cached prompt.

Note that this PR builds on top of #10, so the diff will not look right until that one is merged. In the "Files changed" tab, filter to only the commits from this PR.

I have tested the changes, but since this is a non-trivial change, I'd like to make sure I didn't break anything before merging. Anyone interested, please pull the branch and test it. Here are some minimal instructions.

  1. Run the model with an incomplete prompt; you can use the text below as the contents of your prompt file:
RUSTFLAGS='-C target-feature=+avx2,+fma,+f16c'  cargo run --release -- -m /data/Llama/LLaMA/7B/ggml-model-q4_0.bin -f <path_to_prompt_text_file> --cache-prompt <path_to_cache_file>
The following text is a transcript for a conversation between a human user and The Assistant. The Assistant is a smart, reliable, caring, confident chatbot which is based on advanced AI technology and is capable of replying to the user's messages in a truthful and understanding way.

The transcript consists of an exchange of messages. User messages begin with the word USER, while The Assistant's messages start by using the word ASSISTANT.

=== BEGIN TRANSCRIPT ===
USER: Explain what a game engine is like a 5 year old
ASSISTANT:
  2. Run the command again, this time passing the --restore-prompt flag and a smaller prompt:
RUSTFLAGS='-C target-feature=+avx2,+fma,+f16c'  cargo run --release -- -m /data/Llama/LLaMA/7B/ggml-model-q4_0.bin -p "A game engine" --restore-prompt <path_to_cache_file> 

The expected behavior is that the second run starts predicting right where the cached prompt left off, and you only pay for the time it takes to process the extra prompt :)

setzer22 · Mar 15 '23 20:03

Oh, before I forget. I also tried using the snap crate for compression. Some quick results:

  • Compression ratios for the prompt shared above look good: a cached prompt is around 500 MB uncompressed, but only about 250 MB compressed. However, this ratio may just be an artifact of the long run of zeroes at the end of the (unfilled) memory.
  • Loading time is negligible for uncompressed data, but quite noticeable (~2 seconds) for compressed data.

So, all things considered, I decided not to bother with compression for now. We can always add it later.
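
If we do add it later, wiring in the snap crate should be a small change. A hedged sketch, where the snapshot bytes and file path are placeholders for whatever we end up serializing:

    use std::fs::File;
    use std::io::{self, Read, Write};

    use snap::read::FrameDecoder;
    use snap::write::FrameEncoder;

    // Write already-serialized snapshot bytes through a Snappy frame encoder.
    fn save_compressed(path: &str, snapshot_bytes: &[u8]) -> io::Result<()> {
        let file = File::create(path)?;
        let mut encoder = FrameEncoder::new(file);
        encoder.write_all(snapshot_bytes)?;
        encoder.flush()?; // make sure the last frame hits the disk
        Ok(())
    }

    // Read and decompress the snapshot bytes back into memory.
    fn load_compressed(path: &str) -> io::Result<Vec<u8>> {
        let file = File::open(path)?;
        let mut decoder = FrameDecoder::new(file);
        let mut snapshot_bytes = Vec::new();
        decoder.read_to_end(&mut snapshot_bytes)?;
        Ok(snapshot_bytes)
    }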

setzer22 · Mar 15 '23 21:03

Looks good to me. Would this also enable continuing an existing generation that ran out before its end-of-text?

Not directly, but it would be a step in the right direction. Basically, we would need to manipulate the memory and drop the first tokens to free up space to continue generation.

I'm thinking that adding an API for manipulating the tokens in the memory would open up very interesting use cases for experimentation 🤔

With such an API you could, e.g., keep the beginning of the conversation, insert some (...) tokens in the middle, and keep the part at the end: that frees up room to continue the generation while the model still roughly retains the context of the conversation. We need to try this!
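
Until such an in-memory API exists, the same trick can be approximated by pruning the token history and re-feeding it, rebuilding the memory from scratch. A rough sketch of the idea, with a made-up TokenId alias and ellipsis token:

    // Hypothetical token type for illustration; the crate's real token type may differ.
    type TokenId = i32;

    // Keep the first `keep_head` and last `keep_tail` tokens, replacing the middle with a
    // single "(...)" stand-in token. Feeding the pruned sequence back through the model
    // rebuilds the memory with room left over to continue generating.
    fn prune_context(tokens: &[TokenId], keep_head: usize, keep_tail: usize, ellipsis: TokenId) -> Vec<TokenId> {
        if tokens.len() <= keep_head + keep_tail {
            return tokens.to_vec();
        }
        let mut pruned = Vec::with_capacity(keep_head + 1 + keep_tail);
        pruned.extend_from_slice(&tokens[..keep_head]);
        pruned.push(ellipsis);
        pruned.extend_from_slice(&tokens[tokens.len() - keep_tail..]);
        pruned
    }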

setzer22 · Mar 16 '23 08:03

Alright, this was some major cleanup session! :smile:

As discussed, I've broken the old infer_with_prompt function down into two functions: feed_prompt and infer_next_token. This helps untangle the mess the original function had become. A Model now offers a "poll"-based API: when you want to produce a token or feed some more prompt, you just call the corresponding function. It is now up to the caller to stop on "end of text", or to do something else entirely. :smile:
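
To show what the poll-based flow looks like from the caller's side, here is a minimal sketch. The exact signatures, the OutputToken enum, and the error type are assumptions for illustration and may not match the PR:

    // Assumed types: Model, Vocabulary, OutputToken, and InferenceError stand in for
    // whatever the PR actually names them.
    fn generate(model: &mut Model, vocab: &Vocabulary, prompt: &str) -> Result<String, InferenceError> {
        let mut output = String::new();

        // Feed the whole prompt first; no tokens are sampled during this phase.
        model.feed_prompt(vocab, prompt)?;

        // The caller owns the loop: keep polling for tokens and decide when to stop.
        loop {
            match model.infer_next_token(vocab)? {
                OutputToken::Token(text) => output.push_str(&text),
                OutputToken::EndOfText => break,
            }
        }
        Ok(output)
    }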

As part of this refactor, I've also had to move some of the things that were previously declared inside infer_with_prompt into the Model struct. Functions now take fewer arguments and rely more on updating internal state. Things like last_n_tokens or n_past are now stored as part of the model.

Finally, I've refactored the Memory structs from the previous iteration of this PR into InferenceSnapshot, which stores more than just the memory (for example, the last_n_tokens and the logits vector). This ensures that, when a snapshot is restored, inference resumes at the exact point where it left off.
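
Roughly, the snapshot needs to carry everything mentioned above; a sketch with illustrative field names and types, not the PR's exact definition:

    // Illustrative layout only; the actual struct and types in the PR may differ.
    struct InferenceSnapshot {
        memory_k: Vec<u8>,       // serialized key memory tensor
        memory_v: Vec<u8>,       // serialized value memory tensor
        n_past: usize,           // number of tokens already fed to the model
        last_n_tokens: Vec<i32>, // recent-token window (e.g. for a repetition penalty)
        logits: Vec<f32>,        // logits from the last evaluation, so sampling can resume
    }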

setzer22 · Mar 17 '23 21:03

great job!

xingchensong · May 11 '23 02:05