llama.cpp
[Feature Request] --prompt-cache-all + user input
I noticed --prompt-cache-all and --prompt-cache as the replacement for --session, but --prompt-cache-all does not support user input. Why not? And why not store only the tokens in the context window? I would like to resume input/output with the model from a file. This would be a sort of persistent memory, and would be awesome!
Yeah, we punted on --prompt-cache-all in interactive mode because of the complexity of properly saving the session file on the various exit paths. But it does support input in the sense of appending to the prompt in successive calls to ./main. My plan for this is to support things like long-running and persistent chats, just with main invoked for each message (which is now fast with the cache) and the context managed outside main.
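For example, roughly (a minimal sketch; the prompt text here is just a placeholder), two successive calls where the second prompt extends the first:
# first call: evaluate the prompt and save its state to the cache file
./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin -p "Once upon a time" -n 64
# second call: the new prompt starts with the old one, so only the appended tokens should need evaluating
./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin -p "Once upon a time, in a small fishing village," -n 64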
@ejones I'm confused about this as well. Could you kindly provide an example of how to use --prompt-cache-all without interactive mode? Can I use it for something like:
- generate an HTML page
- modify the generated page
- modify that generated page again
Or is this different from what I thought?
Thanks
So what you're saying is that I can quickly hack up a Bash script and have a pseudo-persistent bot?
At a basic level, the way to leverage this is to feed the output of one call to ./main back in as the prompt to the next call, optionally appending additional input. #1338 has an example of that in the testing section. That said, there are some additional considerations, including that it's up to the caller to ensure the prompt doesn't exceed the context size (in the long run I believe this will be preferable). I'm hoping to put up a Bash example of chat using prompt caches instead of --interactive that will illustrate this.
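As a rough sketch of that feedback loop (assuming main echoes the prompt followed by its generation on stdout; the cache file name is reused from below and the follow-up question is just a placeholder):
# first turn: evaluate the initial prompt, cache it, and keep stdout as the running transcript
PROMPT="$(cat ./prompts/chat-with-bob.txt)"
OUTPUT="$(./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin --prompt-cache-all -p "$PROMPT" -n 64)"
# next turn: feed the previous output back in as the prompt, with the new user input appended
PROMPT="${OUTPUT}
User: What is your first name?"
./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin --prompt-cache-all -p "$PROMPT" -n 64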
@ejones Thanks for the link, I'll test it with my own prompt; it's useful for generating stories.
@ejones I wondered why --prompt-cache-all doesn't save the last message generated by the LLM, so that we have to pass that message in again. Wouldn't it be better if it also saved the LLM's generated message, so that in a back-and-forth chat session we can just add another question instead of copying the LLM's last message?
Sorry if I'm wrong about this 😄
Yeah, I tried a version where it restored and appended to the saved prompt, but I didn't want to have to rely on the contents of the prompt cache. There's no way to inspect prompt caches (yet) and there may be cases where they don't get saved or get corrupted. So for now, the prompt argument is the source of truth and the prompt cache is just a cache.
The use case I envision for this is for a script / app to manage the chat session etc. rather than repeatedly invoking main on the command line. The example I'm preparing now will illustrate this.
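In other words, something like this (a hedged sketch; transcript.txt is a hypothetical file maintained by the caller): the full transcript is always passed in, and the cache file can be deleted at any time without breaking anything, you only lose the speedup.
# the transcript lives in a file the caller maintains, not in the cache
./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin --prompt-cache-all -f transcript.txt -n 64
# deleting the cache is safe; the same command just re-evaluates the full prompt and rebuilds it
rm -f cache.prompt.bin
./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin --prompt-cache-all -f transcript.txt -n 64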
ok @ejones thanks
Got a PR up for persistent chat: #1495. Note that it depends on #1032, still open.
@ejones Looks like the PRs have been merged. Could you explain how to use this new feature?
Good one, I'm interested in it too
For the persistent chat script, I have a PR up at #1568 with docs on its usage. For the --prompt-cache and --prompt-cache-all, the basic idea is to run ./main with those options specified, save the output (e.g., in a file or variable), append your next input, and repeat. If you do this, main should only need to evaluate from the new input onwards. Note that if you do this indefinitely, you need to track the size of the prompt and make sure it doesn't exceed the size of the context.
This usage is demonstrated in examples/chat-persistent.sh, although it might be possible to come up with an even more minimal example. I've been considering an agent-style action/observation loop example.
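One rough way to guard against exceeding the context size in a long-running loop (a sketch only; transcript.txt is hypothetical, and the word count is just a crude stand-in for the real token count, so leave plenty of headroom):
CTX=2048
# crude length check: warn well before the word count approaches the context size
if [ "$(wc -w < transcript.txt)" -gt "$((CTX / 2))" ]; then
  echo "transcript is getting long; truncate or summarize it before continuing" >&2
fi
./main -m models/llama-2-7b.Q4_0.gguf -c "$CTX" --prompt-cache cache.prompt.bin --prompt-cache-all -f transcript.txt -n 64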
@ejones So is it now unnecessary to include llama's last output in the new request when using --prompt-cache-all?
@divinity76, I do not believe it is as simple as your pseudo-example suggests. For example, when I run:
./main -ngl 84 -m models/llama-2-7b.Q4_0.gguf --color -c 4096 -n 40 -s 42 --temp 0.7 --repeat_penalty 1.1 -r "User:" --prompt-cache cache.prompt.bin --prompt-cache-all -f ./prompts/chat-with-bob.txt
I would expect to be able to append a prompt to the cached prompt. The cache.prompt.bin file does get created, which is confirmed by the message "main: saving final output to session file 'cache.prompt.bin'". However, when I run:
./main -ngl 84 -m models/llama-2-7b.Q4_0.gguf --color -c 4096 -n 40 -s 42 --temp 0.7 --repeat_penalty 1.1 -r "User:" --prompt-cache cache.prompt.bin --prompt-cache-all -p "What is your first name?"
The cached prompt is not loaded, so there is no previous context for the response to correctly answer "Bob.". The script outputs:
main: attempting to load saved session from 'cache.prompt.bin'
main: loaded a session with prompt size of 0 tokens
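If the cache only reuses tokens that form a prefix of the new prompt (my understanding, not verified here), the second call would presumably need to repeat the original prompt and append the question, roughly:
# repeat the original prompt so the cached prefix matches, then append the new question
# (note: reading the prompt with -f in the first call and -p here might tokenize slightly
# differently, e.g. around a trailing newline, which could also affect the prefix match)
./main -ngl 84 -m models/llama-2-7b.Q4_0.gguf --color -c 4096 -n 40 -s 42 --temp 0.7 --repeat_penalty 1.1 -r "User:" --prompt-cache cache.prompt.bin --prompt-cache-all -p "$(cat ./prompts/chat-with-bob.txt)
User: What is your first name?"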
Oh ok, sorry, I may be wrong and I don't have time to investigate, never mind.
This issue was closed because it has been inactive for 14 days since being marked as stale.