llama.cpp
[Feature Request] --prompt-cache-all + user input
I noticed --prompt-cache-all and --prompt-cache as the replacement for --session, but --prompt-cache-all does not support user input. Why not? And why not store only the tokens in the context window? I would like to resume input/output with the model from a file. This would be a sort of persistent memory, and would be awesome!
Yeah, we punted on --prompt-cache-all in interactive mode because of the complexity of properly saving the session file on the various exit paths. But it does support input in the sense of appending to the prompt in successive calls to ./main. My plan for this is to support things like long-running and persistent chats, just with main invoked for each message (which is now fast with the cache) and the context managed outside main.
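For example, roughly (a minimal sketch; the prompt text here is just a placeholder), two successive calls where the second prompt extends the first:
# first call: evaluate the prompt and save its state to the cache file
./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin -p "Once upon a time" -n 64
# second call: the new prompt starts with the old one, so only the appended tokens should need evaluating
./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin -p "Once upon a time, in a small fishing village," -n 64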
@ejones I'm confused about this as well. Could you kindly provide an example of how to use --prompt-cache-all without interactive mode? Can I use it for something like:
- generate an HTML page
- modify the generated page
- modify that generated page again
Or is this different from what I thought?
Thanks
So what you're saying is that I can quickly hack up a Bash script and have a pseudo-persistent bot?
At a basic level, the way to leverage this is to feed the output of one call to ./main back in as the prompt to the next call, optionally appending additional input. #1338 has an example of that in the testing section. That said, there are some additional considerations, including that it's up to the caller to ensure the prompt doesn't exceed the context size (in the long run I believe this will be preferable). I'm hoping to put up a Bash example of chat using prompt caches instead of --interactive that will illustrate this.
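As a rough sketch of that feedback loop (assuming main echoes the prompt followed by its generation on stdout; the cache file name is reused from below and the follow-up question is just a placeholder):
# first turn: evaluate the initial prompt, cache it, and keep stdout as the running transcript
PROMPT="$(cat ./prompts/chat-with-bob.txt)"
OUTPUT="$(./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin --prompt-cache-all -p "$PROMPT" -n 64)"
# next turn: feed the previous output back in as the prompt, with the new user input appended
PROMPT="${OUTPUT}
User: What is your first name?"
./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin --prompt-cache-all -p "$PROMPT" -n 64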
@ejones Thanks for the link, I'll test it with my own prompt; it's useful for generating stories.
@ejones I wondered why --prompt-cache-all doesn't save the last message generated by the LLM, so that we have to pass that message in again. Wouldn't it be better if it also saved the LLM's generated message, so that in a back-and-forth chat session we can just add another question instead of copying the LLM's last message?
Sorry if I'm wrong about this 😄
Yeah, I tried a version where it restored and appended to the saved prompt, but I didn't want to have to rely on the contents of the prompt cache. There's no way to inspect prompt caches (yet) and there may be cases where they don't get saved or get corrupted. So for now, the prompt argument is the source of truth and the prompt cache is just a cache.
The use case I envision for this is for a script / app to manage the chat session etc. rather than repeatedly invoking main on the command line. The example I'm preparing now will illustrate this.
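In other words, something like this (a hedged sketch; transcript.txt is a hypothetical file maintained by the caller): the full transcript is always passed in, and the cache file can be deleted at any time without breaking anything, you only lose the speedup.
# the transcript lives in a file the caller maintains, not in the cache
./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin --prompt-cache-all -f transcript.txt -n 64
# deleting the cache is safe; the same command just re-evaluates the full prompt and rebuilds it
rm -f cache.prompt.bin
./main -m models/llama-2-7b.Q4_0.gguf --prompt-cache cache.prompt.bin --prompt-cache-all -f transcript.txt -n 64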
ok @ejones thanks
Got a PR up for persistent chat: #1495. Note that it depends on #1032, still open.
@ejones Looks like the PRs have been merged. Could you explain how to use this new feature?
Good one, I'm interested in it too
For the persistent chat script, I have a PR up at #1568 with docs on its usage. For the --prompt-cache and --prompt-cache-all, the basic idea is to run ./main with those options specified, save the output (e.g., in a file or variable), append your next input, and repeat. If you do this, main should only need to evaluate from the new input onwards. Note that if you do this indefinitely, you need to track the size of the prompt and make sure it doesn't exceed the size of the context.
This usage is demonstrated in examples/chat-persistent.sh, although it might be possible to come up with an even more minimal example. I've been considering an agent-style action/observation loop example.
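One rough way to guard against exceeding the context size in a long-running loop (a sketch only; transcript.txt is hypothetical, and the word count is just a crude stand-in for the real token count, so leave plenty of headroom):
CTX=2048
# crude length check: warn well before the word count approaches the context size
if [ "$(wc -w < transcript.txt)" -gt "$((CTX / 2))" ]; then
  echo "transcript is getting long; truncate or summarize it before continuing" >&2
fi
./main -m models/llama-2-7b.Q4_0.gguf -c "$CTX" --prompt-cache cache.prompt.bin --prompt-cache-all -f transcript.txt -n 64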
@ejones So is it now unnecessary to include llama's last output in the new request when using --prompt-cache-all?
@divinity76, I do not believe it is as simple as your pseudo-example suggests. For example, when I run:
./main -ngl 84 -m models/llama-2-7b.Q4_0.gguf --color -c 4096 -n 40 -s 42 --temp 0.7 --repeat_penalty 1.1 -r "User:" --prompt-cache cache.prompt.bin --prompt-cache-all -f ./prompts/chat-with-bob.txt
I would expect to be able to append a prompt to the cached prompt. The cache.prompt.bin file does get created, which is confirmed by the message "main: saving final output to session file 'cache.prompt.bin'". However, when I run:
./main -ngl 84 -m models/llama-2-7b.Q4_0.gguf --color -c 4096 -n 40 -s 42 --temp 0.7 --repeat_penalty 1.1 -r "User:" --prompt-cache cache.prompt.bin --prompt-cache-all -p "What is your first name?"
The cached prompt is not loaded, so there is no previous context for the response to correctly answer "Bob.". The script outputs:
main: attempting to load saved session from 'cache.prompt.bin'
main: loaded a session with prompt size of 0 tokens
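If the cache only reuses tokens that form a prefix of the new prompt (my understanding, not verified here), the second call would presumably need to repeat the original prompt and append the question, roughly:
# repeat the original prompt so the cached prefix matches, then append the new question
# (note: reading the prompt with -f in the first call and -p here might tokenize slightly
# differently, e.g. around a trailing newline, which could also affect the prefix match)
./main -ngl 84 -m models/llama-2-7b.Q4_0.gguf --color -c 4096 -n 40 -s 42 --temp 0.7 --repeat_penalty 1.1 -r "User:" --prompt-cache cache.prompt.bin --prompt-cache-all -p "$(cat ./prompts/chat-with-bob.txt)
User: What is your first name?"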
Oh ok, sorry, I may be wrong and I don't have time to investigate, never mind.
This issue was closed because it has been inactive for 14 days since being marked as stale.