llama.cpp
main: add the possibility to open the prompt cache read-only
The prompt cache provides a nice speed-up when the same prompt prefix is used across multiple evaluations, but it is also updated on every use, which is not always desirable. One use case is a large prompt whose first part contains context and usage rules, and whose second part contains the variable data of the problem being studied. In this case it is desirable to save the first part once and always reuse it as-is, without it being overwritten by the second part.
The new argument --prompt-cache-ro enables a read-only mode for the prompt cache: the part of the prompt that matches the cache is loaded from it, but the cache file itself is never modified. This reduced a total analysis time here from 112s to 49.7s, without having to back up and restore a copy of the prompt cache, which takes significant time at 500 MB.
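For illustration, usage could look something like this (model path, file names and -n values are placeholders; full_prompt.txt is assumed to start with the exact contents of static_context.txt):

```sh
# Build the cache once from the static prefix (the cache file is written here)
./main -m models/7B/ggml-model-q4_0.bin \
  --prompt-cache context.cache -f static_context.txt -n 1

# Reuse the cached prefix on every analysis; with --prompt-cache-ro the cache
# file stays untouched even though the prompt continues past the cached prefix
./main -m models/7B/ggml-model-q4_0.bin \
  --prompt-cache context.cache --prompt-cache-ro \
  -f full_prompt.txt -n 256
```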
LGTM! I'll hold off on the accept for now in case someone else has objections. One thought: --prompt-cache-all doesn't seem to make sense in conjunction with this new option; I wonder if we should fail or warn if used together?
Thanks. Regarding the incompatibility with --prompt-cache-all, it's just a matter of taste. If there's demand for it, I can update the patch. I just tend to think (as a user) that when option combinations cause failures, they're harder to use from scripts, which then have to replicate the internal logic.
Is it possible to use mmap to make cache loading nearly instant? The prompt cache file will already be in the OS filesystem cache in the cases where it was created right before.
Use-case: generate MANY variants of answers to the same question by caching the prompt and then running inference several times with a read-only cache!
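Sketched with the main example's flags (model path, file names and seeds are placeholders), that could look like:

```sh
# Prime the cache once with the shared question
./main -m models/7B/ggml-model-q4_0.bin \
  --prompt-cache question.cache -f question.txt -n 1

# Sample several answers; the cache is only read, never rewritten
for seed in 1 2 3 4 5; do
  ./main -m models/7B/ggml-model-q4_0.bin \
    --prompt-cache question.cache --prompt-cache-ro \
    -f question.txt -n 128 -s "$seed" > answer_"$seed".txt
done
```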
Also, pingback: https://github.com/ggerganov/llama.cpp/issues/1585
I guess it could indeed help a bit, though the files are not huge, and for me the total processing time is largely dominated by the complementary prompt time and the eval time. Also I have no idea whether the cache is used as-is or consumed and transformed on the fly, though I would imagine the latter. In that case mmap would likely not bring much (well, it could avoid a data duplication and thus save a few hundred MB of RAM, that's all).
For me, cache files often grow to around 1 GB. If there is other processing (reading of structs, for example) that could be adapted to accept the entire cache mapped in memory (instead of only parts of it with additional file reads), it would further increase loading speed. (I haven't looked at the code myself either…)
Hard to say how much of a difference this will make, but with bigger prompts on smaller systems it could be meaningful. With PR #1461 (nvidia-docker support) pending integration, there is now a realistic prospect of integrating llama.cpp into cloud-based nvidia-docker micro-services, and it seems very plausible that one use of these would be to expose an API-fronted service that does one focused thing (perhaps with a complex static prompt prefix, followed by a variable prompt section provided by the request). For any such micro-service, query throughput really matters, so this sort of time-saver (even if it seems quite small) might become quite significant.
So @ejones, I'm keen for this (and #1461) to be merged if they don't collide with any critical functionality and nobody has raised objections (or at least, if you might gently nudge those who might object and are busy, so that the author can make any further adjustments.. pretty please :-) ).
Taking a strategic view, it seems to me that features 'at the edges', if they are suitable for the project and not considered out of scope, should ideally be merged sooner rather than later once they meet the code-compatibility and quality standards. Merging sooner lets any indirect interference be spotted early and fixed with the input of the PR author. If things are delayed too long, I can imagine something in the core eventually changing and blocking the merge by the time the team gets a chance to look at paused PRs again; worst case, the original author has moved on and is no longer available to do any tidy-up, placing that burden on the core team, or else the functionality is lost.
Ah, you're right, sorry. This fell off my radar.
Re: mmap, I think that's a reasonable direction. When I implemented the session/prompt cache I just didn't have the confidence to take an mmap approach. And yes, there is currently a transformation where the KV cache is compressed, so an mmap approach might involve a tradeoff where the KV cache is persisted as-is, at its maximum size. Just speculating here.
Another use-case: having several pre-made system prompts to orchestrate llama's responses.
For example, I want to take a user action string in plain text, separate it into distinct commands, then process each command according to its nature, and finally decide how to execute them. (Something like hypothetical voice control over smart-home devices, as part of a bigger pipeline.)
I tried to do a proof of concept of this, and I had to write HUGE prompts (more than 1000 tokens) describing what I want from the model at each step. For example, the first prompt would be about splitting free-form phrases into concise abstract commands, one per line; then the caller script would take each line and run the second prompt on it, this time asking the model about the nature of the command (is it executing an action, getting a status, etc.) and making sure it was really a command and not a jailbreak prank from the user.
In the next prompt I tried to give the model a list of known devices and ask which one was mentioned in the user command, but this failed badly (7B and 13B cannot work on large ordered lists reliably). It looks like I should run the prompt for each available device, asking "is it about this one, YES or NO?" to shorten the list of possible devices, and then ask again with another prompt.
This already requires running inference many, many times to sort everything out and get a meaningful, reliable response! It can be repeated further on the same input to increase confidence.
I repeat: this use case consists of switching between several large constant prompts (which should be cached), each used with small arbitrary input lines appended to it (which should not be written back to the cache, since that only wastes time). Here mmap would allow near-instant loading, provided there is enough RAM for the filesystem to cache the files too.
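As a rough sketch of how that pipeline could drive main under these assumptions (model path, file names and prompts are placeholders, each .cache file is assumed to have been primed beforehand with its constant prompt only, and clean-up of main's output is omitted):

```sh
USER_INPUT="turn off the kitchen light and tell me the temperature"

# Stage 1: split the free-form request into one command per line,
# using the large cached "split" prompt read-only
./main -m models/7B/ggml-model-q4_0.bin \
  --prompt-cache split.cache --prompt-cache-ro \
  -p "$(cat split_prompt.txt) ${USER_INPUT}" -n 64 > commands.txt

# Stage 2: classify each command with a second cached prompt,
# again without ever writing the cache back
while read -r cmd; do
  ./main -m models/7B/ggml-model-q4_0.bin \
    --prompt-cache classify.cache --prompt-cache-ro \
    -p "$(cat classify_prompt.txt) ${cmd}" -n 32
done < commands.txt
```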