Evan Jones
@ggerganov that makes sense, thanks! I did notice there was a long stream of zeroes in the files. @Priestru not sure, perhaps there's a higher cost to the initial token(s)...
LGTM! I'll hold off on the accept for now in case someone else has objections. One thought: `--prompt-cache-all` doesn't seem to make sense in conjunction with this new option; I...
Ah, you're right, sorry. This fell off my radar. Re: mmap, I think that's a reasonable direction. When I implemented the session/prompt cache I just didn't have the confidence in...
I poked at this a bit this morning and tried increasing the copy ctx size slightly, but that doesn't seem to be the issue. It does seem like the new tensor...
This seems to get the prompt cache working at least:
```
diff --git a/ggml.c b/ggml.c
index 34212b8..62ac19f 100644
--- a/ggml.c
+++ b/ggml.c
@@ -5975,12 +5975,12 @@ struct ggml_tensor * ggml_view_3d(...
```
I agree this is surprising. I believe the newline stripping happens [here](https://github.com/ggerganov/llama.cpp/blob/7d873811f31d4d8c909015c946a862c0089cda7d/examples/common.cpp#L146-L148) when handling the `--file` argument. My impression is this was done to simplify chat-style prompts stored in files,...
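For reference, a rough sketch of the kind of strip I mean (the exact code in `common.cpp` may differ; this is only to illustrate the behavior):

```cpp
// Illustrative sketch only -- the exact code in examples/common.cpp may differ.
// When a prompt is loaded via --file, a single trailing newline (which most
// editors append) is dropped so it doesn't leak into the prompt text.
#include <fstream>
#include <iterator>
#include <string>

std::string load_prompt_file(const std::string & path) {
    std::ifstream file(path);
    std::string prompt((std::istreambuf_iterator<char>(file)),
                       std::istreambuf_iterator<char>());
    if (!prompt.empty() && prompt.back() == '\n') {
        prompt.pop_back();  // drop the trailing newline
    }
    return prompt;
}
```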
This is super cool. As maybe a future direction, I've been wondering if things like repetition penalty, logit bias, this, and reverse prompt/stop words can all be generalized as something...
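To make the idea concrete, here's a very rough sketch (the names are made up; nothing like this exists in the tree today) of what a generalized hook could look like, with repetition penalty and logit bias both expressed against the same interface:

```cpp
// Hypothetical sketch -- none of these types exist in llama.cpp today.
// The idea: every sampling-time constraint is a function that inspects the
// tokens generated so far and adjusts the candidate logits before sampling.
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using llama_token = int32_t;

// A constraint sees the history and may modify the logits in place.
using logit_processor = std::function<void(const std::vector<llama_token> & history,
                                           std::vector<float> & logits)>;

// Repetition penalty expressed as one such processor.
logit_processor make_repetition_penalty(float penalty, size_t last_n) {
    return [=](const std::vector<llama_token> & history, std::vector<float> & logits) {
        size_t start = history.size() > last_n ? history.size() - last_n : 0;
        for (size_t i = start; i < history.size(); ++i) {
            float & l = logits[history[i]];
            l = l > 0 ? l / penalty : l * penalty;
        }
    };
}

// Static logit bias expressed the same way.
logit_processor make_logit_bias(std::vector<std::pair<llama_token, float>> bias) {
    return [bias = std::move(bias)](const std::vector<llama_token> &, std::vector<float> & logits) {
        for (const auto & [tok, b] : bias) {
            logits[tok] += b;
        }
    };
}

// The sampler would then run every registered processor before picking a token.
void apply_processors(const std::vector<logit_processor> & procs,
                      const std::vector<llama_token> & history,
                      std::vector<float> & logits) {
    for (const auto & p : procs) {
        p(history, logits);
    }
}
```

Reverse prompts / stop words and grammar-style constraints could plug into the same shape of interface, which is what makes the generalization appealing.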
@deep-pipeline LMQL looks really cool! Their [model serving process](https://docs.lmql.ai/en/stable/language/models.html#running-lmql-with-transformers) approach for local models would likely translate to `llama.cpp`.
Yeah, we punted on `--prompt-cache-all` in interactive mode because of the complexities of properly saving the session file on various exit paths. But it does support input in the sense...
At a basic level, the way to leverage this is to feed back the output of one call to `./main` as the prompt to the next call, optionally appending additional...
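Something along these lines (model path, file names, and prompt text are placeholders; the point is just the shape of the workflow):

```bash
# Sketch only -- paths and prompts are placeholders.
# First call: evaluate the initial prompt, generate some text, and save the
# evaluated state to a session file. --prompt-cache-all also stores the
# generated tokens so they can be reused as part of the next prompt.
./main -m models/7B/ggml-model.bin \
       --prompt-cache session.bin --prompt-cache-all \
       -f prompt.txt -n 64 > out.txt

# Next call: feed the previous output back as the new prompt, optionally with
# extra text appended. The shared prefix is loaded from session.bin instead of
# being re-evaluated.
cp out.txt prompt2.txt
echo "And then?" >> prompt2.txt
./main -m models/7B/ggml-model.bin \
       --prompt-cache session.bin \
       -f prompt2.txt -n 64
```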