
server: avoid full prompt eval when 'prompt >= ctx'

Open prfd opened this issue 10 months ago • 0 comments

When using the server for multi-turn chat, sooner or later the prompt will exceed the context size. The current approach truncates the prompt down to half of the context size, excluding n_keep:

https://github.com/ggerganov/llama.cpp/blob/192090bae47960f0d38d4967abe398a5d190057e/examples/server/server.cpp#L1969-L1983

Because of that, common_part will match only the first n_keep tokens (when cache_prompt: true):

https://github.com/ggerganov/llama.cpp/blob/192090bae47960f0d38d4967abe398a5d190057e/examples/server/server.cpp#L2011-L2016

Technically this is not a full prompt eval, since the n_keep tokens are not re-evaluated, but it would be better to avoid it if possible, especially because prompt eval is slow on CPU.
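To illustrate the interaction, here is a hypothetical, simplified sketch (not the actual server.cpp code) of a halving-style truncation and a prefix-matching helper in the spirit of common_part; the function names and exact block arithmetic are assumptions for illustration:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: keep the first n_keep tokens, then drop whole blocks
// of half the remaining context from the middle and keep the tail.
std::vector<int> truncate_prompt(const std::vector<int> &prompt,
                                 size_t n_ctx, size_t n_keep) {
    if (prompt.size() < n_ctx) return prompt;  // fits, nothing to do
    const size_t n_left  = n_ctx - n_keep;
    const size_t n_block = n_left / 2;         // half of the free context
    const size_t n_erase =
        ((prompt.size() - n_keep - n_block) / n_block) * n_block;
    std::vector<int> out(prompt.begin(), prompt.begin() + n_keep);
    out.insert(out.end(), prompt.begin() + n_keep + n_erase, prompt.end());
    return out;
}

// Length of the shared prefix between the cached tokens and the new prompt;
// everything after this point must be re-evaluated.
size_t common_part(const std::vector<int> &a, const std::vector<int> &b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
    return i;
}
```

In this sketch, once truncation removes tokens right after position n_keep, the cached sequence and the new prompt diverge at exactly index n_keep, so the prefix match covers only those n_keep tokens and nearly the whole truncated prompt is re-evaluated.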

prfd avatar Apr 23 '24 21:04 prfd