
Each subsequent question takes longer to get a response

damiandudycz opened this issue 4 months ago

Each new message added to the chat takes longer for the LLM to respond to. I wonder if this is because I rebuild the prompt from the entire history every time a new message is sent? Is there a better way to send the next message and keep the context without this slowdown? I've noticed that if I build the prompt from just the last two messages (the user message and an empty assistant message), it still works correctly, remembers the previous context, and responds faster. Should I use this approach?
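
For illustration, here is a rough sketch of the two approaches I am comparing. `ChatMessage`, `formatMessage` and the tag format are made-up placeholders, not llama_cpp_dart APIs, and the real chat template depends on the model:

```dart
// Illustrative only: these types and the template are placeholders.
class ChatMessage {
  final String role; // 'system', 'user' or 'assistant'
  final String content;
  ChatMessage(this.role, this.content);
}

// Hypothetical chat-template formatter; real templates are model-specific.
String formatMessage(ChatMessage m) => '<|${m.role}|>\n${m.content}';

// Approach 1: rebuild the prompt from the entire history. The prompt
// grows every turn, so prompt evaluation gets slower and slower.
String buildFullPrompt(List<ChatMessage> history) =>
    history.map(formatMessage).join('\n');

// Approach 2: send only the last user message plus an empty assistant
// turn, relying on the model's KV cache to hold the earlier context.
String buildLastTurnPrompt(ChatMessage lastUserMessage) =>
    '${formatMessage(lastUserMessage)}\n'
    '${formatMessage(ChatMessage('assistant', ''))}';

void main() {
  final history = [
    ChatMessage('user', 'Hi'),
    ChatMessage('assistant', 'Hello! How can I help?'),
    ChatMessage('user', 'Summarize our chat so far.'),
  ];
  print(buildFullPrompt(history));          // grows with every turn
  print(buildLastTurnPrompt(history.last)); // stays small
}
```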

damiandudycz avatar Aug 30 '25 06:08 damiandudycz

Yes. Because the whole history is sent as a brand-new prompt on every turn, the model has to re-evaluate all of it before it can start generating, so response time grows with the length of the conversation.
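
To make the cost concrete: when the prompt is rebuilt from the whole history, the number of tokens to evaluate before each answer grows roughly linearly with the number of turns. A back-of-the-envelope sketch (the tokens-per-turn figure is just an assumption):

```dart
void main() {
  const tokensPerTurn = 100; // assumption: average tokens added per message
  for (var turn = 1; turn <= 5; turn++) {
    // Rebuilding the prompt from scratch means all previous turns are
    // re-evaluated before generation of the new answer can even start.
    final promptTokens = tokensPerTurn * turn;
    print('turn $turn: ~$promptTokens prompt tokens to re-evaluate');
  }
}
```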

I am looking into a way to cache history. So far I have this approach, which works, but I am trying to improve it: https://github.com/netdur/llama_cpp_dart/blob/main/example/chat_session.dart
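
The rough idea behind that example, sketched below with a placeholder `generate` callback (this is not the actual contents of chat_session.dart): remember how much of the conversation the model has already evaluated and only feed it the new suffix.

```dart
// Sketch of the prompt-delta caching idea; `generate` stands in for
// whatever generation call the library actually exposes.
class CachedChatSession {
  String _evaluatedText = ''; // conversation text the model has already seen

  Future<String> send(
    String fullPrompt,
    Future<String> Function(String prompt) generate,
  ) async {
    // If the new prompt extends what was already evaluated, only the new
    // suffix needs to be sent; the KV cache already holds the rest.
    final delta = fullPrompt.startsWith(_evaluatedText)
        ? fullPrompt.substring(_evaluatedText.length)
        : fullPrompt; // history diverged: a real implementation would
                      // also reset the KV cache here
    final reply = await generate(delta);
    _evaluatedText = fullPrompt + reply;
    return reply;
  }
}
```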

You can avoid resetting the KV cache between prompts by using the low-level API, though it can be complex: https://github.com/netdur/llama_cpp_dart/blob/main/example/cache.dart
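
Conceptually, the low-level approach keeps a running position counter (`n_past` in llama.cpp terms) and decodes only newly appended tokens instead of clearing the KV cache and re-decoding everything. A hedged sketch with placeholder binding signatures (`Tokenize` and `DecodeTokens` are illustrative, not the real llama_cpp_dart API):

```dart
// Placeholder signatures standing in for real low-level bindings; the
// actual llama_cpp_dart API will differ.
typedef Tokenize = List<int> Function(String text);
typedef DecodeTokens = void Function(List<int> tokens, int nPast);

class KvCacheSession {
  final Tokenize tokenize;
  final DecodeTokens decodeTokens;
  int _nPast = 0; // tokens already evaluated and resident in the KV cache

  KvCacheSession(this.tokenize, this.decodeTokens);

  // Feed only newly appended text; the attention state for the first
  // `_nPast` tokens is reused instead of recomputed.
  void feed(String newText) {
    final tokens = tokenize(newText);
    decodeTokens(tokens, _nPast); // start evaluation at position _nPast
    _nPast += tokens.length;      // advance; never reset between prompts
  }
}
```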

netdur avatar Aug 30 '25 18:08 netdur

Thanks, I'll take a look at these. For now I've just started sending my last message plus an empty system message to the LLM, and it still remembers the previous context. I'm sure this has some quirks I haven't noticed yet, but so far it seems to work fine.

damiandudycz avatar Aug 30 '25 19:08 damiandudycz