Each new question takes longer to answer
Each new message added to the chat takes longer for the LLM to respond to. I wonder if this is because I rebuild the prompt from the whole history every time a new message is sent? Is there a better way to send the next message and keep context without this slowdown? I've noticed that if I build the prompt from just the last 2 messages (the user message and an empty assistant message), it still works correctly, remembers the previous context, and responds faster. Should I use this approach?
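To illustrate, this is roughly what I do now (a simplified sketch; `generate` and the prompt format are stand-ins, not llama_cpp_dart's actual API):

```dart
// Simplified sketch of my current approach. `generate` is a stand-in for
// the real inference call, not llama_cpp_dart's actual API.
Future<String> generate(String prompt) async {
  // The model call would go here; stubbed out for the sketch.
  return '<assistant reply>';
}

final List<({String role, String content})> history = [];

// The whole history is re-serialized on every turn, so the prompt the
// model has to prefill grows with every message. That is what makes each
// response slower than the last.
String buildPrompt() =>
    history.map((m) => '${m.role}: ${m.content}').join('\n');

Future<String> send(String userMessage) async {
  history.add((role: 'user', content: userMessage));
  final reply = await generate('${buildPrompt()}\nassistant: ');
  history.add((role: 'assistant', content: reply));
  return reply;
}
```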
Yes, because the whole history is sent as a new prompt each time: the model has to prefill an ever-growing prompt before it can start answering, so every turn is slower than the last.
I am looking into a way to cache history. So far I have this approach, which works, but I am trying to improve it: https://github.com/netdur/llama_cpp_dart/blob/main/example/chat_session.dart
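The gist of the history-caching idea, very simplified (the names below are made up for illustration; see chat_session.dart for the real implementation):

```dart
// Simplified sketch of the history-caching idea only. These names are
// made up; the actual code lives in chat_session.dart.
class ChatSession {
  final StringBuffer _transcript = StringBuffer();
  int _alreadyEvaluated = 0; // length of the prefix the model has seen

  // Returns only the new part of the transcript, so the backend does not
  // have to re-evaluate the cached prefix (as long as its KV cache is
  // kept alive between calls).
  String nextPrompt(String userMessage) {
    _transcript.write('user: $userMessage\nassistant: ');
    final delta = _transcript.toString().substring(_alreadyEvaluated);
    _alreadyEvaluated = _transcript.length;
    return delta;
  }

  void addReply(String reply) {
    _transcript.write('$reply\n');
    _alreadyEvaluated = _transcript.length;
  }
}
```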
You can avoid resetting the KV cache between prompts by using the low-level API. It can be complex: https://github.com/netdur/llama_cpp_dart/blob/main/example/cache.dart
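Conceptually it looks like this (hypothetical helper names, not the real low-level bindings; the actual code is in cache.dart):

```dart
// Rough sketch of the concept only. These helpers are hypothetical
// stand-ins, not the library's real low-level bindings.
List<int> tokenize(String text) => throw UnimplementedError();
void decodeAt(List<int> tokens, int nPast) => throw UnimplementedError();
String sampleReply() => throw UnimplementedError();

int nPast = 0; // number of tokens already resident in the KV cache

String sendTurn(String userText) {
  final tokens = tokenize('user: $userText\nassistant: ');
  // Decode only the new tokens at offset nPast. Everything before that
  // offset stays in the KV cache, so nothing is re-evaluated and the
  // cache is never reset between prompts.
  decodeAt(tokens, nPast);
  nPast += tokens.length;
  final reply = sampleReply();
  nPast += tokenize(reply).length;
  return reply;
}
```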
Thanks, I'll take a look at these. So far, I've just started sending my last message plus an empty system message to the LLM, and it still remembers the previous context. I'm sure it has some quirks I haven't noticed yet, but for now it seems to work fine.
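For reference, the workaround is just this (with the same kind of stand-in `generate` as in my sketch above, not the real API):

```dart
// The workaround, sketched. `generate` is again just a stand-in for the
// real inference call. Only the newest turn is serialized; this stays
// coherent only because the KV cache still holds the earlier turns.
Future<String> generate(String prompt) async => '<assistant reply>';

Future<String> sendLastOnly(String userMessage) =>
    generate('user: $userMessage\nassistant: ');
```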