Context in /api/generate response grows too big.
What is the issue?
I'm building my own chat UI for Ollama and using the context feature to implement dialog mode. Every time Ollama generates a response, the returned context (an array of token IDs) is saved into the chat object. On the next prompt this context is passed back into /api/generate, and after the response the resulting context is saved into the chat object again.
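For reference, this is roughly the loop I'm using (a minimal TypeScript sketch; `Chat` and `askOllama` are my own names, not part of the Ollama API, and only the /api/generate request/response fields come from Ollama):

```ts
// Minimal sketch of the dialog loop: the context array returned by
// /api/generate is stored on the chat object and sent back with the
// next prompt.
interface Chat {
  model: string;
  context?: number[]; // token ids returned by the previous generation
}

async function askOllama(chat: Chat, prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: chat.model,
      prompt,
      stream: false,
      context: chat.context, // undefined on the first turn, so it is omitted
    }),
  });
  const data = await res.json();
  chat.context = data.context; // saved for the next turn
  return data.response;
}
```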
After upgrading to the latest Ollama I've noticed that generation speed has degraded considerably and that the context returned by /api/generate grows much faster than in previous versions.
It looks like the context size doubles after each generation, so in a relatively small chat of 26 messages it reaches something like 3-7 MB. This makes my UI unresponsive and freezes the browser, since it has to process such a huge amount of data (mostly for debugging, e.g. converting the JSON to a string, but this is not normal either way). Earlier (at least on 0.2.1, which I used before) the context was around 8-16 KB, which is totally fine and also fits the model's capacity.
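This is roughly how I observed the growth (a hypothetical logging wrapper building on the sketch above; the exact sizes in your setup will differ):

```ts
// Hypothetical instrumentation: log how the context grows per turn.
// On 0.2.1 the JSON-serialized context stayed in the 8-16 KB range for me;
// on 0.3.0 it roughly doubles after each generation.
let turn = 0;

async function askAndMeasure(chat: Chat, prompt: string): Promise<string> {
  const answer = await askOllama(chat, prompt);
  turn++;
  const tokens = chat.context?.length ?? 0;
  const bytes = JSON.stringify(chat.context ?? []).length;
  console.log(`turn ${turn}: context = ${tokens} tokens (~${bytes} bytes as JSON)`);
  return answer;
}
```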
This is pretty hard to measure (and I don't know how to), but I've also noticed that with the latest Ollama, newer models like gemma2 or llama3.1 don't adhere to the context as well as some older models like mistral did on an earlier Ollama version. This could be related to the context changes: context handling was broken in 0.2.2, then the response was fixed, but it looks like the fix was not completely correct.
OS
Linux
GPU
Nvidia
CPU
Intel
Ollama version
0.3.0