Context in /api/generate response grows too big.
What is the issue?
I'm building my own chat UI for Ollama and using the context feature to implement dialog mode. Every time Ollama generates a response, the returned context (an array of token IDs) is saved into the chat object. On the next prompt this context is passed back into /api/generate, and after the response the resulting context is saved into the chat object again.
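For reference, this is roughly the loop I'm using (a minimal TypeScript sketch; `Chat` and `askOllama` are my own names, not part of the Ollama API, and only the /api/generate request/response fields come from Ollama):

```ts
// Minimal sketch of the dialog loop: the context array returned by
// /api/generate is stored on the chat object and sent back with the
// next prompt.
interface Chat {
  model: string;
  context?: number[]; // token ids returned by the previous generation
}

async function askOllama(chat: Chat, prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: chat.model,
      prompt,
      stream: false,
      context: chat.context, // undefined on the first turn, so it is omitted
    }),
  });
  const data = await res.json();
  chat.context = data.context; // saved for the next turn
  return data.response;
}
```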
After upgrading to the latest Ollama I've noticed that generation speed has degraded considerably and that the context returned by /api/generate grows much faster than in previous versions.
It looks like the context size doubles after each generation, so in a relatively small chat of 26 messages it reaches something like 3-7 MB. This makes my UI unresponsive and freezes the browser, since it has to process such a huge amount of data (mostly for debugging, e.g. converting the JSON to a string, but this is not normal either way). Earlier (at least on 0.2.1, which I used before) the context was around 8-16 KB, which is totally fine and also fits the model's capacity.
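This is roughly how I observed the growth (a hypothetical logging wrapper building on the sketch above; the exact sizes in your setup will differ):

```ts
// Hypothetical instrumentation: log how the context grows per turn.
// On 0.2.1 the JSON-serialized context stayed in the 8-16 KB range for me;
// on 0.3.0 it roughly doubles after each generation.
let turn = 0;

async function askAndMeasure(chat: Chat, prompt: string): Promise<string> {
  const answer = await askOllama(chat, prompt);
  turn++;
  const tokens = chat.context?.length ?? 0;
  const bytes = JSON.stringify(chat.context ?? []).length;
  console.log(`turn ${turn}: context = ${tokens} tokens (~${bytes} bytes as JSON)`);
  return answer;
}
```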
This is pretty hard to measure (and I don't know how to), but I've also noticed that with the latest Ollama, newer models like gemma2 or llama3.1 don't adhere to the context as well as some older models like mistral did on an earlier Ollama version. This could be related to the context changes: context handling was broken in 0.2.2, then the response was fixed, but it looks like the fix was not completely correct.
OS
Linux
GPU
Nvidia
CPU
Intel
Ollama version
0.3.0