[Roadmap] Context Token Control
Why: API usage is billed by tokens and is stateless. Currently there is only control over output size, and the whole conversation is sent as input context. On a 32k-token model, costs increase fast when you only need it to remember the last exchange, such as when using it for code development.
Concise description: When using API calls to Mistral, Google, or OpenAI, I can control the output tokens but not the input. Checking my token usage as the conversation grows, there is an increase in cost per message because the input grows toward the model's maximum context size (i.e. 28k input tokens for a 4k output).
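To illustrate why the input side dominates the bill, here is a rough calculation. The per-token prices below are purely hypothetical placeholders, not any provider's actual rates:

```typescript
// Illustrative only: these prices are assumptions, not real provider rates.
const PRICE_PER_1K_INPUT = 0.03;  // $ per 1k input tokens (assumed)
const PRICE_PER_1K_OUTPUT = 0.06; // $ per 1k output tokens (assumed)

function messageCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * PRICE_PER_1K_INPUT
       + (outputTokens / 1000) * PRICE_PER_1K_OUTPUT;
}

// 28k input for a 4k output: the input cost is 3.5x the output cost,
// even though the output price per token is double.
const cost = messageCost(28_000, 4_000);
console.log(cost.toFixed(2)); // "1.08" — $0.84 for input vs $0.24 for output
```

Capping the input would directly attack the larger of the two terms.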
Requirements: Add another slider for input tokens, next to the output-token slider, that determines how much context to send in the API call. It can default to the maximum for new users, while more advanced users can lower it to reduce their API bills.
The complexity is deciding how to pick what to send within that input budget. Assume you have a fixed budget of 1,000 tokens, but the message you are typing is 1,200 (excluding chat history). How should the app decide what to include or truncate?
It would just take the last 1,000 tokens. In reality, all of these models have a 16k or larger token limit, so there is enough room for the message itself; the issue is the chat history, since usually only the previous exchange is needed. You could also let users specify how many previous messages to send, but I think that would be more work than having more advanced users set the input token limit.
Deciding what to omit is key. For instance, the system prompt is very important and should not be the first thing to be cut.
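The simplest policy described so far could be sketched like this. Note this is only an illustration, not big-AGI's actual implementation: the `estimateTokens` heuristic (~4 characters per token) is an assumption, since real tokenizers differ per model. The system prompt and the message being sent are pinned; the remaining budget is filled with history, newest first, so the oldest turns drop out:

```typescript
interface ChatMessage { role: 'system' | 'user' | 'assistant'; text: string; }

// Rough heuristic (~4 chars/token) — an assumption, not a real tokenizer
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function packContext(
  systemPrompt: ChatMessage,
  history: ChatMessage[],
  userMessage: ChatMessage,
  inputBudget: number,
): ChatMessage[] {
  // The system prompt and the current user message are never cut
  let remaining = inputBudget
    - estimateTokens(systemPrompt.text)
    - estimateTokens(userMessage.text);

  // Fill what is left with history, walking backwards from the newest turn
  const kept: ChatMessage[] = [];
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i].text);
    if (cost > remaining) break; // oldest turns fall off the window
    remaining -= cost;
    kept.unshift(history[i]);
  }
  return [systemPrompt, ...kept, userMessage];
}
```

This implements the "last N tokens" policy mechanically, but it makes none of the intelligent choices discussed below (e.g. keeping an old-but-crucial message over a recent trivial one).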
To my knowledge, this problem (context stuffing under constraints) hasn't been solved satisfactorily by anyone yet.
The issue is that to select what to omit from the context you need some sort of intelligence: either human (a person picks which messages to exclude from the context) or machine (embeddings, or better, a smaller GPT network).
There is no simple answer to your request. I'm leaning towards empowering the user to manually choose what to exclude from the LLM's input, and maybe adding a button to suggest what to remove (but again, that requires intelligence).
I completely agree with the strategy of empowering users to control the manipulation of input in these scenarios.
Some thoughts that occurred to me while reading this:
Regarding the Intelligence aspect, I believe this could either enhance or be enhanced by a Condenser: https://github.com/enricoros/big-AGI/issues/292
For a straightforward approach, the Condenser might be utilized on an on-demand or opt-in basis, triggered by a pre-set threshold or governed by rule-based logic (for example, "condense every 10 turns").
Additionally, the Condenser engine could be specifically designed for this context, offering capabilities for detection and pruning in addition to condensation. This would enable it to automatically identify moments during a conversation when condensation might be beneficial.
In trying to make a game, I need a way for the context window to move with the conversation (or game) and 'forget' the oldest messages, while keeping the system prompt at the beginning. It could simply be:
- System prompt
- Last X tokens of the conversation
- The user's message
sent to the API
This would also allow models with different context lengths to view a larger conversation without hitting an error, while limiting the cost of higher-end models in those longer conversations.
Thanks for the use case @D2lod; there are various approaches to this. Token-based estimation could be a bit risky, as many APIs don't expose a tokenizer, some require HTTP round trips (e.g. Anthropic), and in general there are hidden costs in composing messages. Big-AGI tries to be as accurate as possible, but we can't know how OpenAI and others transform the messages.
What do you think of the following:
- would you consider a last-N-messages approach instead of a total-token budget?
- where would you set this "sliding window" in the UI?
- is this something you'd activate per-Chat, or per-Persona, or per-Project, or globally in Big-AGI, changed only once?
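For comparison, the last-N-messages variant needs no tokenizer or per-provider estimation at all, which is what makes it attractive given the round-trip concerns above. A minimal sketch (the `lastN` value would come from the hypothetical slider discussed in this thread):

```typescript
interface ChatMessage { role: 'system' | 'user' | 'assistant'; text: string; }

// Keep all system messages, plus only the last N conversation messages.
// No token counting needed, so it behaves identically across providers.
function lastNWindow(messages: ChatMessage[], lastN: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === 'system');
  const rest = messages.filter((m) => m.role !== 'system');
  return [...system, ...rest.slice(-lastN)];
}
```

The trade-off: N messages of wildly varying length can still overflow a small model's context, so this sidesteps the tokenizer problem rather than solving it.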