Use RAG for better context
What is RAG?
A brief explanation of RAG by GPT-4 Turbo:
"Retrieval Augmented Generation" (RAG) is a natural language processing technique that combines traditional generative models with retrieval mechanisms. RAG first retrieves relevant information from a large database of documents, and then uses this information to aid a generative model (like a Transformer) in text generation. This approach significantly enhances the relevance and accuracy of the text because the model can utilize real-time knowledge fetched from the retrieved content, rather than solely relying on the knowledge learned during training. This technique is widely used in scenarios requiring external knowledge, such as question answering, summary generation, and more.
And here's a detailed introduction to RAG: Retrieval Augmented Generation: Streamlining the creation of intelligent natural language processing models (meta.com)
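To make the retrieve-then-generate pattern from the quote concrete, here is a minimal, self-contained sketch. The keyword-overlap scoring in `retrieve` is a toy stand-in for real vector search, and the prompt assembly stands in for an actual LLM call; everything here is illustrative, not OI code:

```python
# Toy illustration of the retrieve-then-generate pattern described above.
# retrieve() uses naive keyword overlap as a stand-in for real vector search.
def retrieve(question: str, documents: list[str], top_k: int = 3) -> list[str]:
    words = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:top_k]

def build_rag_prompt(question: str, documents: list[str]) -> str:
    # The retrieved passages are prepended so the model can ground its answer.
    passages = retrieve(question, documents)
    return "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}"

docs = ["RAG retrieves documents before generating.",
        "Transformers are generative models."]
print(build_rag_prompt("How does RAG use retrieval?", docs))
```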
Why RAG?
Currently OI uses a simple strategy to maintain the context of conversations:
- Always keep the system prompts in context; these are necessary.
- Put all history messages in the context and send them to the LLM, as long as their total length doesn't exceed the `context_window` setting.
- If the length of the history messages exceeds the `context_window`, remove the earliest messages from the context until it fits (see the sketch below).
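For reference, the truncation strategy above boils down to something like the following (a simplified sketch, not the actual OI code; `count_tokens` and the whitespace "tokenizer" in the usage example are stand-ins for whatever OI actually uses):

```python
# Simplified sketch of the current truncation strategy described above.
def build_context(system_prompts, history, count_tokens, context_window):
    messages = list(history)
    # Drop the earliest history messages until the whole context fits.
    while messages and count_tokens(system_prompts + messages) > context_window:
        messages.pop(0)
    return system_prompts + messages

# Toy usage: the two earliest history messages get dropped to fit the window.
ctx = build_context(
    ["You are a helpful assistant."],
    ["first question", "first answer", "latest question"],
    count_tokens=lambda msgs: sum(len(m.split()) for m in msgs),
    context_window=8,
)
```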
This strategy works well for most daily-task use cases; however, several problems appear in conversations that need long context, such as using the LLM as an assistant to summarize research papers in a field:
- High token cost: before the `context_window` is exceeded, the length of the context sent to the LLM grows linearly as the conversation goes on. With context windows getting larger and larger (for example, the current default model `openai/gpt4-turbo` has a 128K context window), this gets expensive if a user keeps asking questions in a single conversation. What's even more horrifying, once a conversation exceeds the `context_window`, the cost of each following request stops growing but stays at a very high level, since every request then sends a full window of input tokens.
- Loss of early memory: in most cases, the most important context for the LLM to generate a good answer is in the latest messages, but sometimes important information also sits in early messages, such as the user's instructions for the current conversation and background info.
- Noise in context: even though LLMs are capable of handling a lot of information and extracting the useful parts, irrelevant information can still hurt the accuracy of their answers.
With RAG, we can convert the history messages of the current conversation into embeddings and store them in a vector database. Every time there's a new message from the user, we can use the user's input as a query, search the vector database for the most relevant context, and put that into the context sent to the LLM. This way, we get a flexible context length for different questions and more useful information in the context. Besides, we can do more with RAG in the future; for example, we could add an interface for users to import local documents as background information for conversations.
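A minimal sketch of what this could look like, using chromadb as the vector database (that library choice, and the function names below, are my assumptions for illustration, not decided implementation details):

```python
# Sketch: store finished messages as embeddings, retrieve them by relevance.
# chromadb's default embedding function is used here for simplicity.
import chromadb

client = chromadb.Client()  # in-memory; a persistent client also exists
history = client.create_collection("conversation_history")

def remember(message_id: str, role: str, content: str) -> None:
    # Embed and store a message once it is finished.
    history.add(ids=[message_id], documents=[content], metadatas=[{"role": role}])

def retrieve_context(user_input: str, n_results: int = 2) -> list[str]:
    # Use the new user message itself as the retrieval query.
    results = history.query(query_texts=[user_input], n_results=n_results)
    return results["documents"][0]

remember("msg-1", "user", "Please summarize recent papers on retrieval augmentation.")
remember("msg-2", "assistant", "RAG combines a retriever with a generator ...")
relevant = retrieve_context("What did we say about retrieval earlier?")
# `relevant` would be prepended to the prompt instead of the full history.
```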
All in all, for an LLM client (both as an application kernel and as a standalone CLI), context management is a low-level but important part, and spending some effort on it will be worthwhile.
How to implement RAG in OI?
I think langchain-ai/langchain: 🦜🔗 Build context-aware reasoning applications (github.com) would be a great library for bringing in RAG, as well as other features that could enhance OI's context management. That said, this will be a tough and huge task, involving a lot of research, development, and testing work; implementation details will be updated later. BTW, I am planning to implement this as an optional feature that is off by default, which means it is only for users who know well what they are playing with.
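For example, with langchain the retrieval part might look roughly like this (a sketch only: langchain's import paths change between versions, and FAISS plus OpenAIEmbeddings are arbitrary choices here, requiring the faiss-cpu package and an OpenAI API key):

```python
# Rough sketch of history retrieval via langchain's vector-store interface.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # needs OPENAI_API_KEY in the environment
store = FAISS.from_texts(
    ["user: summarize papers on RAG", "assistant: RAG combines retrieval ..."],
    embeddings,
)
retriever = store.as_retriever(search_kwargs={"k": 2})
docs = retriever.invoke("which technique combines retrieval with generation?")
# docs is a list of Documents whose page_content would go into the prompt.
```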
@KillianLucas @Notnaton @MikeBirdTech @CyanideByte @tyfiero Do you think this would be a good feature? If so, I will start implementing it.
It is a good feature, and it would save a lot of money. It would act like a database for OI and also help in further conversations.
Long-term memory management is a must in order for this project to become practical. RAG is a possible solution, but there are also related discussions here which you may want to look into in order to avoid duplicate efforts.
My bad, I searched for related issues but filtered for "open" ones only.
Never mind, I will implement this myself because I need it. No PR will be opened for this, so as not to pollute the codebase.