
[Bug]: ContextChatEngine drops user query in favor of retriever context when context is longer than memory token limit

Open · gregory-shklover opened this issue 1 year ago · 1 comment

Bug Description

In ContextChatEngine.chat() there are these lines:

all_messages = prefix_messages + self._memory.get(
    initial_token_count=prefix_messages_token_count
)

When prefix_messages_token_count > self._memory.token_limit, ChatMemoryBuffer returns an empty list, effectively dropping the most important part: the user query, which was previously pushed into self._memory as the last message.

Instead, there needs to be a way to drop part of the prefix_messages in favor of the user query.
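Below is a minimal sketch of the failure mode, assuming the 0.9.x imports (llama_index.memory.ChatMemoryBuffer and llama_index.llms.ChatMessage/MessageRole); the token numbers are illustrative:

from llama_index.llms import ChatMessage, MessageRole
from llama_index.memory import ChatMemoryBuffer

# Deliberately tiny buffer, as in the reproduction below.
memory = ChatMemoryBuffer.from_defaults(token_limit=50)
memory.put(ChatMessage(role=MessageRole.USER, content="Write python code to compute the norm of x"))

# ContextChatEngine.chat() passes the token count of the prefix messages
# (system prompt + retrieved context) as initial_token_count. If that count
# already exceeds token_limit, no budget is left for the chat history, so
# get() returns an empty list here (0.9.7) and the user query is dropped.
history = memory.get(initial_token_count=120)  # pretend the context costs 120 tokens
print(history)  # -> []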

Version

0.9.7

Steps to Reproduce

memory = ChatMemoryBuffer.from_defaults(token_limit=100) # explicitly low for reproduction purposes

documents = [TextNode(text=''' def linalg.norm(x, ord=None, axis=None, keepdims=False): """ Matrix or vector norm.

This function is able to return one of eight different matrix norms, or one of an infinite number of vector norms (described below), depending on the value of the ord parameter.

Parameters:

xarray_like: Input array. 
    If axis is None, x must be 1-D or 2-D, unless ord is None. If both axis and ord are None, the 2-norm of x.ravel will be returned.
ord: {non-zero int, inf, -inf, ‘fro’, ‘nuc’}, optional
    Order of the norm (see table under Notes). inf means numpy’s inf object. The default is None.
axis: {None, int, 2-tuple of ints}, optional.
    If axis is an integer, it specifies the axis of x along which to compute the vector norms. If axis is a 2-tuple, it specifies the axes that hold 2-D matrices, and the matrix norms of these matrices are computed. If axis is None then either a vector norm (when x is 1-D) or a matrix norm (when x is 2-D) is returned. The default is None.
keepdims: bool, optional
    If this is set to True, the axes which are normed over are left in the result as dimensions with size one. With this option the result will broadcast correctly against the original x.

Returns: n: float or ndarray
    Norm of the matrix or vector(s).
"""

''')]

llm = AzureOpenAI(
    model=API_MODEL_NAME,
    api_version=API_VERSION,
    azure_endpoint=API_BASE,
    azure_deployment=API_DEPLOYMENT_NAME,
    engine=API_DEPLOYMENT_NAME,
    api_key=API_KEY,
)

embed_model = AzureOpenAIEmbedding(
    azure_endpoint=API_BASE,
    azure_deployment=API_EMBEDDING_NAME,
    api_version=API_VERSION,
    api_key=API_KEY,
)

index = GPTVectorStoreIndex(
    documents,
    service_context=ServiceContext.from_defaults(llm=llm, embed_model=embed_model),
)

memory = ChatMemoryBuffer.from_defaults(token_limit=50)

chat_engine = index.as_chat_engine(
    chat_mode=ChatMode.CONTEXT,
    memory=memory,
    system_prompt='',
    context_template="Provided the following python APIs:\n\n{context_str}\n\nRespond with python code.",
    similarity_top_k=3,
)

Set a breakpoint inside ContextChatEngine.chat and inspect all_messages while running:

chat_engine.chat('Write python code to compute norm of numpy.array defined in variable "x"')
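At that breakpoint, per the description above, the last user message is missing; an illustrative check (evaluated in the debugger, with MessageRole imported from llama_index.llms):

# With token_limit=50 the retrieved context alone exceeds the buffer limit,
# so self._memory.get(...) contributes nothing and the user question is
# missing from the messages that would be sent to the LLM.
assert not any(m.role == MessageRole.USER for m in all_messages)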

Relevant Logs/Tracebacks

No response

gregory-shklover · Dec 27 '23 07:12

🤖

Hello @gregory-shklover,

I'm Dosu, a friendly bot here to assist you with your LlamaIndex issues and queries while we wait for a human maintainer. I'm designed to help troubleshoot bugs, answer your questions, and guide you on your journey as a contributor. Let's tackle this issue together!

Thank you for bringing this issue to our attention. It seems like you're encountering a problem where the user query is dropped when the prefix_messages_token_count exceeds the ChatMemoryBuffer token limit. This is indeed a problem as the user query is a crucial part of the chat context.

Based on a similar issue found in the LlamaIndex repository, the problem arises when the total number of tokens in the input exceeds the maximum token limit set for the model. The total number of tokens is calculated by adding the tokens in the system prompt, the tokens in the chat messages, and the tokens required for formatting. If the total number of tokens exceeds the model's token limit, the ChatMemoryBuffer returns an empty list, effectively dropping the user query.

To resolve this issue, you can modify the _get_prefix_messages_with_context method in the ContextChatEngine class to check if the token count of the prefix messages exceeds the memory limit. If it does, you can drop the oldest messages until the token count is within the limit. Here's how you can modify the method:

def _get_prefix_messages_with_context(self, context_str: str) -> List[ChatMessage]:
    """Get the prefix messages with the retrieved context prepended."""
    # ensure we grab the user-configured system prompt
    system_prompt = ""
    prefix_messages = self._prefix_messages
    if (
        len(self._prefix_messages) != 0
        and self._prefix_messages[0].role == MessageRole.SYSTEM
    ):
        system_prompt = str(self._prefix_messages[0].content)
        prefix_messages = self._prefix_messages[1:]

    context_str_w_sys_prompt = system_prompt.strip() + "\n" + context_str
    prefix_messages = [
        ChatMessage(content=context_str_w_sys_prompt, role=MessageRole.SYSTEM),
        *prefix_messages,
    ]

    def _token_count(messages: List[ChatMessage]) -> int:
        """Count tokens across the given messages with the memory's tokenizer."""
        return len(
            self._memory.tokenizer_fn(
                " ".join([(m.content or "") for m in messages])
            )
        )

    # Check whether the prefix messages alone exceed the memory token limit
    prefix_messages_token_count = _token_count(prefix_messages)
    memory_token_limit = self._memory.token_limit

    # If they do, drop the oldest non-system prefix message until the count
    # fits (or only the system/context message at index 0 remains), leaving
    # room in the buffer for the chat history, including the latest user query.
    while (
        prefix_messages_token_count > memory_token_limit
        and len(prefix_messages) > 1
    ):
        prefix_messages.pop(1)  # index 0 holds the system/context message
        prefix_messages_token_count = _token_count(prefix_messages)

    return prefix_messages

This modification drops the oldest non-system prefix messages until their token count fits within the memory limit, which leaves room in the buffer for the chat history, so the user query is retained as long as the system/context message itself fits under the limit.
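To see the effect in context, here is the combining step from ContextChatEngine.chat() (paraphrased from the snippet quoted at the top of this issue), followed by an illustrative check you could evaluate at the same breakpoint used in the reproduction:

prefix_messages = self._get_prefix_messages_with_context(context_str)
prefix_messages_token_count = len(
    self._memory.tokenizer_fn(
        " ".join([(m.content or "") for m in prefix_messages])
    )
)
# The trimmed prefix now fits inside token_limit, so get() has budget left
# for the chat history and the latest user query is no longer dropped.
all_messages = prefix_messages + self._memory.get(
    initial_token_count=prefix_messages_token_count
)

assert any(m.role == MessageRole.USER for m in all_messages)  # evaluate at the breakpoint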

Please let me know if this solution works for you or if you need further assistance.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] · Dec 27 '23 07:12