
[BUG]: Differences in Behavior Between Threads API and Workspace in AnythingLLM

Open Peterson047 opened this issue 1 year ago • 2 comments

How are you running AnythingLLM?

Docker (remote machine)

What happened?

Hello everyone,

I have been implementing a system over the past few months and noticed some peculiarities in the API's behavior. The main issue is that the API does not respond exactly the same way as it does within the workspace on the platform, even when using the threads API. It seems the handling is different.

Specifically, I believe the difference may be related to not considering document checks with embeddings. My system is a chat accessed externally via API. Initially, I used the workspace API (/chat), but I realized that all users were communicating in the same conversation. I solved this by creating a separate thread for each user.
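
For reference, this is roughly how I create and use one thread per user (a minimal Node sketch; the base URL, API key, workspace slug, and the thread-creation endpoint and payload fields are placeholders based on my reading of the API docs, so they may not be exact):

// Minimal sketch: one AnythingLLM thread per external user.
// BASE_URL, API_KEY and the workspace slug are placeholders.
const BASE_URL = "http://localhost:3001/api/v1";
const HEADERS = {
  Authorization: "Bearer <API_KEY>",
  "Content-Type": "application/json",
};

// Create a dedicated thread for a user (endpoint and payload assumed from the docs).
async function createThreadFor(userId) {
  const res = await fetch(`${BASE_URL}/workspace/my-workspace/thread/new`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({ name: `user-${userId}` }),
  });
  const data = await res.json();
  return data.thread.slug; // stored and reused for this user
}

// Chat on that user's thread (the workspace is in query mode).
async function chatOnThread(threadSlug, message) {
  const res = await fetch(`${BASE_URL}/workspace/my-workspace/thread/${threadSlug}/chat`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({ message, mode: "query" }),
  });
  return res.json();
}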

However, the behavior is still not the same as inside AnythingLLM. My workspace is configured in query mode and contains documents in RAG. Another point is that the API returns too many emojis in conversations, which doesn't happen within the system.

My main question is: does the API actually handle things differently? Initially, I thought using the threads API would result in the exact same behavior, but it seems that's not the case.

Thanks in advance for any help or clarification.

Are there known steps to reproduce?

A thread in a workspace with a file embedded, and an app calling the workspace threads API: the same question gives different answers.

Peterson047 avatar Oct 23 '24 20:10 Peterson047

Do you know what image (and hash if possible) you are on? We made some edits recently to this functionality that may be able to explain the discrepancy.

timothycarambat avatar Oct 24 '24 22:10 timothycarambat

Sorry, I'm using a version of the master tag from 7 weeks ago; since the latest one is from 3 weeks ago, I thought the production server was on that version. I saw that a new version came out yesterday, so I will test it in an environment and come back with feedback.

"Id": "sha256:984ed5766441e000a91fe61019204a814a83542836dc7ee76d78a39ebeac6276", "RepoTags": [ "mintplexlabs/anythingllm:master" ], "RepoDigests": [ "mintplexlabs/anythingllm@sha256:6bc0b731a15a9d933d52d7f575cc61aa6e1b7e0b77a2d9e1b0f7fc36285ac355", "mintplexlabs/anythingllm@sha256:d1a25203ac2b4af5d5ff691c304f17281cd6fd864c44c30f5ba531c7f134dbbf" ], "Parent": "", "Comment": "buildkit.dockerfile.v0", "Created": "2024-08-30T22:24:24.336182248Z",

Peterson047 avatar Oct 25 '24 14:10 Peterson047

Hi there!

I would like to expand on this report because I think I am experiencing a similar issue.

Description

When running a Retrieval-Augmented Generation (RAG) workflow in AnythingLLM, discrepancies in behavior and response structure are observed between the AnythingLLM workspace, its chat widget, and the external API. The main issues include:

  1. Chat Widget vs. Cloud Instance:

    • The chat widget fails to retrieve URLs from embedded documents and produces poorly structured responses. It appears to deviate from the intended prompt despite being configured identically to the cloud instance.
  2. API Behavior Deviations:

    • Using the threads API for external chat applications yields inconsistent results compared to the AnythingLLM workspace. Specifically:
      • Responses do not follow the response structure instructed via the prompt.
      • It appears that document checks using embeddings are not being considered consistently in API responses.
  3. Session Context Leakage:

    • Querying the LLM with simple inputs like "ping" in a new session occasionally returns RAG information, as if it retains context from previous sessions or threads. This behavior is particularly problematic for multi-user scenarios.
  4. Workspace Thread Behavior:

    • Using the threads API initially resulted in shared conversation context across users, which was mitigated by creating separate threads for each user. However, even with separate threads, the API’s responses differ from the workspace’s behavior.

Expected Behavior

  • Consistent behavior and responses across the AnythingLLM workspace, chat widget, and API threads.
  • Adherence to the same prompt structure and retrieval logic, including proper embedding checks for documents.
  • No session context leakage across threads or sessions.
  • Appropriate and consistent use of emojis, as configured.

Actual Behavior

  • Chat Widget: Poorly structured responses with missing document URLs.
  • API: Inconsistent responses compared to the workspace and potential lack of embedding checks.
  • Session Context: Leakage of context across sessions or threads.

Steps to Reproduce

  1. Set up a RAG-enabled AnythingLLM workspace with embedded documents.
  2. Query the same question:
    • In the workspace.
    • Via the chat widget.
    • Via the threads API in an external chat application.
  3. Compare the responses for consistency in structure, document retrieval, and adherence to the prompt.
  4. Start a new thread/session and query with a simple input like "ping", observing for context leakage (a rough probe for this step is sketched below).
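
A rough probe for step 4, in Node (base URL, API key, slugs, and the response field names are placeholders rather than exact values from the docs):

// Hypothetical probe: ask "ping" in a brand new thread and print any RAG
// sources that come back. All identifiers below are placeholders.
const res = await fetch(
  "http://localhost:3001/api/v1/workspace/my-workspace/thread/fresh-thread/chat",
  {
    method: "POST",
    headers: {
      Authorization: "Bearer <API_KEY>",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ message: "ping", mode: "query" }),
  }
);
const data = await res.json();
console.log(data.textResponse); // the answer itself
console.log(data.sources);      // citations attached; unexpected for an input that is just "ping"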

carneiran avatar Dec 05 '24 11:12 carneiran

You captured exactly what I noticed during my tests. I'm currently working with DialogFlow, and it allows you to create new sessions automatically, based on any field in the JSON request, such as the user's WhatsApp number. The sessions created this way are identical to those generated internally, without distinction.

Perhaps unifying or developing a new method of session management in the API, instead of simply creating threads, would work better. I hope the team implements this soon or finds an even better solution, because the main problem I face today is this discrepancy in responses between the different implementations, in addition to the lack of a more adequate session management via API.

I could even create a web server to manage the threads automatically, associating the ID of each thread with an external ID of the request, but I believe that this would not be the ideal solution.
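
What I mean is roughly the following (a sketch only; the thread-creation endpoint is assumed, and a real version would persist the mapping instead of keeping it in memory):

// Map an external ID (e.g. the user's WhatsApp number) to a thread slug,
// creating the thread lazily the first time that user writes in.
const threadByExternalId = new Map();

async function threadFor(externalId) {
  if (threadByExternalId.has(externalId)) {
    return threadByExternalId.get(externalId);
  }
  const res = await fetch("http://localhost:3001/api/v1/workspace/my-workspace/thread/new", {
    method: "POST",
    headers: {
      Authorization: "Bearer <API_KEY>",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ name: `user-${externalId}` }),
  });
  const slug = (await res.json()).thread.slug;
  threadByExternalId.set(externalId, slug);
  return slug;
}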

Peterson047 avatar Dec 05 '24 12:12 Peterson047

@Peterson047

, in addition to the lack of a more adequate session management via API.

How can you be sure this is not just the LLM hallucinating? The session chats between threads are not shared. This is explicit in the codebase, and you can debug the messages in transit to the LLM provider and see clearly, from the messages available, that this is not occurring.

@thurkul

Chat Widget vs. Cloud Instance: The chat widget fails to retrieve URLs from embedded documents and produces poorly structured responses. It appears to deviate from the intended prompt despite being configured identically to the cloud instance.

The code is the same between a local docker instance and a cloud instance. I presume this is referring to the application's UI vs the widget?

How embed handles chats: https://github.com/Mintplex-Labs/anything-llm/blob/6c9e234227aa32a02ba6a05ee978534760e3fd74/server/utils/chats/embed.js

How UI chats are handled: https://github.com/Mintplex-Labs/anything-llm/blob/6c9e234227aa32a02ba6a05ee978534760e3fd74/server/utils/chats/stream.js

Between the two, embed.workspace in embed is the same object as workspace in stream. Comparing the two flows they are the same. Just because the same message is provided to an LLM does not guarantee the same response, even with the same parameters like temp and so on.

Using the threads API for external chat applications yields inconsistent results compared to the AnythingLLM workspace. Specifically: Responses are structurally different from the response structure instructed via prompts. It appears that document checks using embeddings are not being considered consistently in API responses.

Again here, I would like to see more evidence of this behavior. To reiterate: an LLM receiving the same message array and params can, and likely will, produce varying responses under the same inputs. This is especially true with local LLMs and heavily quantized models. You have not said what provider and model you are using; larger cloud-based LLMs don't have this issue as often since they are much more powerful.

If you require specific structured output and define it in a system prompt, this should guide the LLM to respond in that format, but it's not a guarantee. It is an LLM, and its responses are not deterministic, especially with local or heavily quantized small-parameter models.

Session Context Leakage:

Querying the LLM with simple inputs like "ping" in a new session occasionally returns RAG information, as if it retains context from previous sessions or threads. This behavior is particularly problematic for multi-user scenarios.

Here again, evidence of the logged chat messages showing that context and history from thread A is present in thread B would be required to substantiate this.

Querying the LLM with simple inputs like "ping" in a new session occasionally returns RAG information, as if it retains context from previous sessions or threads.

This is just how RAG works: all threads share documents under a workspace, but not chat histories! So if the RAG response contains citations, it simply means the vector database considered those chunks possibly relevant to the query/prompt. This is not problematic on its face, since this is how RAG works - however, if you want "stricter" citation behavior, you can get it by setting the similarity threshold.
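
For example, roughly like this over the API (double-check the exact endpoint and field name in the API reference; a higher value means a chunk must match the prompt more closely before it is attached as a citation):

// Sketch: raise the similarity threshold so loosely related chunks are not cited.
// Endpoint path and field name are from memory, verify against the API docs.
await fetch("http://localhost:3001/api/v1/workspace/my-workspace/update", {
  method: "POST",
  headers: {
    Authorization: "Bearer <API_KEY>",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ similarityThreshold: 0.5 }),
});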

Workspace Thread Behavior:

Using the threads API initially resulted in shared conversation context across users, which was mitigated by creating separate threads for each user. However, even with separate threads, the API’s responses differ from the workspace’s behavior.

Again, here we would need some logs of context sharing in the messages sent to the LLM to rule out this being model behavior or hallucinations.

However, even with separate threads, the API’s responses differ from the workspace’s behavior.

Even when given the exact same prompt and settings, a model's response can be different. The main issue is whether it is accurate to the query. The response differing is almost expected; the main concern is whether you get totally invalid nonsense between the API and the workspace. If both are correct but worded differently - that is just LLMs for you!


I am happy to look into any of this if we can get some solid reproductions of context leaking or sharing. I can easily debug messages being sent to my model provider, and the message arrays do not share any context between chats or users.

Other implementation details like the model, quantization, and even the code implementation are relevant for the API side of things.

It is also worth knowing that if two users share the exact same thread id then they are using the same history. A long time ago we added the sessionID param to API chats, which is a foreign key you can use to chat with workspaces over the API without managing threads. This is the recommended way to support multi-user delineation in workspaces.

eg:

POST /v1/workspace/{slug}/chat
{
  "message": "What is AnythingLLM?",
  "mode": "chat",
  "sessionId": "identifier-one"
}
>> Now any subsequent chats with identifier-one will have its history
POST /v1/workspace/{slug}/chat
{
  "message": "What is AnythingLLM?",
  "mode": "chat",
  "sessionId": "identifier-two"
}
>> Now any subsequent chats with identifier-two will have its history; it has no context on the other sessionId
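
In an external chat app this just means deriving the sessionId from whatever already identifies the end user, roughly like so (Node sketch; URL, key, and slug are placeholders):

// Each end user (phone number, chat id, etc.) maps to their own sessionId,
// so histories stay isolated without having to create or track threads.
async function chatAs(userId, message) {
  const res = await fetch("http://localhost:3001/api/v1/workspace/my-workspace/chat", {
    method: "POST",
    headers: {
      Authorization: "Bearer <API_KEY>",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ message, mode: "chat", sessionId: `user-${userId}` }),
  });
  return res.json();
}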

History is loaded from https://github.com/Mintplex-Labs/anything-llm/blob/6c9e234227aa32a02ba6a05ee978534760e3fd74/server/utils/chats/index.js#L38 via https://github.com/Mintplex-Labs/anything-llm/blob/6c9e234227aa32a02ba6a05ee978534760e3fd74/server/utils/chats/apiChatHandler.js#L144

So when using /api/workspace/{slug}/{chat, stream-chat} https://github.com/Mintplex-Labs/anything-llm/blob/6c9e234227aa32a02ba6a05ee978534760e3fd74/server/endpoints/api/workspace/index.js#L659

If no sessionID is passed, it is null and all chats will be shared for all requests since there is no key to delineate them.

If you chat with api/workspace/{slug}/thread/{threadSlug}/{chat,stream-chat}, then you will hit this: https://github.com/Mintplex-Labs/anything-llm/blob/6c9e234227aa32a02ba6a05ee978534760e3fd74/server/endpoints/api/workspace/index.js#L659

Which will load history via user/thread ID overlap.

So let's see what we find. Thinking back, I think there may just be some confusion around the API implementation, which can certainly be improved.

timothycarambat avatar Dec 05 '24 18:12 timothycarambat

We have encountered the same operational issue as @thurkul. The responses via the API are ALWAYS shorter. Additionally, if we ask for the sources of information or to cite associated image URLs, we achieve almost 100% success using Anything-LLM directly. However, via the API the sources of information disappear from the response, and we rarely get it to cite the associated image URLs. We are 100% sure that the URL sanitization, or something in between, is altering the response over the API.

aeehliver avatar Dec 09 '24 16:12 aeehliver

I'm testing the latest version. Apparently they fixed this: I can now see all of the chats made via the API directly in the thread through the web interface.

Peterson047 avatar Dec 10 '24 15:12 Peterson047