Langchain-Chatchat

[BUG] Chroma vector database returns incorrect matches

Open HO-UN opened this issue 1 year ago • 4 comments

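Problem Description: When using chroma as the vector database, the same 3 matched passages are always returned no matter what question is asked, and the returned passages are completely unrelated to the question.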

Steps to Reproduce: Store the passage vectors with chroma, then ask a question.

Expected Result: The correct passages should be matched and the question answered correctly.

Actual Result: The same 3 incorrect passages are returned every time. Uploading 1708498132565.jpg…

Environment Information

  • langchain-ChatGLM version/commit number: (e.g., v2.0.1 or commit 123456)
  • Docker deployment used (yes/no): yes
  • Model used (ChatGLM2-6B / Qwen-7B, etc.): ChatGLM-6B
  • Embedding model used (moka-ai/m3e-base, etc.): m3e-base
  • Vector store type used (faiss / milvus / pg_vector, etc.): chroma
  • Operating system and version: win10
  • Python version: 3.10

HO-UN • Feb 21 '24 07:02

🤖

Hello, @HO-UN! Nice to see you again. I'm looking into your issue and will get back to you with a full answer shortly.

dosubot[bot] • Feb 21 '24 07:02

[Image: 1708498132565]

HO-UN • Feb 21 '24 07:02

🤖

Thank you for providing detailed information about the issue you're experiencing. I'm sorry to hear that you're having trouble with the Chroma Vector Database in the Langchain-Chatchat application.

Based on the information you've provided, it seems like the issue might be related to the do_search method in the ChromaKBService class. This method is responsible for retrieving documents based on their similarity to the query embeddings. If it's always returning the same three paragraphs, it's possible that there's an issue with how the similarity is being calculated or how the results are being sorted.

Unfortunately, I'm unable to view the screenshot you've provided as I can't process images. However, I can suggest a few things that might help you troubleshoot the issue:

  1. Check the query embeddings: Make sure that the embeddings for your queries are being generated correctly. You can do this by checking the output of the encode method in the EmbeddingService class.

  2. Check the similarity calculation: The do_search method uses the search method of the ChromaDB class to calculate the similarity between the query embeddings and the document embeddings. You might want to check if this calculation is being done correctly.

  3. Check the result sorting: After the similarity is calculated, the results are sorted based on their similarity scores. You might want to check if this sorting is being done correctly.

  4. Check the ChromaDB configuration: Make sure that the ChromaDB is configured correctly. You can do this by checking the ChromaDB class and the ChromaKBService class.

  5. Update to the latest version: The version field in your report still contains the template example (v2.0.1 or commit 123456), so it is unclear which version you are running. The latest version is v0.2.10. Updating to it might resolve the issue if it has been fixed in a later release.

If you're still experiencing the issue after trying these steps, it would be helpful if you could provide more information about the queries you're using and the results you're getting. This will help us better understand the issue and find a solution.
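Steps 1 and 3 above can be checked without touching the project's code, once the query vectors and the returned scores are available as plain Python lists. The following is a minimal sketch under that assumption. The helper names `embeddings_look_distinct` and `is_sorted_by_score` and the `0.999` threshold are hypothetical, chosen only for illustration; they are not part of Langchain-Chatchat or Chroma.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def embeddings_look_distinct(emb_a, emb_b, threshold=0.999):
    """Step 1: two unrelated queries should not map to (almost) the same direction.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return cosine_similarity(emb_a, emb_b) < threshold


def is_sorted_by_score(scores, higher_is_better):
    """Step 3: verify the result list is ordered best-first.

    Pass higher_is_better=True for similarity scores and False for distances.
    """
    pairs = list(zip(scores, scores[1:]))
    if higher_is_better:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)


# Example with made-up vectors and scores.
print(embeddings_look_distinct([1.0, 0.0], [0.0, 1.0]))
print(is_sorted_by_score([0.12, 0.35, 0.80], higher_is_better=False))
```

If two clearly different questions fail the first check, the problem is likely upstream of the vector store; if the second check fails, the score direction (distance versus similarity) is worth a closer look.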

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.

dosubot[bot] • Feb 21 '24 08:02

This is probably a data preparation problem, which makes matching difficult.

zRzRzRzRzRzRzR • Feb 24 '24 10:02

This issue has been marked as stale because it has had no activity for more than 30 days.

github-actions[bot] • Mar 25 '24 21:03

This is probably caused by missing normalization. I'll take a look when I have time.
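To illustrate how missing normalization could produce the reported symptom, here is a minimal toy sketch. It assumes ranking by raw inner product and uses made-up 2-dimensional vectors; the thread does not confirm that this is exactly what happens inside Langchain-Chatchat or Chroma. The helper names `l2_normalize` and `top_k` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, docs, k):
    """Return the names of the k documents with the largest inner product."""
    ranked = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
    return ranked[:k]


# Made-up embeddings: two documents have large norms, two are actually on topic.
docs = {
    "long_a": [9.0, 9.0],
    "long_b": [8.0, 8.5],
    "about_x": [1.0, 0.1],
    "about_y": [0.1, 1.0],
}
query_x = [1.0, 0.0]
query_y = [0.0, 1.0]

# Without normalization the large-norm documents win for every query.
raw_x = top_k(query_x, docs, 2)
raw_y = top_k(query_y, docs, 2)

# After L2 normalization the ranking depends on direction only.
unit_docs = {name: l2_normalize(vec) for name, vec in docs.items()}
norm_x = top_k(l2_normalize(query_x), unit_docs, 2)
norm_y = top_k(l2_normalize(query_y), unit_docs, 2)

print(raw_x, raw_y)
print(norm_x, norm_y)
```

In this toy setup both unnormalized queries return the same two large-norm documents, while the normalized queries each rank their on-topic document first, which matches the pattern of "the same unrelated passages for every question".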

tongtie • Mar 28 '24 14:03

Has this issue been resolved?

Shame-fight • Apr 01 '24 01:04

Has this issue been resolved?

Try: https://github.com/chatchat-space/Langchain-Chatchat/pull/3640

showmecodett • Apr 05 '24 05:04

@HO-UN Try: https://github.com/chatchat-space/Langchain-Chatchat/pull/3640

showmecodett • Apr 05 '24 05:04