Langchain-Chatchat: How to implement hybrid retrieval

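@dosu-bot I would like to introduce a custom retrieval method into knowledge base Q&A. What should I modify so that literal-match retrieval runs on top of the FAISS vector database? If I also want keyword retrieval, where a vector hit on one CSV column returns another column of the same row, what should I change? On top of that, I would like to call a rerank model to re-rank the results together with the existing vector retrieval results. How can that be implemented?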

May 12 '24 03:05 WSC741606

To implement the mixed retrieval system you're envisioning for your knowledge base Q&A, incorporating custom search methods alongside FAISS vector database searches, literal match retrieval, keyword searches, and reranking with a model, you can follow a structured approach:

Integrating Literal Match and FAISS Vector Searches: Begin by enhancing your system to support both literal match searches, which rely on direct string comparisons, and FAISS vector searches. For the vector searches, you'll need to use the FAISS library to create a vector database of your documents. This involves converting your documents and queries into vectors and then using FAISS to identify the nearest neighbors.
Keyword Searches with Specific CSV Data Returns: To implement keyword searches that target a specific column in a CSV file and return data from another column, you'll need to organize your CSV data in a searchable format. Consider using a database or an in-memory data structure that maps keywords to their corresponding outputs. This setup will allow you to efficiently retrieve the desired information based on the keywords present in the query.
Incorporating a Rerank Model: For reranking the initial search results, select an appropriate rerank model that can process the initial results and reorder them based on more sophisticated criteria or algorithms. This step is crucial for refining the search outcomes and ensuring that the most relevant results are presented to the user.

By systematically implementing these components, you'll be able to build a comprehensive mixed retrieval system that leverages the strengths of both literal and vector-based searches, enriched with the precision of keyword-specific data retrieval and the sophistication of reranking algorithms. This approach will significantly enhance the accuracy and relevance of the search results in your knowledge base Q&A system.
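The CSV keyword lookup described above (match on one column, return another column of the same row) could be sketched roughly as follows. This is a minimal illustration using only the standard library; `build_column_lookup`, `keyword_search`, and the sample column names are hypothetical and are not part of Langchain-Chatchat.

```python
import csv
import io


def build_column_lookup(csv_text, match_column, return_column):
    """Map each value of match_column to the return_column values of its rows."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[match_column].strip()
        lookup.setdefault(key, []).append(row[return_column].strip())
    return lookup


def keyword_search(query, lookup):
    """Return the mapped values for every key that occurs literally in the query."""
    results = []
    for key, values in lookup.items():
        if key and key in query:
            results.extend(values)
    return results


# Hypothetical sample data: match on "question", return "answer".
SAMPLE_CSV = (
    "question,answer\n"
    "reset password,Open Settings and choose Reset.\n"
    "refund,Refunds take 5 days.\n"
)

sample_lookup = build_column_lookup(SAMPLE_CSV, "question", "answer")
print(keyword_search("how do I reset password?", sample_lookup))
```

The same lookup can also be applied to a vector hit: if the text of a retrieved chunk equals a key, the mapped values are the "other column" to return.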


About Dosu This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

May 12 '24 03:05 dosubot[bot]

@dosu Please list the files that need to be modified and the concrete implementation.

May 12 '24 04:05 WSC741606

To implement the functionalities you're looking for in the Langchain-Chatchat system, here are the specific steps and modifications needed:

For Mixed Literal and FAISS Vector Searches: Modify server/knowledge_base/kb_cache/faiss_cache.py. Extend the ThreadSafeFaiss class to include methods for both literal match and vector searches. Implement a literal_search method for direct string comparisons and a mixed_search method that combines literal and vector searches.
For Keyword Searches in a CSV File: Adjust document_loaders/FilteredCSVloader.py. Enhance the FilteredCSVLoader class to support keyword searches that target a specific column for the search and return data from another column. Implement a keyword_search method that filters rows based on the search keyword and returns the desired column data.
For Incorporating a Rerank Model: Update server/reranker/reranker.py. Modify the LangchainReranker class or the CrossEncoder model to rerank initial search results based on relevance to the query. Implement a rerank method that scores documents against the query and sorts them by relevance.

These steps provide a structured approach to integrating custom search methods, including literal match, keyword search, and reranking functionalities, into your system.
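The reranking step could look roughly like the sketch below. It assumes the scorer is any callable that takes (query, document) pairs and returns one score per pair. A cross-encoder such as sentence-transformers' `CrossEncoder.predict` has that shape, but here a simple token-overlap stand-in is used so the example runs without downloading a model. The names `rerank` and `overlap_scorer` are hypothetical and do not reflect the actual `LangchainReranker` interface.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    docs: Sequence[str],
    score_pairs: Callable[[List[Pair]], Sequence[float]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair and return the top_n docs, best first."""
    pairs = [(query, doc) for doc in docs]
    scores = score_pairs(pairs)
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def overlap_scorer(pairs: List[Pair]) -> List[float]:
    """Stand-in scorer (assumption): fraction of query tokens found in the doc."""
    scores = []
    for query, doc in pairs:
        query_tokens = set(query.lower().split())
        doc_tokens = set(doc.lower().split())
        scores.append(len(query_tokens & doc_tokens) / max(len(query_tokens), 1))
    return scores


candidates = [
    "how to cook rice",
    "rebuild the faiss index after adding files",
    "faiss overview",
]
for doc, score in rerank("faiss index rebuild", candidates, overlap_scorer, top_n=2):
    print(f"{score:.2f}  {doc}")
```

Passing the scorer in as a parameter keeps the sorting logic independent of whichever rerank model is eventually plugged in.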


May 12 '24 04:05 dosubot[bot]

Hi, here is one approach:

Open the file server/chat/knowledge_base_chat.py
Around line 62

    async def knowledge_base_chat_iterator(
            query: str,
            top_k: int,
            history: Optional[List[History]],
            model_name: str = model_name,
            prompt_name: str = prompt_name,
    ) -> AsyncIterable[str]:

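The query parameter here is your question.
Line 99 is context = "\n".join([doc.page_content for doc in docs]). This context is the text retrieved from the vector store for your query.
Once you have located query and context in the code, you can define your own retrieval function that takes the query and returns a result, then append that result directly to context. For example: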

    context = "\n".join([doc.page_content for doc in docs])
    context += my_search_func(query)

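Note line 101, if len(docs) == 0: # if no relevant documents are found, use the empty template. I suggest changing it to if len(docs) == 0 and <your custom retrieval also found nothing>: because when the vector store matches nothing (len(docs) == 0), the system skips the knowledge base Q&A logic, so your custom retrieval function has no effect even if it did find content. (This usually happens when the vector store has no data or when the knowledge match score threshold is set low.)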
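A minimal sketch of such a custom retrieval function and the adjusted empty check is shown below. Everything in it is an assumption for illustration: `LITERAL_CORPUS` is a hypothetical in-memory store, `build_context` is a hypothetical helper standing in for the logic around lines 99 to 101, and whitespace splitting is only a placeholder tokenizer (Chinese text would need a real word segmenter).

```python
from typing import List, Tuple

# Hypothetical in-memory corpus; in practice this would be your own data source.
LITERAL_CORPUS = [
    "FAISS is a library for vector similarity search.",
    "BM25 is a keyword-based ranking function.",
]


def my_search_func(query: str, corpus: List[str] = LITERAL_CORPUS, max_hits: int = 3) -> str:
    """Return corpus entries that literally contain any whitespace-separated query term."""
    terms = [term for term in query.lower().split() if term]
    hits = [text for text in corpus if any(term in text.lower() for term in terms)]
    return "\n".join(hits[:max_hits])


def build_context(vector_hits: List[str], query: str) -> Tuple[str, bool]:
    """Combine vector hits with literal hits and decide whether to use the empty template."""
    extra = my_search_func(query)
    parts = list(vector_hits)
    if extra:
        parts.append(extra)  # joined with a newline so the two sources do not run together
    context = "\n".join(parts)
    # Fall back to the empty template only when BOTH retrievers found nothing.
    use_empty_template = len(vector_hits) == 0 and not extra
    return context, use_empty_template


print(build_context([], "BM25"))
print(build_context([], "zzz"))
```

The second call shows the case the note above is about: only when neither retriever returns anything should the empty template be used.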

May 13 '24 10:05 lkjj-lkjj

Thanks for the reply! I'll give it a try~

May 13 '24 10:05 WSC741606

Bro, did you manage to implement it?

May 22 '24 08:05 tuqingwen

https://github.com/Chocolate-Black/Langchain-MO-AI-Chat @tuqingwen I implemented it by adapting this project.

May 22 '24 08:05 WSC741606

Thank you so much! I'll go take a look!!

May 22 '24 09:05 tuqingwen

Sorry to bother you, but could you share your modified code? I'd like to use it as a reference.

May 23 '24 08:05 tuqingwen

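@tuqingwen My changes are a mess. Besides this feature I changed some other things too, and I've forgotten what I touched... As I recall, his project runs into a get***id problem with the latest version. If it doesn't work, roll back to the langchain-chatchat version he used, from around March I think; that should probably work.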

May 23 '24 08:05 WSC741606

As I recall, his repo also bundles a complete langchain-chatchat version, so alternatively you could build on top of his project.

May 23 '24 08:05 WSC741606

May 28 '24 02:05 nilin1998

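@nilin1998 The point is that his implementation lacks reading and deleting docs by their IDs, so you need to implement those ***by_ids functions yourself. If that doesn't work, roll back to the langchain-chatchat version he used, from around March I think; that should probably work.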

May 28 '24 05:05 WSC741606

Have you implemented that part on your side?

May 28 '24 06:05 nilin1998

@nilin1998 Yes, I have, but my changes are a mess. Besides this feature I changed some other things too, and I've forgotten what I touched...

May 28 '24 06:05 WSC741606

Hello, have you got it working on your side?

May 28 '24 06:05 nilin1998

@nilin1998 Yes, I have, but my changes are a mess. Besides this feature I changed some other things too, and I've forgotten what I touched...

May 28 '24 07:05 WSC741606

A question: did you add query rewriting?

May 28 '24 09:05 tuqingwen

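@tuqingwen I tried using an LLM to rewrite the input prompt before passing it to the knowledge base. I implemented it by adding a dialogue mode in webui_pages/dialogue/dialogue.py, and the LLM Q&A step uses a new prompt template. In my experience, how well the "answer once first, then pass the result to the knowledge base" step works depends heavily on the LLM used.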

May 28 '24 09:05 WSC741606

May 28 '24 09:05 tuqingwen

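Hello, did you actually test whether the BM25 algorithm takes effect? I found that BM25 did not take effect, and only FAISS was active in hybrid retrieval. Debugging showed that documents_number inside it is always 0, so the whole algorithm never takes effect.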

--- Original message --- From: @.> Sent: Tuesday, May 28, 2024, 6:00 PM To: @.>; Cc: @.@.>; Subject: Re: [chatchat-space/Langchain-Chatchat] How to implement hybrid retrieval (Issue #3994)

Thanks for your answer! Could I have your contact information?

@.*** I'd rather not post my WeChat in public; please send me an email.


May 29 '24 01:05 nilin1998

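It does take effect. Using the Langchain-MO-AI-Chat project directly does seem to cause some problems. My suggestion is to use the chatchat project and follow MO-AI for the hybrid retrieval part, then set the corresponding parameters correctly, and in particular make sure the files in the server/knowledge_base package are modified properly.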

May 29 '24 01:05 tuqingwen

May 29 '24 01:05 nilin1998

No, I didn't.

May 29 '24 02:05 tuqingwen

Then did you make any changes to mix_kb_service.py?

May 29 '24 02:05 nilin1998

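@nilin1998 It does take effect for me, but I replaced the hand-computed BM25 in there with a library call, and for the hybrid part I normalize the two sets of scores myself and add them up. If BM25 never retrieves any documents, I suggest first trying BM25 alone to confirm whether it can match anything. It is also possible that the documents and the query simply aren't matched by BM25. That repo supports using BM25 only.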
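The "normalize both score sets, then add them" idea could be sketched as below. This is not the code from the thread or from Langchain-MO-AI-Chat; `min_max`, `fuse`, and `bm25_weight` are hypothetical names, min-max normalization is one possible choice, and the sketch assumes both inputs are "higher is better" scores keyed by document ID. Vector stores that return distances (lower is better) would need converting to similarities first.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, treat every doc as a full match."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (value - low) / (high - low) for doc_id, value in scores.items()}


def fuse(
    bm25_scores: Dict[str, float],
    vector_scores: Dict[str, float],
    bm25_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a doc missing from one retriever gets 0 there."""
    norm_bm25 = min_max(bm25_scores)
    norm_vector = min_max(vector_scores)
    fused = {
        doc_id: bm25_weight * norm_bm25.get(doc_id, 0.0)
        + (1.0 - bm25_weight) * norm_vector.get(doc_id, 0.0)
        for doc_id in set(norm_bm25) | set(norm_vector)
    }
    # Sort by score descending, then by ID so ties are ordered deterministically.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


bm25 = {"a": 3.0, "b": 1.0, "c": 2.0}
vector = {"a": 0.8, "b": 0.9, "d": 0.5}
for doc_id, score in fuse(bm25, vector):
    print(doc_id, round(score, 3))
```

Running BM25 alone, as suggested above, corresponds to `bm25_weight=1.0`, which makes it easy to check whether the keyword side contributes anything at all.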

May 29 '24 02:05 WSC741606

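I built two knowledge bases, one using BM25 and one using hybrid retrieval, and BM25 did not take effect in either. The cause is the same as for the poster above: documents_number inside it is always 0, so the whole algorithm never takes effect.

Found the problem: changing the not_refresh_vs_cache=true here fixes it.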

Jul 31 '24 07:07 Jieszs

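666666, I just saw the email notification, hahaha. I hadn't noticed this problem before; I'm not sure whether it's a parameter that appeared in a later version update.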

Aug 01 '24 03:08 WSC741606

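Question: BM25 retrieval and semantic vector similarity retrieval should both only come into play once the knowledge base has already been built, when a query comes in and the knowledge base is searched, right?

Quoted from the post above: I built two knowledge bases, one using BM25 and one using hybrid retrieval, and BM25 did not take effect in either. The cause is the same as for the poster above: documents_number inside it is always 0, so the whole algorithm never takes effect.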

Aug 21 '24 11:08 Sn-HIT

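I think so. The knowledge base first converts the text into vectors; at retrieval time the input query is converted into a vector with the same method, and then the vectors are compared for similarity. If the knowledge base hasn't been built first, there are no vectors (no index) to compare against, so no sufficiently similar documents (hits) are returned.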
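The same "build first, then query" dependency applies to BM25, which is consistent with the documents_number symptom reported above. The toy index below is only an illustration and is not the Langchain-MO-AI-Chat implementation: `TinyBM25` is a hypothetical class, the `documents_number` name is borrowed from the thread purely for readability, and whitespace tokenization is a placeholder.

```python
import math
from collections import Counter
from typing import List, Tuple


class TinyBM25:
    """Toy BM25 index: documents must be added before search can return anything."""

    def __init__(self, k1: float = 1.5, b: float = 0.75) -> None:
        self.k1 = k1
        self.b = b
        self.docs: List[List[str]] = []  # tokenized documents
        self.doc_freq: Counter = Counter()  # number of docs containing each term

    @property
    def documents_number(self) -> int:
        return len(self.docs)

    def add_documents(self, texts: List[str]) -> None:
        for text in texts:
            tokens = text.lower().split()
            self.docs.append(tokens)
            self.doc_freq.update(set(tokens))

    def search(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        if self.documents_number == 0:
            return []  # nothing indexed, so nothing can match
        avg_len = sum(len(doc) for doc in self.docs) / self.documents_number
        results = []
        for idx, tokens in enumerate(self.docs):
            term_freq = Counter(tokens)
            score = 0.0
            for term in query.lower().split():
                if term not in term_freq:
                    continue
                n = self.doc_freq[term]
                idf = math.log((self.documents_number - n + 0.5) / (n + 0.5) + 1.0)
                tf = term_freq[term]
                norm = tf + self.k1 * (1.0 - self.b + self.b * len(tokens) / avg_len)
                score += idf * tf * (self.k1 + 1.0) / norm
            if score > 0:
                results.append((idx, score))
        results.sort(key=lambda item: item[1], reverse=True)
        return results[:top_k]


index = TinyBM25()
print(index.documents_number, index.search("bm25 ranking"))  # empty index
index.add_documents(["faiss vector index", "bm25 keyword ranking"])
print(index.documents_number, index.search("bm25 ranking"))
```

Printing the document count before querying is a quick way to tell "the index was never populated" apart from "the query genuinely matches nothing".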

Aug 21 '24 11:08 WSC741606
