Langchain-Chatchat
[FEATURE] Can the vector-matching engine and distance metric used by Faiss be changed?
In the configuration file I have already seen kbs_config = { "faiss": { }, and I also searched the logs and traced the source of 2024-02-29 14:34:03,340 - loader.py[line:54] - INFO: Loading faiss with AVX2 support. 2024-02-29 14:34:03,358 - loader.py[line:56] - INFO: Successfully loaded faiss with AVX2 support., but I still cannot find the parameters used when Faiss is initialized. I'd appreciate some clarification.
🤖
Hello, @liuchuan01! Nice to meet you. I'm Dosu, a bot that can help you solve problems, answer questions, and become a contributor. I'm here to help while you wait for a reply from a human maintainer. I'll take a look at your question right away and get back to you with a full answer as soon as possible.
I have now dug into the source code and can see that faiss.py uses the Flat engine and inner-product distance at initialization. How can I change this?
🤖
Based on your question, it seems you're looking to modify the initialization parameters for Faiss in the Langchain-Chatchat source code. These parameters can be found and modified in the load_vector_store method of the FaissKBService class. This method loads the Faiss vector store with parameters such as kb_name, vector_name, and embed_model.
Here's a snippet of the load_vector_store method:
def load_vector_store(self) -> ThreadSafeFaiss:
    return kb_faiss_pool.load_vector_store(kb_name=self.kb_name,
                                           vector_name=self.vector_name,
                                           embed_model=self.embed_model)
To modify the initialization parameters for Faiss, you would adjust the values of self.kb_name, self.vector_name, and self.embed_model before the load_vector_store method is called. These attributes are set in the do_init method and potentially elsewhere in the class or by the class's consumer.
Here's a snippet of the do_init method:
def do_init(self):
    self.vector_name = self.vector_name or self.embed_model
    self.kb_path = self.get_kb_path()
    self.vs_path = self.get_vs_path()
To modify the initialization parameters, you could directly set these attributes (self.kb_name, self.vector_name, self.embed_model) to the desired values before the Faiss index is loaded or created. This could be done in the class's constructor (the __init__ method) or before calling methods that load or manipulate the vector store, such as do_create_kb, do_add_doc, do_search, etc.
You can find more details in the source code at server/knowledge_base/kb_service/faiss_kb_service.py.
Regarding the "Loading faiss with AVX2 support" and "Successfully loaded faiss with AVX2 support" messages, these are likely coming from the underlying Faiss library itself when it is being imported or initialized, and not directly from the Langchain-Chatchat codebase. The exact source of these messages would depend on the version and configuration of the Faiss library you are using.
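The kind of log output you quoted typically comes from a try-the-accelerated-build-first import pattern. A toy sketch of that pattern follows; it uses generic importlib code and placeholder module names, and is illustrative only, not faiss's actual loader internals:

```python
import importlib
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("loader")

def load_first_available(candidates):
    """Import the first available module from `candidates`, logging each attempt.

    Mimics the try-the-accelerated-build-first pattern behind messages like
    "Loading faiss with AVX2 support"; names and logic here are placeholders,
    not faiss's actual loader code.
    """
    for name in candidates:
        try:
            logger.info("Loading %s.", name)
            module = importlib.import_module(name)
            logger.info("Successfully loaded %s.", name)
            return module
        except ImportError:
            logger.info("Could not load %s, falling back.", name)
    raise ImportError(f"none of {candidates} could be imported")

# A non-existent "accelerated" name falls back to a module that does exist.
mod = load_first_available(["json_avx2_build", "json"])
assert mod.__name__ == "json"
```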
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
Sources
About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
@dosu-bot after your mention, I found this in faiss_cache.py: vector_store = FAISS.load_local(vs_path, embeddings, normalize_L2=True, distance_strategy="METRIC_INNER_PRODUCT"). Now I can change distance_strategy, but I still want to change the index type, e.g. Flat, IVF, PQ... can you help me?
🤖
Based on the information you provided, it seems you are trying to modify the type of index used by Faiss in the Langchain-Chatchat source code. The load_vector_store method in the FaissKBService class is indeed where the Faiss index is initialized. However, the actual creation and configuration of the Faiss index are not shown in the provided context.
To change the type of Faiss index, you would likely need to modify the implementation of the kb_faiss_pool.load_vector_store method, or wherever the Faiss index is initialized within the kb_faiss_pool object. This could involve specifying a different index type when creating the Faiss index.
Here's a general approach to modifying the index type:
- Locate the index creation: find where the Faiss index is created in kb_faiss_pool or a related utility. This is where you specify the type of index Faiss should use.
- Modify the index type: change it to the desired one. For example, if you're switching to an IVF index for faster searches, you might change the index-creation code to something like:
import faiss
# Example of creating an IVF index
dimension = 128  # example dimension; adjust to match your embedding size
nlist = 100      # number of clusters
quantizer = faiss.IndexFlatL2(dimension)  # quantizer that assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_L2)
- Adjust index parameters: depending on the index type, you may need to set additional parameters or perform extra steps; an IVF index, for example, must be trained on a representative sample of vectors (index.train(...)) before vectors are added.
- Ensure compatibility: make sure the rest of your code that interacts with the Faiss index is compatible with the new index type; some index types have specific requirements or limitations.
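The inverted-file (IVF) idea above can be illustrated without faiss at all: vectors are bucketed under their nearest centroid at add time, and a query scans only the buckets of its closest centroids. The following is a toy, pure-Python sketch of that core idea, not faiss's implementation:

```python
import math

def l2(a, b):
    return math.dist(a, b)

class ToyIVF:
    """Toy inverted-file (IVF) index: just the core idea, not faiss."""

    def __init__(self, centroids, nprobe=1):
        self.centroids = centroids   # the "trained" coarse quantizer
        self.nprobe = nprobe         # number of clusters scanned per query
        self.lists = {i: [] for i in range(len(centroids))}

    def _nearest_centroids(self, v, k):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: l2(v, self.centroids[i]))
        return order[:k]

    def add(self, v):
        # Each vector goes into the inverted list of its nearest centroid.
        self.lists[self._nearest_centroids(v, 1)[0]].append(v)

    def search(self, q):
        # Only the nprobe closest clusters are scanned, not the whole index.
        best, best_d = None, float("inf")
        for i in self._nearest_centroids(q, self.nprobe):
            for v in self.lists[i]:
                d = l2(q, v)
                if d < best_d:
                    best, best_d = v, d
        return best

index = ToyIVF(centroids=[[0.0, 0.0], [10.0, 10.0]], nprobe=1)
for v in [[0.1, 0.2], [9.8, 10.1], [0.3, 0.1]]:
    index.add(v)
assert index.search([10.0, 10.0]) == [9.8, 10.1]
```

With nprobe smaller than the number of clusters, the search is approximate: a true nearest neighbor sitting in an unprobed cluster is missed, which is the speed/recall trade-off IVF makes.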
Without access to the implementation details of kb_faiss_pool.load_vector_store, or to how the Faiss index is initialized and managed within your codebase, this is only a general guide; you would need to adapt these steps to the specifics of your implementation.
In addition, I found a similar solved issue in the Langchain-Chatchat repository: [FEATURE] langchain-chatchat Faiss向量库的索引配置在哪看呢. The suggested solution indicates that the Faiss vector store's index configuration can be found in the kbs_config dictionary in the configs/kb_config.py.example file. The default index type for Faiss is not IndexFlatIP but IndexFlatL2, based on Euclidean distance. If you want to use cosine similarity, you need to normalize your vectors both before adding them to the index and before querying it.
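On that last point, a quick pure-Python check shows why normalization makes the two metrics interchangeable for ranking: after L2 normalization, the inner product equals cosine similarity, and squared L2 distance is a monotone function of it (d² = 2 − 2·cos). The helper names below are illustrative, not project code:

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
a, b = normalize(raw_a), normalize(raw_b)

cos = dot(raw_a, raw_b) / (math.hypot(*raw_a) * math.hypot(*raw_b))
ip = dot(a, b)       # inner product of the normalized vectors
d2 = l2_sq(a, b)     # squared L2 distance of the normalized vectors

assert abs(ip - cos) < 1e-12           # normalized inner product == cosine
assert abs(d2 - (2 - 2 * ip)) < 1e-12  # d^2 = 2 - 2*cos, so rankings agree
```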
I hope this helps! If you have any other questions, feel free to ask.
Sources:
- server/knowledge_base/kb_service/faiss_kb_service.py
- [FEATURE] langchain-chatchat Faiss向量库的索引配置在哪看呢
- configs/kb_config.py.example
@zRzRzRzRzRzRzR Hello, maintainer. After reading the source code in depth I now understand the mechanism for loading a knowledge base, and I have drawn up this flow chart (loading at the outermost layer of the knowledge-base QA function differs somewhat, because the version I read had been modified by others on my team).
I noticed that the new_vector_store function in faiss_cache.py specifies the distance metric when loading. Source:
vector_store = FAISS.from_documents([doc], embeddings, normalize_L2=True,distance_strategy="METRIC_INNER_PRODUCT")
This corresponds to the scenario of creating a new knowledge base. But as I kept digging into the source, I found that this ultimately calls the __from function in langchain's faiss.py, whose logic is:
if distance_strategy == DistanceStrategy.MAX_INNER_PRODUCT:
    index = faiss.IndexFlatIP(len(embeddings[0]))
else:
    # Default to L2, currently other metric types not initialized.
    index = faiss.IndexFlatL2(len(embeddings[0]))
where the constant is defined as:
MAX_INNER_PRODUCT = "MAX_INNER_PRODUCT"
This means the parameter passed when new_vector_store loads is not actually valid, and the final distance metric is still L2. Is my understanding mistaken? Could I submit a PR changing the passed parameter to EUCLIDEAN_DISTANCE? That would reduce confusion without changing actual behavior.
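The fall-through described here is easy to reproduce in isolation. In the sketch below, the DistanceStrategy class is a simplified stand-in for langchain's string-valued enum and pick_index mirrors the branch quoted above; both are illustrative, not the real implementation:

```python
from enum import Enum

class DistanceStrategy(str, Enum):
    """Simplified stand-in for langchain's enum; illustrative only."""
    EUCLIDEAN_DISTANCE = "EUCLIDEAN_DISTANCE"
    MAX_INNER_PRODUCT = "MAX_INNER_PRODUCT"

def pick_index(distance_strategy):
    """Mirrors the branch quoted above from langchain's __from."""
    if distance_strategy == DistanceStrategy.MAX_INNER_PRODUCT:
        return "IndexFlatIP"
    # Default to L2 -- including for unrecognized strings.
    return "IndexFlatL2"

# The misspelled string silently falls through to the L2 branch:
assert pick_index("METRIC_INNER_PRODUCT") == "IndexFlatL2"
# Only the exact value selects the inner-product index:
assert pick_index("MAX_INNER_PRODUCT") == "IndexFlatIP"
assert pick_index(DistanceStrategy.MAX_INNER_PRODUCT) == "IndexFlatIP"
```

Because the enum inherits from str, the exact string "MAX_INNER_PRODUCT" compares equal to the enum member, while any other string, including the misspelling, silently selects the L2 default.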
My team is continuously researching optimizations to this project's use of the vector database (Faiss), and we will also start testing the recall precision of inner-product and cosine-similarity matching. We hope to make larger contributions to this project going forward. Thanks!
This is a tough nut to crack, could really use your help on this, @imClumsyPanda.
How did this turn out in the end? I'm also puzzled by this code when I run it here. The log shows: 2024-03-08 15:49:35,289 - loader.py[line:64] - INFO: Loading faiss. 2024-03-08 15:49:35,320 - loader.py[line:66] - INFO: Successfully loaded faiss. /home/hvuser/.local/lib/python3.10/site-packages/langchain_community/vectorstores/faiss.py:121: UserWarning: Normalizing L2 is not applicable for metric type: METRIC_INNER_PRODUCT warnings.warn( — I'm not sure whether this has any impact.
This warning appears because, when a knowledge base is loaded locally, the distance metric is specified as METRIC_INNER_PRODUCT, but that loading path does not go through __from, so the metric type is not corrected. When faiss.py's __init__ method runs, the following code executes:
if (
    self.distance_strategy != DistanceStrategy.EUCLIDEAN_DISTANCE
    and self._normalize_L2
):
    warnings.warn(
        "Normalizing L2 is not applicable for metric type: {strategy}".format(
            strategy=self.distance_strategy
        )
    )
The warning is raised whenever the distance metric is not L2 while L2 normalization is enabled. This has been fixed on the latest dev branch; the fix, in faiss_cache.py, is:
vector_store = FAISS.load_local(vs_path, embeddings, normalize_L2=True)
which removes the manually specified distance metric, so loading falls back to the default L2 distance.
Finally, the knowledge-base creation flow was also corrected; see PullRequest#3206.
Doesn't _FaissPool in faiss_cache.py, around line 61, have the same call? Doesn't that part need changing too? vector_store = FAISS.from_documents([doc], embeddings, normalize_L2=True, distance_strategy="METRIC_INNER_PRODUCT")
Line 61 needs changing too; that is exactly what my open PR changes. Line 61 is the new-knowledge-base case, and load_local around line 111 is the local-loading case.
OK.
Hi, I'm also researching inner-product and cosine-similarity optimizations for the Faiss vector database. Thanks for the detailed analysis! A couple more questions:
- I haven't quite figured out the difference between the MAX_INNER_PRODUCT and METRIC_INNER_PRODUCT distance metrics.
- Is L2 the default distance metric of the Faiss vector store, i.e. for the score_threshold set later, does a lower value mean higher similarity?
- MAX_INNER_PRODUCT is a constant defined in langchain's source code and is used to initialize Faiss. METRIC_INNER_PRODUCT is probably a typo by the Langchain-Chatchat developers and currently has no effect.
- Yes, it is L2. Because METRIC_INNER_PRODUCT is not among the values enumerated in langchain's source code, initialization actually falls back to Euclidean (L2) distance, where a lower score means higher similarity. If you later compare inner-product results against Euclidean distance, I'd be glad to discuss them.
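The opposite score directions can be seen with a toy example (illustrative vectors, not project code): under L2, the closer document gets the smaller score, while under inner product it gets the larger one.

```python
def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0]
close = [0.9, 0.1]   # points almost the same way as the query
far   = [0.1, 0.9]   # almost orthogonal to the query

# L2 distance: a SMALLER score means more similar.
assert l2_sq(query, close) < l2_sq(query, far)
# Inner product: a LARGER score means more similar.
assert inner_product(query, close) > inner_product(query, far)
```

This is why a score_threshold tuned for one metric cannot be reused unchanged for the other: the comparison direction flips.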
Thanks, I get it now. METRIC_INNER_PRODUCT probably is indeed a typo, so as you said above, deleting the trailing distance_strategy="METRIC_INNER_PRODUCT" on line 61 removes the ambiguity, right?
I think I've mostly figured it out:
- With vector_store = FAISS.load_local(vs_path, embeddings, normalize_L2=True), loading uses the default L2 distance: the lower the score, the higher the similarity.
- If changed to vector_store = FAISS.from_documents([doc], embeddings, normalize_L2=True, distance_strategy="MAX_INNER_PRODUCT"), loading should use maximum inner product, i.e. cosine similarity on normalized vectors: the higher the score, the higher the similarity.
Using approach 1 in langchain-chatchat, knowledge-base QA always seems to match a lot of not-very-relevant passages. Have you tried approach 2, retrieving similar vectors by inner product? I'm going to try it and compare the results.
After trying approach 2, I found that specifying the distance metric as "MAX_INNER_PRODUCT" (inner-product/cosine-similarity) triggers the same UserWarning as above: Normalizing L2 is not applicable for metric type: METRIC_INNER_PRODUCT. Have you run into this?
After changing the metric to "MAX_INNER_PRODUCT", knowledge-base QA can no longer retrieve any documents, and the LLM just answers on its own. I don't know why. Is there any way to fix this? Please help.
@zixiaotan21 Hi, what differences did you find between the inner-product and L2 results after your experiments?