Add more index methods to faiss.

Open PhilipMay opened this issue 1 year ago • 2 comments

Feature request

At the moment faiss is hard wired to IndexFlatL2.

See here:

https://github.com/hwchase17/langchain/blob/423f497168e3a8982a4cdc4155b15fbfaa089b38/langchain/vectorstores/faiss.py#L347

I would like to set other index methods. For example IndexFlatIP. This should be configurable.

Also see more index methods here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes

Motivation

If I have dot product as the distance for my embedding I must change this...
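To make the motivation concrete, here is a small pure-Python sketch (no faiss involved; the vectors and the names `query` and `docs` are made up for illustration). It shows that nearest-neighbour ranking by squared L2 distance and ranking by inner product can disagree when the embeddings are not normalized, which is why a hard-wired IndexFlatL2 can return different results from what a dot-product model expects.

```python
def squared_l2(a, b):
    """Squared Euclidean distance (smaller means closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    """Dot product (larger means more similar)."""
    return sum(x * y for x, y in zip(a, b))


# Toy, unnormalized 2-d "embeddings" (hypothetical values).
query = [1.0, 0.0]
docs = {"short": [0.9, 0.0], "long": [3.0, 0.5]}

# L2 picks the vector closest in space; inner product rewards large magnitude.
best_by_l2 = min(docs, key=lambda name: squared_l2(query, docs[name]))
best_by_ip = max(docs, key=lambda name: inner_product(query, docs[name]))

print(best_by_l2, best_by_ip)
```

With these toy values the two metrics pick different documents, so the choice of index type is not just a performance detail.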

Your contribution

I can provide a PR if wanted.

PhilipMay avatar May 06 '23 12:05 PhilipMay

Can you use:

import faiss
from langchain import FAISS

# d is the dimensionality of your embedding vectors
index = faiss.IndexFlatIP(d)
vectorstore = FAISS(embedding_function, index, docstore, index_to_docstore_id)

Then use the add_texts and add_embeddings methods.

prathmeshranaut avatar May 07 '23 04:05 prathmeshranaut

Yep, it is a pity that the FAISS LangChain utility for creating vectorstores is hardcoded to use L2 indexes... Especially considering how popular FAISS is as an open-source vectorstore and how relevant inner product / cosine similarity is for text similarity (used by Azure OpenAI: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/understand-embeddings).

At least, cosine similarity (i.e. IndexFlatIP with the already-in-place normalize_L2 flag set to True) would be a great addition to the .from_texts() or .from_documents() wrappers, imho...
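For anyone wondering why "inner product plus L2 normalization" amounts to cosine similarity, here is a minimal pure-Python sketch. It does not touch faiss or LangChain; the helper names and the vectors `a` and `b` are assumptions made up for this illustration.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


# Hypothetical, unnormalized vectors.
a = [3.0, 4.0]
b = [1.0, 2.0]

# Normalizing first and then taking the inner product gives the cosine.
ip_of_normalized = inner_product(l2_normalize(a), l2_normalize(b))
cosine = cosine_similarity(a, b)

print(math.isclose(ip_of_normalized, cosine))
```

So an inner-product index fed with unit-length vectors ranks results by cosine similarity, as long as both the stored vectors and the query are normalized.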

afdezt avatar May 23 '23 16:05 afdezt

In addition to vanilla cosine similarity, I would also propose sliding window maximum cosine similarity, as outlined in Section 3.2.1 of "Sentence Similarity Techniques for Short vs Variable Length Text using Word Embeddings". I've found it to be empirically useful for retrieval when the prompt is very short but the relevant document is much longer. I'm not sure whether this can be implemented fairly easily within the existing langchain framework, or whether it can only be done in faiss.
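A rough sketch of the idea, under stated assumptions: this is one plausible reading of "sliding window maximum cosine similarity" and has not been checked against the exact definition in the cited paper. It assumes each text is a non-empty list of word vectors, that a text is represented by the mean of its word vectors, and that the window length equals the query length. All names and toy vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sliding_window_max_cosine(query_vecs, doc_vecs):
    """Max cosine between the query mean and each window mean of the doc."""
    q = mean_vector(query_vecs)
    window = min(len(query_vecs), len(doc_vecs))
    return max(
        cosine(q, mean_vector(doc_vecs[i:i + window]))
        for i in range(len(doc_vecs) - window + 1)
    )


# Toy 2-d "word embeddings": the doc contains the query's content in the middle.
query = [[1.0, 0.0], [1.0, 0.0]]
doc = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

whole_doc_score = cosine(mean_vector(query), mean_vector(doc))
windowed_score = sliding_window_max_cosine(query, doc)

print(whole_doc_score, windowed_score)
```

In this toy case the windowed score is higher than the whole-document score, because the matching span is no longer diluted by the rest of the document. Note that this scoring runs over per-window vectors, so it would sit on top of (or outside) a plain vector index rather than being a drop-in faiss index type.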

AlexHuang2 avatar Jun 06 '23 22:06 AlexHuang2

Hi, @PhilipMay! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you requested to add more index methods to faiss, specifically the ability to set other index methods such as IndexFlatIP. There have been some suggestions in the comments, such as using the FAISS utility from LangChain to achieve this. Additionally, there was a suggestion from afdezt to add cosine similarity and include IndexFlatIP with the normalize_L2 flag set to True. AlexHuang2 also proposed adding sliding window maximum cosine similarity.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

dosubot[bot] avatar Sep 15 '23 16:09 dosubot[bot]