
ValueError in cosine_similarity when using FAISS index as vector store

Open infinite-Joy opened this issue 1 year ago • 3 comments

Getting the below error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\langchain\vectorstores\faiss.py", line 285, in max_marginal_relevance_search
    docs = self.max_marginal_relevance_search_by_vector(embedding, k, fetch_k)
  File "...\langchain\vectorstores\faiss.py", line 248, in max_marginal_relevance_search_by_vector
    mmr_selected = maximal_marginal_relevance(
  File "...\langchain\langchain\vectorstores\utils.py", line 19, in maximal_marginal_relevance
    similarity_to_query = cosine_similarity([query_embedding], embedding_list)[0]
  File "...\langchain\langchain\math_utils.py", line 16, in cosine_similarity
    raise ValueError("Number of columns in X and Y must be the same.")
ValueError: Number of columns in X and Y must be the same.

Code to reproduce this error

>>> model_name = "sentence-transformers/all-mpnet-base-v2"
>>> model_kwargs = {'device': 'cpu'}
>>> from langchain.embeddings import HuggingFaceEmbeddings
>>> embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
>>> from langchain.vectorstores import FAISS
>>> FAISS_INDEX_PATH = 'faiss_index'
>>> db = FAISS.load_local(FAISS_INDEX_PATH, embeddings)
>>> query = 'query'
>>> results = db.max_marginal_relevance_search(query)

Stepping through the error, it seems that in this case query_embedding is 1 x model_dimension while embedding_list is no_docs x model_dimension. Hence we should probably change the code to similarity_to_query = cosine_similarity(query_embedding, embedding_list)[0], i.e. remove the list wrapper around query_embedding.

Since this is a common function, I am not sure whether this change would affect other embedding classes as well.
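To make the shape mismatch concrete, here is a minimal sketch of a cosine similarity helper with the same column check that raises in the traceback. It is an illustrative re-implementation using NumPy, not the code from langchain's math_utils.py; the array sizes and the names good_query and bad_query are made up for the example.

```python
import numpy as np

def cosine_similarity(X, Y):
    """Row-wise cosine similarity between two 2-D arrays (illustrative sketch)."""
    X = np.array(X, dtype=float)
    Y = np.array(Y, dtype=float)
    # Guard with the same message as the error in the traceback.
    if X.shape[1] != Y.shape[1]:
        raise ValueError("Number of columns in X and Y must be the same.")
    X_norm = np.linalg.norm(X, axis=1)
    Y_norm = np.linalg.norm(Y, axis=1)
    return (X @ Y.T) / np.outer(X_norm, Y_norm)

dim = 4                            # assumed embedding size, for illustration only
docs = np.ones((3, dim))           # 3 documents x dim

good_query = np.ones(dim)          # 1-D vector: wrapping it in a list gives (1, dim)
print(cosine_similarity([good_query], docs).shape)

bad_query = np.ones((5, dim))      # already 2-D: wrapping it gives (1, 5, dim)
try:
    cosine_similarity([bad_query], docs)
except ValueError as err:
    print(err)
```

Wrapping a 1-D query in a list is fine; the check only fails when the query embedding is already 2-D, which suggests looking at how the query is embedded rather than only at the call site.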

infinite-Joy avatar Apr 23 '23 07:04 infinite-Joy

I got this error too; something changed in the last 2-3 langchain versions.

moraneden avatar Apr 23 '23 12:04 moraneden

I also got this error

hramtsov avatar Apr 23 '23 12:04 hramtsov

Same problem here; it happens when using FAISS with MMR search.

yummydum avatar Apr 24 '23 13:04 yummydum

If you remove search_type="mmr" from the retriever, the error goes away... but I'm not sure what that setting does or how removing it affects the results.
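For context on what the setting changes: MMR (maximal marginal relevance) trades relevance to the query against redundancy among the results already picked, whereas a plain similarity search just returns the closest vectors. The following is a rough sketch of the general technique with NumPy, not langchain's implementation; the function name mmr, the parameter lambda_mult as used here, and the toy vectors are assumptions for illustration.

```python
import numpy as np

def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Greedy maximal marginal relevance over cosine similarities (sketch)."""
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # Normalise so that dot products are cosine similarities.
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    relevance = c @ q  # similarity of each candidate to the query
    selected = []
    while len(selected) < min(k, len(c)):
        best, best_score = None, -np.inf
        for i in range(len(c)):
            if i in selected:
                continue
            # Penalise candidates that resemble something already picked.
            redundancy = max((c[i] @ c[j] for j in selected), default=0.0)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Candidates 0 and 1 are duplicates; candidate 2 is different but less relevant.
cands = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.1]

print(mmr(query, cands, k=2))                   # diversity-aware pick
print(mmr(query, cands, k=2, lambda_mult=1.0))  # relevance only
```

With lambda_mult=1.0 the redundancy term vanishes and the selection reduces to a plain similarity ranking, so dropping search_type="mmr" avoids the error but may return near-duplicate documents.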

moraneden avatar Apr 24 '23 20:04 moraneden

Hence we should probably change the code to similarity_to_query = cosine_similarity(query_embedding, embedding_list)[0] i.e. remove the list from the query_embedding.

@dev2049, the code seems related to your PR https://github.com/hwchase17/langchain/pull/2915, any idea on this issue?

martin-liu avatar Apr 24 '23 23:04 martin-liu

taking a look shortly

dev2049 avatar Apr 25 '23 00:04 dev2049

looks like an error in the FAISS query embedding logic. by default it uses embeddings.embed_documents as the embedding_function (which expects a list of strings), so when we call self.embedding_function(query) with a single string it embeds each character separately and returns a 2D list instead of a single vector. will fix, thanks for the catch, all!
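The behaviour described here can be reproduced without any model. The sketch below uses a hypothetical stand-in class, ToyEmbeddings, whose vectors are fake; only the method names embed_documents and embed_query mirror the langchain embeddings interface.

```python
class ToyEmbeddings:
    """Hypothetical stand-in for an embeddings class; not HuggingFaceEmbeddings."""

    dim = 3  # assumed embedding size, for illustration only

    def _embed(self, text):
        # Deterministic fake vector; only the shape matters here.
        return [float(len(text))] * self.dim

    def embed_documents(self, texts):
        # Expects a list of strings and returns one vector per item.
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        # Expects a single string and returns a single vector.
        return self._embed(text)


emb = ToyEmbeddings()
query = "query"

# Passing a bare string: Python iterates over its characters,
# so we get one vector per character ('q', 'u', 'e', 'r', 'y').
wrong = emb.embed_documents(query)
right = emb.embed_query(query)

print(len(wrong), len(wrong[0]))  # 5 3  (2D: characters x dim)
print(len(right))                 # 3    (1D: dim)
```

Until a fix lands, one possible workaround (an assumption, not something confirmed in this thread) is to compute the vector with embeddings.embed_query(query) yourself and pass it to max_marginal_relevance_search_by_vector, which appears in the traceback.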

dev2049 avatar Apr 25 '23 01:04 dev2049

@dev2049 Great, thank you so much for resolving the issue quickly!

martin-liu avatar Apr 25 '23 03:04 martin-liu