langchain
ValueError in cosine_similarity when using FAISS index as vector store
I am getting the error below:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "...\langchain\vectorstores\faiss.py", line 285, in max_marginal_relevance_search
docs = self.max_marginal_relevance_search_by_vector(embedding, k, fetch_k)
File "...\langchain\vectorstores\faiss.py", line 248, in max_marginal_relevance_search_by_vector
mmr_selected = maximal_marginal_relevance(
File "...\langchain\langchain\vectorstores\utils.py", line 19, in maximal_marginal_relevance
similarity_to_query = cosine_similarity([query_embedding], embedding_list)[0]
File "...\langchain\langchain\math_utils.py", line 16, in cosine_similarity
raise ValueError("Number of columns in X and Y must be the same.")
ValueError: Number of columns in X and Y must be the same.
Code to reproduce this error
>>> model_name = "sentence-transformers/all-mpnet-base-v2"
>>> model_kwargs = {'device': 'cpu'}
>>> from langchain.embeddings import HuggingFaceEmbeddings
>>> embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
>>> from langchain.vectorstores import FAISS
>>> FAISS_INDEX_PATH = 'faiss_index'
>>> db = FAISS.load_local(FAISS_INDEX_PATH, embeddings)
>>> query = 'query'
>>> results = db.max_marginal_relevance_search(query)
Looking at the traceback, it seems that in this case query_embedding is 1 x model_dimension, while embedding_list is no_docs x model_dimension. Hence we should probably change the code to similarity_to_query = cosine_similarity(query_embedding, embedding_list)[0], i.e. remove the list around query_embedding.
Since this is a shared function, I am not sure whether this change would affect other embedding classes as well.
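A quick way to test this kind of shape hypothesis is to print the nesting of each argument before it reaches cosine_similarity. The sketch below is only an illustration: nested_shape is a hypothetical helper (not part of langchain), and the toy dimension and zero vectors stand in for real embeddings.

```python
def nested_shape(value):
    """Return the dimensions of a regular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)


model_dimension = 4  # toy value; a real model uses a much larger dimension

# Stand-ins for the stored document vectors and two possible query embeddings.
embedding_list = [[0.0] * model_dimension for _ in range(3)]
flat_query = [0.0] * model_dimension                          # a single vector
nested_query = [[0.0] * model_dimension for _ in range(5)]    # already a list of vectors

print(nested_shape(embedding_list))   # (3, 4)
print(nested_shape([flat_query]))     # (1, 4)
print(nested_shape([nested_query]))   # (1, 5, 4)
```

If the wrapped query comes out two levels deep with the same last dimension as embedding_list, the column check should pass. A three-level result would suggest that query_embedding was already a list of vectors before it was wrapped.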
I got this error too; something changed in the last 2-3 langchain versions.
I also got this error
Same problem here; it happens when using FAISS with MMR search.
If you remove search_type="mmr" from the retriever, the issue goes away, but I am not sure what effect that has.
Hence we should probably change the code to similarity_to_query = cosine_similarity(query_embedding, embedding_list)[0] i.e. remove the list from the query_embedding.
@dev2049, the code seems related to your PR https://github.com/hwchase17/langchain/pull/2915, any idea on this issue?
Taking a look shortly.
Looks like an error in the FAISS query embedding logic. By default it uses embeddings.embed_documents as the embedding_function (which expects a list of strings), so when we call self.embedding_function(query) with a single string, it embeds each character separately and returns a 2D list. Will fix, thanks for the catch, all!
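The behaviour described here can be reproduced without langchain. In the sketch below, embed_documents and embed_query are toy stand-ins (assumptions for illustration, not the library's implementations) that return one small vector per input item. The point is only that iterating over a bare string yields its characters.

```python
def embed_documents(texts):
    """Toy stand-in: return one 2-dimensional vector per item in `texts`."""
    return [[float(len(t)), float(sum(map(ord, t)))] for t in texts]


def embed_query(text):
    """Toy stand-in: embed a single string and return a single vector."""
    return embed_documents([text])[0]


query = "query"

# Passing a bare string: iteration goes character by character,
# so the result is a 2D list with one vector per character.
per_character = embed_documents(query)
print(len(per_character), len(per_character[0]))   # 5 2

# Wrapping the string in a list (as embed_query does) gives one vector.
single = embed_query(query)
print(len(single))                                 # 2
```

With a per-character result, wrapping it in another list before the similarity call produces a three-level structure. That is consistent with the column mismatch in the traceback above.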
@dev2049 Great, thank you so much for resolving the issue quickly!