
Chroma vectorstore search does not return top-scored embeds

Open egils-mtx opened this issue 1 year ago • 11 comments

Summary: the Chroma vectorstore search does not return top-scored embeds.

The issue appears only when the number of documents in the vector store exceeds a certain threshold (I have ~4000 chunks); I could not determine exactly where it breaks.

I loaded my documents, chunked them, and then indexed into a vectorstore:

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(all_docs, embeddings)

Then I tried to search this vector store:

text = "my search text"
list(score for doc, score in docsearch.similarity_search_with_score(text))

Output:

[0.3361772298812866,
 0.3575538694858551,
 0.360953152179718,
 0.36677438020706177]

The search did not return the expected document (within the list of 4 items returned by default).

Then I performed another test, forcing the search to return all scores, and got the expected result:

list(score for doc, score in docsearch.similarity_search_with_score(text, len(all_docs))[:4])

Output:

[0.31715911626815796,
 0.3361772298812866,
 0.3575538694858551,
 0.360953152179718]

You can clearly see that the top scores are different.

Any help is appreciated.

egils-mtx avatar Mar 24 '23 00:03 egils-mtx

@egils-mtx are the embeddings getting recomputed by chance? OpenAI embeddings for example are not deterministic.

If you are open to sharing some code - I'd love to help more! [email protected]

jeffchuber avatar Mar 29 '23 21:03 jeffchuber

@jeffchuber no, once calculated, they have not changed in any way. Both methods were tested against the same version of the vectorstore.

egils-mtx avatar Mar 30 '23 10:03 egils-mtx

@egils-mtx assuming the data is not changing, the only reason things might be different is that Chroma uses an approximate nearest neighbor (ANN) algorithm called HNSW, which is not deterministic. You can do exact k-nearest-neighbor (k-NN) search and brute-force it if you need exactly the same results every time. The benefit of ANN is that it scales much further. If your embeddings have low dimensionality and you have a lot of similar data, this effect will likely be more pronounced.
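For illustration, exact brute-force k-NN over a small set of vectors looks like this (a standalone sketch, not Chroma's API; the names and toy vectors are made up):

```python
def l2_sq(a, b):
    # squared L2 distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_knn(query, vectors, k=4):
    # brute force: score every vector, then keep the k smallest distances;
    # unlike ANN, this always returns the true top-k
    order = sorted(range(len(vectors)), key=lambda i: l2_sq(query, vectors[i]))
    return order[:k]

vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
print(exact_knn([1.0, 0.0], vectors, k=2))  # -> [1, 2], the two closest vectors
```

This is O(n) per query, which is exactly the scaling cost ANN indexes like HNSW are built to avoid.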

jeffchuber avatar Mar 30 '23 20:03 jeffchuber

@jeffchuber The issue is that a similarity search against the Chroma vectorstore returns only 4 results by default, and those are not the top-scoring ones. But when I instruct it to return all results, it turns out there are higher-scored results that were not returned by default. It is not about whether the algorithm is deterministic.

In my example, the embed with score 0.31715911626815796 was not returned by default, but it is definitely the best-scoring result and should have been at index [0] when querying with docsearch.similarity_search_with_score(text), yet it was not.

egils-mtx avatar Mar 31 '23 07:03 egils-mtx

@egils-mtx we are looking into this some more.

jeffchuber avatar Apr 02 '23 20:04 jeffchuber

Hi @egils-mtx, I suspect this is due to a quirk of the HNSW algorithm, which is basically a greedy search through a graph. It essentially keeps a priority queue by distance of nodes to traverse next, and limits the nodes it adds to search to at most N nodes from a given neighbor. The size of this N is correlated to your recall since it controls the proportion of the graph traversed. This parameter is called efSearch or search_ef in the parlance of the algorithm.

However, in order to return at least K results, the algorithm sets efSearch to the max of the supplied efSearch and your requested K. By default, efSearch is 10, so by asking for len(all_docs) you actually increase the search accuracy, since max(10, len(all_docs)) ≈ 4000. That is a very high efSearch for your domain. Could you try passing index parameters to the collection that set search_ef to perhaps 50 or 100? Then I would expect your results to be consistent across requests. I also suggest limiting K to the length you actually need, for optimal performance.
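The interaction described above can be sketched in a couple of lines (the function name is illustrative, not Chroma's API):

```python
def effective_ef(search_ef, k):
    # HNSW must keep at least k candidates in its priority queue,
    # so the effective candidate-list size is the larger of the two
    return max(search_ef, k)

print(effective_ef(10, 4))     # -> 10: default query explores few candidates
print(effective_ef(10, 4000))  # -> 4000: k = len(all_docs) is effectively exhaustive
```

This is why asking for all documents and slicing the first 4 "fixes" the ranking: the huge K silently raises the search effort.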

You can see how to set index params here: https://github.com/chroma-core/chroma/blob/925023ae05789a7908707473d1c2e9ab94343ea5/chromadb/test/test_api.py#L1236, although I'm not 100% sure langchain will allow the parameterization needed.

Let me know if that helps!

HammadB avatar Apr 04 '23 03:04 HammadB

I just started trying out Chroma, and this was very confusing compared to FAISS. When I pass k=4 into:

self.similarity_search_with_score(query, 4)

it returns:

[0.92814052 0.93267518 0.97363865 0.97363865]

but if I pass 1000 it returns as first 4:

[0.62706429 0.62957144 0.635158   0.63717765]

I would say this is highly unexpected, since k=4 is the default, and it is very different behavior from the equivalent functions in FAISS.

Note that, as the OP said, there is no non-deterministic behavior here. I checked all the text chunks, and the embeddings are identical between FAISS and Chroma. I also checked the full list of scores: if I pass k = number of records, they are identical between FAISS and Chroma. The only problem is when k is small, which matters because small k is the default, and because users expect the top documents, not just some documents.

In my case the k=4 from Chroma are horrible results compared to the "true" lowest distances.

pseudotensor avatar May 09 '23 07:05 pseudotensor

@pseudotensor Chroma uses HNSW under the hood and FAISS very often uses HNSW as well.

With FAISS - what algorithm are you using?

jeffchuber avatar May 10 '23 13:05 jeffchuber

> @pseudotensor Chroma uses HNSW under the hood and FAISS very often uses HNSW as well.
>
> With FAISS - what algorithm are you using?

Defaults (apologies, I have only been looking at this for the last 2 days, so I'm not sure what the FAISS default is).

pseudotensor avatar May 10 '23 18:05 pseudotensor

Hmm I am not sure actually!

https://faiss.ai/cpp_api/structlist.html


jeffchuber avatar May 10 '23 20:05 jeffchuber

Does score mean the "distance"? (So a smaller distance means higher similarity.) It seems there is no _similarity_search_with_relevance_scores function in chroma.py to normalize the similarity score the way faiss/redis/weaviate do.
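For reference, one common normalization for L2 distance between unit-norm embeddings can be sketched like this (an assumption about the convention used elsewhere in langchain, not a claim about the Chroma wrapper):

```python
import math

def euclidean_relevance_score(distance):
    # maps an L2 distance between unit-norm vectors onto a relevance score,
    # where 1.0 means identical and 0.0 means orthogonal
    return 1.0 - distance / math.sqrt(2)

print(euclidean_relevance_score(0.0))           # identical vectors -> 1.0
print(euclidean_relevance_score(math.sqrt(2)))  # orthogonal unit vectors -> 0.0
```

So yes: the raw score returned by similarity_search_with_score is a distance, and smaller is better.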

jpzhangvincent avatar May 21 '23 06:05 jpzhangvincent

> Hi @egils-mtx, I suspect this is due to a quirk of the HNSW algorithm [...] could you try passing index parameters to the collection that sets the search_ef to perhaps 50 or 100.
>
> Let me know if that helps!

Tried this out and set collection_metadata={"hnsw:space": "l2", "hnsw:search_ef": 100} on creation of the Chroma collection. Unfortunately, this didn't fix the issue for me. Manually setting k=25 did alleviate it.
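For anyone following along, the creation call looked roughly like this (a sketch; whether langchain forwards these hnsw:* metadata keys to Chroma may depend on the version you have installed):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# all_docs is assumed to be the list of chunked Documents from earlier
docsearch = Chroma.from_documents(
    all_docs,
    OpenAIEmbeddings(),
    collection_metadata={"hnsw:space": "l2", "hnsw:search_ef": 100},
)
```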

coltonpeltier-db avatar Jul 17 '23 18:07 coltonpeltier-db

In my case I was using the retriever with a RetrievalQA chain. Increasing k to 50 improved my retrieval results as desired, but it then overflowed my context size and the RetrievalQA chain would no longer function. For anyone else in this boat, I worked around it by subclassing VectorStoreRetriever like so:

from langchain.vectorstores.base import VectorStoreRetriever

class VectorStoreRetriever_ChromaWorkAround(VectorStoreRetriever):
    # search with a large k for better recall, then trim to the k we actually want
    actual_k: int = 4

    def get_relevant_documents(self, query):
        return super().get_relevant_documents(query)[: self.actual_k]

my_new_retriever = VectorStoreRetriever_ChromaWorkAround(
    vectorstore=vector_store, search_kwargs={"k": 50}
)
my_new_retriever.get_relevant_documents("sample query")

This allowed me to pass the retriever into the RetrievalQA and get the proper results.

coltonpeltier-db avatar Jul 17 '23 19:07 coltonpeltier-db

Hi, @egils-mtx! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported was about the Chroma vectorstore search not returning the top-scored embeddings when the number of documents in the vector store exceeds a certain threshold. There have been some interesting discussions in the comments. jeffchuber suggested that the issue might be due to the use of an approximate nearest neighbor (ANN) algorithm called HNSW, which is not deterministic. HammadB suggested adjusting the search_ef parameter to ensure consistent results. pseudotensor also pointed out that the default behavior of returning only a few documents is unexpected and differs from other similarity functions such as FAISS's. coltonpeltier-db even provided a workaround for improving retrieval results by subclassing VectorStoreRetriever.

It would be great if you could let us know if this issue is still relevant to the latest version of the LangChain repository. If it is, please comment on this issue to let us know. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot] avatar Oct 16 '23 16:10 dosubot[bot]

This bug wasted my whole day.

simplast avatar Jul 10 '24 11:07 simplast

I'm running into the same issue. I have a relatively low number of documents (a few hundred), and they are similar to each other, and I was getting pretty weird results. Re-running the same similarity search would show me what I expected about 50% of the time.

Setting hnsw:search_ef : 100 actually solved it for me.

mbastian avatar Jul 12 '24 12:07 mbastian

glad that helped!

We're planning to raise ef_search by default to avoid sharp edges like this in the future.


jeffchuber avatar Jul 12 '24 14:07 jeffchuber