chroma
[Bug]: Chromadb will fail to return the embeddings with the closest results unless I set n_results to a sufficiently large number.
What happened?
Chromadb will fail to return the embeddings with the closest results unless I set n_results to a sufficiently large number.
I am using version 0.4.22, but this problem has happened with every version I've used. Basic querying of the DB is buggy: it does not return the highest-scoring results unless n_results is set to a sufficiently large number. For example, if I perform this search with n_results=10, it fails to find the correct highest-scoring embeddings. I have ~44,000 pieces of text in my DB. But when I set n_results to 50 or 100, it returns the correct results.
Versions
Chroma 0.4.22
Relevant log output
No response
Did we find a resolution to this issue? It occurred during our testing as well, even when we had set it to 50, which also didn't seem to work. When we do a brute-force search with simple cosine / L2 distance, it works.
This is unusual behavior and not expected. We have not yet reproduced this - if either of you are able to either share your data, or code for generating data that causes this behavior, this would allow us to debug.
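For anyone who cannot share their data, one way to quantify the problem locally is to compare the IDs a query returns against an exact brute-force ranking. The following is a minimal sketch, not part of Chroma: the helper names (`cosine_distance`, `brute_force_top_k`, `recall_at_k`) are hypothetical, and it assumes the embeddings are available as a plain dict of non-zero vectors keyed by ID.

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_top_k(query, embeddings, k):
    """Exact top-k: score every stored vector, then sort by distance (closest first)."""
    ranked = sorted(embeddings.items(), key=lambda item: cosine_distance(query, item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k IDs that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

Passing the IDs from a query with n_results=10 as `approx_ids` and the output of `brute_force_top_k(query, embeddings, 10)` as `exact_ids` gives a single number that can be reported here without exposing the underlying text. A recall well below 1.0 at n_results=10 that rises at 50 or 100 would match the behavior described in this issue.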
@snayan06, as an experiment, can you do the following:
collection = client.get_or_create_collection("my_collection", metadata={"hnsw:search_ef": 100})
collection.query(..., n_results=10)  # your usual query with n_results=10, which is the default
From the HNSW perspective, n_results and search_ef are interchangeable, and the lib will pick the higher of the two. However, search_ef is configured automatically, so you can still continue to use lower n_results and get the same benefit.
Hi everyone, the bug still exists on the newer chroma too. Just FYI.
chroma-hnswlib==0.7.3
chromadb==0.5.0
@AbhiPawar5 can you please share your data so we can repro?
Hi @atroyn, I'm working with a customer's data so I can't share it outside the org. However, what I noticed was the bug is persistent across embedding models. So I'm guessing it is in chroma's core search/HNSW implementation.
@AbhiPawar5 when creating a new collection, could you try setting the hnsw:search_ef collection metadata key to 50?
i.e.
collection = client.create_collection(
    name="collection_name",
    metadata={"hnsw:search_ef": 50},
)
I suspect what's happening here is that in some cases, particularly where the index is being constructed iteratively, the default search_ef is too low (10), and should be larger by default. Increasing the number of results increases search_ef above 10, leading to better recall.
We have an issue open for improving index parametrization in general, which should help here: https://github.com/chroma-core/chroma/issues/2285
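To illustrate why a larger search_ef can change which results come back, here is a toy sketch of a best-first search over a single-layer proximity graph with an ef-sized candidate pool. It is not the hnswlib or Chroma implementation (real HNSW is multi-layer and builds its graph differently); `build_graph`, `search`, and their parameters are hypothetical names, and the `max(search_ef, k)` line only mirrors the "higher of the two" rule described in this thread.

```python
import heapq

def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_graph(points, m=4):
    """Toy undirected graph: each node links to its m nearest neighbours,
    plus a ring edge to the next index so the graph stays connected."""
    n = len(points)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: l2(points[i], points[j]))[:m]
        for j in nearest + [(i + 1) % n]:
            if j != i:
                graph[i].add(j)
                graph[j].add(i)
    return graph

def search(points, graph, query, k, search_ef, entry=0):
    """Best-first search that keeps at most ef candidates; returns k node ids."""
    ef = max(search_ef, k)  # the larger of the two values wins
    d0 = l2(query, points[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap of nodes still to expand
    results = [(-d0, entry)]     # max-heap (negated distance), size <= ef
    while candidates:
        dist, node = heapq.heappop(candidates)
        # Stop once the pool is full and the best open candidate is worse
        # than the worst node currently kept.
        if len(results) >= ef and dist > -results[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2(query, points[nb])
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst
    ranked = sorted((-neg_d, node) for neg_d, node in results)
    return [node for _, node in ranked[:k]]
```

With a small ef the search can stop in a local neighbourhood before reaching the true nearest points, while an ef at least as large as the number of nodes forces this toy search to visit the whole connected graph and return the exact answer. The gap between those two extremes is the recall loss being discussed here.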