[Bug]: Chromadb will fail to return the embeddings with the closest results unless I set n_results to a sufficiently large number.

Open rohitkumar7100 opened this issue 1 year ago • 3 comments

What happened?

Chromadb will fail to return the embeddings with the closest results unless I set n_results to a sufficiently large number.

I am using version 0.4.22, but this problem has happened with every version I've used. Basic querying of the DB is buggy and does not return the highest-scoring results unless n_results is set to a sufficiently large number. For example, if I perform this search with n_results=10, it fails to find the correct highest-scoring embeddings. I have ~44000 pieces of text in my DB. But when I set n_results to 50 or 100, it shows the correct values.

Versions

Chroma 0.4.22

Relevant log output

No response

rohitkumar7100 avatar Feb 19 '24 16:02 rohitkumar7100

Did we find a resolution to this issue? It occurred during our testing as well, even after we had configured it to 50, which also didn't seem to work. When we do a brute-force search with simple cosine / L2 distance, it works.
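For anyone who wants to put a number on the gap, here is a minimal sketch of the brute-force baseline mentioned above, in plain Python. The helper names (exact_top_k, recall_at_k) are hypothetical and not part of Chroma. It assumes squared L2 and 1 minus cosine similarity as the distance definitions, which should be checked against the distance configured on your collection.

```python
import heapq
import math


def l2_squared(a, b):
    # Squared Euclidean distance (assumed to match the "l2" space).
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query, ids, embeddings, k, distance=l2_squared):
    # Score every stored embedding against the query and keep the k closest.
    scored = ((distance(query, emb), doc_id) for doc_id, emb in zip(ids, embeddings))
    return [doc_id for _, doc_id in heapq.nsmallest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    # Fraction of the exact nearest neighbours that the approximate search returned.
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)
```

Passing the ids returned by collection.query as approx_ids and the output of exact_top_k as exact_ids gives a recall figure per query, which is easier to share in a bug report than the raw data.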

snayan06 avatar May 10 '24 14:05 snayan06

This is unusual behavior and not expected. We have not yet reproduced this. If either of you can share your data, or code for generating data that causes this behavior, that would allow us to debug.

atroyn avatar May 10 '24 22:05 atroyn

@snayan06, as an experiment, can you do the following:

collection = client.get_or_create_collection("my_collection", metadata={"hnsw:search_ef": 100})

collection.query(..., n_results=10) # your usual query with n_results=10, which is the default

From the HNSW perspective, n_results and search_ef are interchangeable, and the lib will pick the higher of the two. However, search_ef is configured automatically, so you can still continue to use lower n_results and get the same benefit.

tazarov avatar May 11 '24 16:05 tazarov

Hi everyone, the bug still exists on the newer chroma too. Just FYI.

chroma-hnswlib==0.7.3
chromadb==0.5.0

AbhiPawar5 avatar Jun 10 '24 13:06 AbhiPawar5

@AbhiPawar5 can you please share your data so we can repro?

atroyn avatar Jun 10 '24 19:06 atroyn

Hi @atroyn, I'm working with a customer's data so I can't share it outside the org. However, what I noticed was the bug is persistent across embedding models. So I'm guessing it is in chroma's core search/HNSW implementation.

AbhiPawar5 avatar Jun 11 '24 04:06 AbhiPawar5

@AbhiPawar5 when creating a new collection, could you try setting the hnsw:search_ef collection metadata key to 50?

i.e.


collection = client.create_collection(
    name="collection_name",
    metadata={"hnsw:search_ef": 50},
)

I suspect what's happening here is that in some cases, particularly where the index is being constructed iteratively, the default search_ef is too low (10), and should be larger by default. Increasing the number of results increases search_ef above 10, leading to better recall.
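One way to check whether a given collection is affected is to run the same query with a small and a large n_results and compare the leading ids. The following is only a sketch: top_k_is_stable is a hypothetical helper, and it assumes collection is a Chroma collection whose query(...) returns a dict with an "ids" list per query embedding.

```python
def top_k_is_stable(collection, query_embedding, k=10, wide=100):
    # Run the same query twice: once asking for k results, once for `wide`.
    narrow = collection.query(
        query_embeddings=[query_embedding], n_results=k
    )["ids"][0]
    broad = collection.query(
        query_embeddings=[query_embedding], n_results=wide
    )["ids"][0]
    # If search quality did not depend on n_results, the first k ids of the
    # wide query would match the narrow query exactly.
    head = broad[:k]
    return narrow == head, narrow, head
```

If the first element comes back False with the default settings but True after recreating the collection with a higher hnsw:search_ef, that would be consistent with the explanation above.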

We have an issue open for improving index parametrization in general, which should help here: https://github.com/chroma-core/chroma/issues/2285

atroyn avatar Jun 11 '24 18:06 atroyn