
[Bug]: Client & Persistent Client are retrieving different documents

Open sparshbhawsar opened this issue 1 year ago • 4 comments

What happened?

Hi Team,

I noticed that when I use Client and PersistentClient I get different docs. I have cross-checked the indexes, the embeddings, and the length of the docs; all are exactly the same.

Saving to and loading from the persistent client works fine; I get the same results there.

The problem is that the persistent client's results differ from the normal client's.

I am attaching an example here:

Docs from Normal Client k=4

[['This provides a daily snapshot of the ...', 'This is the description of....', 'Table1', 'Table2']]

Docs from Persistent Client k=4

[['Table1', 'Table2', 'Table3', 'Table4']]

So when I run with the persistent client, it somehow drops the top 2 docs that I get from the normal client.

I checked the local files, and the docs and embeddings for these top 2 are stored.

Could you please help me find where exactly the issue is coming from?

Thanks, Sparsh

Versions

Chroma: 0.4.17

Relevant log output

No response

sparshbhawsar avatar May 04 '24 07:05 sparshbhawsar

@sparshbhawsar, thanks for raising this. Do you have a short snippet of your add/query with some sample data to help with reproducing this?

Side note: Is the bug reproducible in Chroma 0.5.0?

tazarov avatar May 04 '24 15:05 tazarov

Hi @tazarov, yes, the issue is still present in version 0.5.0.

I can't provide the data because it's confidential, but I can share the code with which you can reproduce this.

import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings

### Using Normal Client
chroma_client = chromadb.Client()

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        embeddings = ...  # replace with your embeddings
        return embeddings

collection = chroma_client.create_collection(
    name="test",
    embedding_function=MyEmbeddingFunction(),
    metadata={"hnsw:space": "cosine"},
)

# docs = <your documents>

collection.add(
    ids=[str(i) for i in range(len(docs))],
    documents=[d.page_content for d in docs],
    metadatas=[d.metadata for d in docs],
)

# query_vector = <your query vector>
collection.query(query_embeddings=[query_vector], n_results=3)


### Using Persistent Client (Saving to disk)
from chromadb import Documents, EmbeddingFunction, Embeddings

persistent_client = chromadb.PersistentClient(path="/path/to/save/to")

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        embeddings = ...  # replace with your embeddings
        return embeddings

persistent_collection = persistent_client.create_collection(
    name="test",
    embedding_function=MyEmbeddingFunction(),
    metadata={"hnsw:space": "cosine"},
)

# docs = <your documents>

persistent_collection.add(
    ids=[str(i) for i in range(len(docs))],
    documents=[d.page_content for d in docs],
    metadatas=[d.metadata for d in docs],
)

# query_vector = <your query vector>
persistent_collection.query(query_embeddings=[query_vector], n_results=3)

sparshbhawsar avatar May 04 '24 17:05 sparshbhawsar

Hi @tazarov, any update on the issue?

sparshbhawsar avatar May 07 '24 02:05 sparshbhawsar

@sparshbhawsar,

I've tried with:

import chromadb

### Using Normal Client 
chroma_client = chromadb.Client()


collection = chroma_client.create_collection( name="test123", metadata={"hnsw:space": "cosine"} )

docs = ["This provides a daily snapshot of the ...", "This is the description of....","Table1","Table2"] 

collection.add(ids=[str(i) for i in range(len(docs))], documents=[d for d in docs])

qr = collection.query( query_texts=["description of snapshot table"], n_results=4)

print(qr)

### Using Persistent Client (Saving to disk)
persistent_client = chromadb.PersistentClient(path="./2134")


persistent_collection = persistent_client.create_collection( name="test", metadata={"hnsw:space": "cosine"} )

# docs = Your Document 

persistent_collection.add(ids=[str(i) for i in range(len(docs))], documents=[d for d in docs])

qr1 = persistent_collection.query( query_texts=["description of snapshot table"], n_results=4 )

print(qr1)

A few things to note about the above code: it relies on the default embedding function (it is not great with cosine, but it works), and it yields consistent results for both clients. We do a lot of testing around consistency, so I wonder under what conditions you see this problem. I have two suspects:

  • Data
  • Custom Embedding functions
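
One way to narrow down the second suspect is to check that the custom embedding function returns the same vectors when it is called twice with the same input. The following is a minimal sketch, assuming the embedding function is a callable that takes a list of strings and returns a list of float vectors. The names `is_deterministic` and `toy_embed` are hypothetical and are not part of Chroma; `toy_embed` only stands in for a real embedding function.

```python
import hashlib
import math


def is_deterministic(embed_fn, texts, tol=1e-9):
    """Return True if two calls on the same input give the same vectors."""
    first = embed_fn(texts)
    second = embed_fn(texts)
    if len(first) != len(second):
        return False
    for a, b in zip(first, second):
        if len(a) != len(b):
            return False
        # Compare component by component with a small absolute tolerance.
        if any(not math.isclose(x, y, abs_tol=tol) for x, y in zip(a, b)):
            return False
    return True


def toy_embed(texts):
    """Stand-in embedding function: hashes each text into 4 floats in [0, 1]."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([byte / 255 for byte in digest[:4]])
    return vectors


print(is_deterministic(toy_embed, ["Table1", "Table2"]))
```

If the check fails for the real embedding function, the two collections may hold different vectors for the same documents, which could explain differing query results regardless of which client is used.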

I think the next step is for me to work on the first by getting a more "decent" dataset than just 4 docs. You mentioned that your dataset is private, but can you give me an indication of how many records (embeddings) you add to Chroma and whether your topK results have small or large distances between each other?
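
To answer the distance question, the two query results can be compared side by side. The following is a minimal sketch, assuming both results are dictionaries in the shape that `collection.query` returns (`ids` and `distances` each holding one list per query) and that distances were requested. The helper `compare_results` is hypothetical, and the sample IDs and distances are made up for illustration only.

```python
def compare_results(result_a, result_b, query_index=0):
    """Summarise how two query results differ for one query."""
    ids_a = result_a["ids"][query_index]
    ids_b = result_b["ids"][query_index]
    dist_a = result_a["distances"][query_index]
    dist_b = result_b["distances"][query_index]
    return {
        "same_order": ids_a == ids_b,
        "only_in_a": [i for i in ids_a if i not in ids_b],
        "only_in_b": [i for i in ids_b if i not in ids_a],
        # Gap between consecutive neighbours. Very small gaps indicate
        # near-ties, where an approximate index may plausibly reorder results.
        "gaps_a": [round(y - x, 6) for x, y in zip(dist_a, dist_a[1:])],
        "gaps_b": [round(y - x, 6) for x, y in zip(dist_b, dist_b[1:])],
    }


# Made-up sample data in the assumed result shape.
normal = {"ids": [["0", "1", "2", "3"]], "distances": [[0.10, 0.12, 0.40, 0.41]]}
persistent = {"ids": [["2", "3", "4", "5"]], "distances": [[0.40, 0.41, 0.45, 0.50]]}

report = compare_results(normal, persistent)
print(report)
```

With real collections, the inputs would come from running the same query on each client with `include=["documents", "distances"]`. IDs that appear in only one result, combined with large distance gaps, would point away from simple tie-breaking and toward a difference in the stored data or embeddings.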

tazarov avatar May 07 '24 16:05 tazarov