chroma
[Bug]: Client & Persistent Client are retrieving different documents
What happened?
Hi Team,
I noticed that Client and PersistentClient return different documents for the same query. I have cross-checked the indexes, the embeddings, and the number of docs; all are exactly the same.
Saving to and loading from the persistent client is not the problem: there I get the same results each time.
The problem is that the persistent client's results differ from the normal client's.
I am attaching an example here:
Docs from Normal Client k=4
[['This provides a daily snapshot of the ...', 'This is the description of....', 'Table1', 'Table2']]
Docs from Persistent Client k=4
[['Table1', 'Table2', 'Table3', 'Table4']]
So when I run with the persistent client, it somehow drops the top 2 docs that I get from the normal client.
I checked the local files, and the docs and embeddings for these top 2 are stored.
Could you please help me find where exactly the issue is coming from?
Thanks, Sparsh
Versions
Chroma: 0.4.17
Relevant log output
No response
@sparshbhawsar, thanks for raising this. Do you have a short snippet of your add/query with some sample data to help with reproducing this?
Side note: Is the bug reproducible in Chroma 0.5.0?
Hi @tazarov, yes, the issue is still present in version 0.5.0.
I can't provide the data because it's confidential, but I can share code with which you can reproduce this.
import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings
### Using Normal Client
chroma_client = chromadb.Client()
class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        embeddings = ...  # placeholder: your embeddings
        return embeddings
collection = chroma_client.create_collection(name="test", embedding_function=MyEmbeddingFunction(), metadata={"hnsw:space": "cosine"})
# docs = your documents
collection.add(ids=[str(i) for i in range(len(docs))], documents=[d.page_content for d in docs], metadatas=[d.metadata for d in docs])
# query_vector = your query vector
collection.query(query_embeddings=[query_vector], n_results=3)
### Using Persistent Client (Saving to disk)
persistent_client = chromadb.PersistentClient(path="/path/to/save/to")
from chromadb import Documents, EmbeddingFunction, Embeddings
class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        embeddings = ...  # placeholder: your embeddings
        return embeddings
persistent_collection = persistent_client.create_collection(name="test", embedding_function=MyEmbeddingFunction(), metadata={"hnsw:space": "cosine"})
# docs = your documents
persistent_collection.add(ids=[str(i) for i in range(len(docs))], documents=[d.page_content for d in docs], metadatas=[d.metadata for d in docs])
# query_vector = your query vector
persistent_collection.query(query_embeddings=[query_vector], n_results=3)
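Since the real data cannot be shared, one way to make the snippet above reproducible might be to swap in deterministic stand-in embeddings. The sketch below is only an illustration: `fake_embedding` and `FakeEmbeddingFunction` are hypothetical names, the hashing scheme is arbitrary, and the class is a plain callable so that it runs without chromadb installed. To use it with Chroma it would presumably need to subclass `EmbeddingFunction` as in the snippet above.

```python
import hashlib
import math
from typing import List


def fake_embedding(text: str, dim: int = 8) -> List[float]:
    """Map a text to a deterministic unit-length vector (hypothetical stand-in)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Take the first `dim` bytes (dim must be <= 32) and center them around zero.
    raw = [b - 127.5 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


class FakeEmbeddingFunction:
    """Same call shape as MyEmbeddingFunction: list of texts in, list of vectors out."""

    def __call__(self, input: List[str]) -> List[List[float]]:
        return [fake_embedding(text) for text in input]


# Placeholder documents; any strings would do.
docs = ["Table1", "Table2", "Table3", "Table4"]
embeddings = FakeEmbeddingFunction()(docs)
```

Because the vectors depend only on the text, both clients would receive identical inputs, which rules out the embedding step as a source of differences. These vectors carry no semantic meaning, so they only help with checking consistency, not retrieval quality.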
Hi @tazarov, any update on the issue?
@sparshbhawsar,
I've tried with:
import chromadb
### Using Normal Client
chroma_client = chromadb.Client()
collection = chroma_client.create_collection( name="test123", metadata={"hnsw:space": "cosine"} )
docs = ["This provides a daily snapshot of the ...", "This is the description of....","Table1","Table2"]
collection.add(ids=[str(i) for i in range(len(docs))], documents=[d for d in docs])
qr = collection.query( query_texts=["description of snapshot table"], n_results=4)
print(qr)
### Using Persistent Client (Saving to disk)
persistent_client = chromadb.PersistentClient(path="./2134")
persistent_collection = persistent_client.create_collection( name="test", metadata={"hnsw:space": "cosine"} )
# docs = Your Document
persistent_collection.add(ids=[str(i) for i in range(len(docs))], documents=[d for d in docs])
qr1 = persistent_collection.query( query_texts=["description of snapshot table"], n_results=4 )
print(qr1)
A few things to note about the above code: it relies on the default embedding function (it is not great with cosine, but it works), and it yields consistent results for both clients. We do a lot of testing around consistency, so I wonder under what conditions you see this problem. I have two suspects:
- Data
- Custom Embedding functions
I think the next step is for me to work on the first suspect by getting a somewhat more "decent" dataset than just 4 docs. You mentioned that your dataset is private, but can you give me an indication of how many records (embeddings) you add to Chroma and whether your topK results have small or large distances between each other?
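To answer the question about topK distances without sharing any documents, a small helper along these lines might be enough. It is a sketch that assumes the query results are dicts with nested `ids` and `distances` lists (one inner list per query), as returned by `collection.query`. The names `compare_results` and `distance_gaps` are hypothetical, and the sample numbers are made up for illustration only.

```python
from typing import Dict, List


def compare_results(a: Dict, b: Dict) -> Dict[str, List[str]]:
    """List the ids that appear in only one of two query results (first query only)."""
    ids_a, ids_b = a["ids"][0], b["ids"][0]
    return {
        "only_in_a": [i for i in ids_a if i not in ids_b],
        "only_in_b": [i for i in ids_b if i not in ids_a],
    }


def distance_gaps(result: Dict) -> List[float]:
    """Differences between consecutive distances in the ranked result list."""
    d = result["distances"][0]
    return [d[i + 1] - d[i] for i in range(len(d) - 1)]


# Made-up results standing in for the normal and the persistent client.
qr = {"ids": [["0", "1", "2", "3"]], "distances": [[0.125, 0.25, 0.5, 0.75]]}
qr1 = {"ids": [["2", "3", "4", "5"]], "distances": [[0.5, 0.75, 0.875, 1.0]]}

diff = compare_results(qr, qr1)
gaps = distance_gaps(qr)
print(diff)
print(gaps)
```

If the gaps are very small, differences in ordering between two separately built approximate indexes would be less surprising. Large gaps combined with entirely missing ids, as in the report, would point more toward the data or the embedding step.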