chroma
[Bug]: Warning raised when query to Persistent Client
What happened?
I just called collection.query on a collection from my PersistentClient, and a lot of logger warnings were raised.
Every ingested ID shows up in a warning.
The warning comes from chromadb/segment/impl/vector/local_persistent_hnsw.py:
logger.warning(f"Add of existing embedding ID: {id}")
I can't figure out why all the IDs are reported as already existing, or why this warning is logged at all, since my code doesn't add anything.
Everything works, but the logging is very verbose and confusing: it looks as if it tries to add every ID in the collection all over again, even though I didn't add any documents to the collection. I only load it and query it.
Versions
chroma-hnswlib==0.7.3 chromadb==0.4.22
It happens on both Linux (Ubuntu) and macOS, with Python 3.10.
Relevant log output
No response
Hi @vkehfdl1 - could you provide more context? It seems like you are attempting to .add entries with the same id again, could you please share the code where you're ingesting data?
My code looks like this.
# Declared async so that asyncio.gather in process_batch can await it.
async def vectordb_pure(queries: List[str], top_k: int, collection: chromadb.Collection,
                        embedding_model: BaseEmbedding):
    embedded_queries = list(map(embedding_model.get_query_embedding, queries))
    id_result = []
    for embedded_query in embedded_queries:
        result = collection.query(query_embeddings=embedded_query, n_results=top_k)
        id_result.extend(result['ids'])
    return id_result
def main():
    db = chromadb.PersistentClient(path=db_path)
    collection = db.get_collection(name=collection_name)
    embedding_model = OpenAIEmbedding()  # LlamaIndex Embedding
    top_k = 5
    tasks = [vectordb_pure(input_queries, top_k, collection, embedding_model) for input_queries in queries]
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(process_batch(tasks, batch_size=batch))
When I run this code, I get a lot of warnings saying that I am adding an existing ID, as I mentioned. I never add anything to ChromaDB; I only query it.
What does process_batch do? It looks like it might add embeddings - since you are using a persistent client, the collection will have been loaded when you do get_collection - is it possible this collection already contains records with ids you loaded before?
Here is process_batch. It just runs the given tasks in a for loop, so it does not add any embeddings.
async def process_batch(tasks, batch_size: int = 64) -> List[Any]:
    results = []
    for i in range(0, len(tasks), batch_size):
        # Await one batch of coroutines at a time.
        batch = tasks[i:i + batch_size]
        batch_results = await asyncio.gather(*batch)
        results.extend(batch_results)
    return results
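For anyone trying to reproduce this, the batching helper can be exercised on its own, without ChromaDB. The following is only a sketch: fake_query and run_demo are hypothetical names that do not appear in the code above, and fake_query is a stand-in for vectordb_pure that returns IDs without touching any database. It shows that the helper only awaits the coroutines it is given, batch by batch, and returns their results in order.

```python
import asyncio
from typing import Any, Awaitable, List


async def process_batch(tasks: List[Awaitable[Any]], batch_size: int = 64) -> List[Any]:
    results = []
    for i in range(0, len(tasks), batch_size):
        # Await one slice of coroutines at a time; gather keeps input order.
        batch = tasks[i:i + batch_size]
        results.extend(await asyncio.gather(*batch))
    return results


async def fake_query(n: int) -> List[str]:
    # Hypothetical stand-in for vectordb_pure: no database access at all.
    await asyncio.sleep(0)
    return [f"id-{n}"]


def run_demo() -> List[Any]:
    # Build five coroutines and process them in batches of two.
    tasks = [fake_query(n) for n in range(5)]
    return asyncio.run(process_batch(tasks, batch_size=2))
```

Calling run_demo() returns one list of IDs per task, in the order the tasks were created.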
Of course the collection contains IDs, but none of my code adds any embeddings.
When I delete the warning line, it works fine. (The functionality is correct; it only raises the warning.)
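A less invasive workaround than editing the installed package might be a standard logging filter. This is only a sketch: it assumes the module creates its logger with logging.getLogger(__name__), so that the logger name matches the file path quoted above. DropExistingIdWarning and silence_existing_id_warning are hypothetical names, not part of chromadb.

```python
import logging

# Assumption: the logger name mirrors the module path from the issue,
# i.e. chromadb obtains it via logging.getLogger(__name__).
HNSW_LOGGER_NAME = "chromadb.segment.impl.vector.local_persistent_hnsw"


class DropExistingIdWarning(logging.Filter):
    """Drop only the 'Add of existing embedding ID' records."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False discards the record; everything else passes through.
        return "Add of existing embedding ID" not in record.getMessage()


def silence_existing_id_warning() -> logging.Filter:
    # Attach the filter to the one logger that emits the noisy message.
    flt = DropExistingIdWarning()
    logging.getLogger(HNSW_LOGGER_NAME).addFilter(flt)
    return flt
```

Calling silence_existing_id_warning() once before querying should hide only this message and leave other warnings from the same module visible. It hides the symptom and does not change what the library does.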
Maybe @tazarov fixes this issue in #1763. I hope it gets merged soon. Thanks :)
Same problem upon searching
@vkehfdl1, @joaomdmoura, we've found the root cause for this and are working on a fix.