[Bug]: Warning raised when query to Persistent Client

Open vkehfdl1 opened this issue 1 year ago • 6 comments

What happened?

I just call collection.query on a collection from my PersistentClient, and a lot of logger warnings are raised. Every ingested ID shows up in a warning. The warning comes from chromadb/segment/impl/vector/local_persistent_hnsw.py:

logger.warning(f"Add of existing embedding ID: {id}")

I can't figure out why every ID is reported as already existing, or why this warning is logged at all, since my code never adds anything.

Everything works correctly, but the logging is very verbose and confusing: it looks as if all the IDs in the collection are being added again. I didn't add any documents to the collection; I only loaded it and queried it.

Versions

chroma-hnswlib==0.7.3 chromadb==0.4.22

It happens on both Linux (Ubuntu) and macOS, with Python 3.10.

Relevant log output

No response

vkehfdl1 avatar Feb 18 '24 17:02 vkehfdl1

Hi @vkehfdl1 - could you provide more context? It seems like you are attempting to .add entries with the same id again, could you please share the code where you're ingesting data?

atroyn avatar Feb 19 '24 22:02 atroyn

My code looks like this.

# Declared async so that main() can hand the resulting coroutines to asyncio.gather.
async def vectordb_pure(queries: List[str], top_k: int, collection: chromadb.Collection,
                        embedding_model: BaseEmbedding):
    # Embed every query string, then fetch the top_k nearest ids for each one.
    embedded_queries = list(map(embedding_model.get_query_embedding, queries))
    id_result = []
    for embedded_query in embedded_queries:
        result = collection.query(query_embeddings=embedded_query, n_results=top_k)
        id_result.extend(result['ids'])
    return id_result

def main():
    db = chromadb.PersistentClient(path=db_path)
    collection = db.get_collection(name=collection_name)
    embedding_model = OpenAIEmbedding() # LlamaIndex Embedding
    top_k = 5
    tasks = [vectordb_pure(input_queries, top_k, collection, embedding_model) for input_queries in queries]
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(process_batch(tasks, batch_size=batch))

If I run this code, I get a lot of warnings saying that I am adding an existing ID, as I mentioned. I never add any IDs to ChromaDB; I only query it...

vkehfdl1 avatar Feb 22 '24 14:02 vkehfdl1

What does process_batch do? It looks like it might add embeddings - since you are using a persistent client, the collection will have been loaded when you do get_collection - is it possible this collection already contains records with ids you loaded before?

atroyn avatar Feb 22 '24 23:02 atroyn

Here is process_batch. It just runs the given tasks in a for loop, so it does not add any embeddings.

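import asyncio
from typing import Any, List

# Must be async because it awaits asyncio.gather.
async def process_batch(tasks, batch_size: int = 64) -> List[Any]:
    results = []
    # Await the tasks one batch at a time to limit concurrency.
    for i in range(0, len(tasks), batch_size):
        batch = tasks[i:i + batch_size]
        batch_results = await asyncio.gather(*batch)
        results.extend(batch_results)

    return results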

Of course the collection contains IDs, but none of my code adds embeddings. When I delete the warning line, everything works fine. (The functionality is correct; it just logs the warning.)
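
One possible way to quiet the message without editing the library source is to raise the threshold of that module's logger. This is only a sketch: it assumes the module creates its logger with logging.getLogger(__name__), so the logger name below is derived from the file path quoted above and has not been checked against the chromadb source. The helper name silence_duplicate_id_warnings is made up for illustration.

```python
import logging

# Assumption: the logger name mirrors the module path
# chromadb/segment/impl/vector/local_persistent_hnsw.py.
HNSW_LOGGER_NAME = "chromadb.segment.impl.vector.local_persistent_hnsw"


def silence_duplicate_id_warnings(logger_name: str = HNSW_LOGGER_NAME) -> logging.Logger:
    """Raise the named logger's threshold so WARNING records are dropped."""
    hnsw_logger = logging.getLogger(logger_name)
    hnsw_logger.setLevel(logging.ERROR)
    return hnsw_logger


# Call this once, before running any queries.
hnsw_logger = silence_duplicate_id_warnings()
```

Note that this hides every warning from that logger, not only the "Add of existing embedding ID" one, so it is a stopgap rather than a fix.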

It looks like @tazarov may have fixed this issue in #1763. I hope it gets merged soon. Thanks :)

vkehfdl1 avatar Mar 02 '24 16:03 vkehfdl1

Same problem upon searching

joaomdmoura avatar Mar 26 '24 04:03 joaomdmoura

@vkehfdl1, @joaomdmoura, we've found the root cause of this and are working on a fix.

tazarov avatar Apr 24 '24 12:04 tazarov