chroma
[Bug]:
What happened?
When I try to add many chunks to Chroma, the txt_collection.add function is really slow: it takes almost 10 minutes to process one batch of my files.
txt_collection.add(
    # documents = current_chunk_contents,
    embeddings = embed_split,
    ids = ids,
    metadatas = metadatas,
)
The whole code is:
import logging

# client, text_emb_fn, embeddings, database_name and data are defined elsewhere.
txt_collection = client.create_collection(name=database_name, embedding_function=text_emb_fn, metadata={"hnsw:space": "cosine"})
txt_collection = client.get_collection(name=database_name, embedding_function=text_emb_fn)
batch_size = 40000
for i in range(0, len(data), batch_size):
    # Build the current chunk
    logging.info("Processing index:" + str(i))
    current_chunk = data[i:i + batch_size]
    current_chunk_ids = [item['id'] for item in current_chunk]
    current_chunk_contents = [item['contents'] for item in current_chunk]
    # The last chunk may be shorter than batch_size, so size the ids to the chunk
    ids = [str(num) for num in range(i, i + len(current_chunk))]
    # embeddings
    logging.info("Begin embedding docs.")
    embed_split = embeddings.embed_documents(current_chunk_contents)
    logging.info("End embedding docs.")
    # add
    metadatas = [{"ID": num} for num in current_chunk_ids]
    logging.info("Begin adding documents.")
    txt_collection.add(
        # documents = current_chunk_contents,
        embeddings = embed_split,
        ids = ids,
        metadatas = metadatas,
    )
    logging.info("Done adding documents.")
logging.info("All done!")
The logs are shown below; each add call takes more than ten minutes:
Versions
0.4.24
Relevant log output
No response
Hey @Liuziyu77, that is the time Chroma takes to add 40k embeddings to an existing HNSW graph, and it gets slower the larger your graph is. We have a PR that adds batch ingest, #1668. The PR will not make updating the HNSW graph faster, but it will let your clients add data to Chroma faster, while the graph updates happen in the background from the ingested data (it goes into the Write-Ahead Log and then gets slowly added to the HNSW index). In particular, read through this proposal: https://github.com/chroma-core/chroma/blob/56b88dae96eb8d7082134844fefbf4dbe4bc5297/docs/cip/CIP-01212024-Batch_Ingestion.md
Let me know if this might be useful, and we can try to merge the PR as soon as possible.
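In the meantime, it may help to measure how the add time grows as the graph fills up. The following is only a rough sketch, not part of Chroma or of the PR: `add_in_batches` is a hypothetical helper, and the default `batch_size` of 5000 is an arbitrary assumption. It calls `collection.add` with the same `ids`, `embeddings` and `metadatas` arguments used in the report, in smaller slices, and records how long each call takes. It does not make HNSW updates any faster.

```python
import time
from typing import Any, Dict, List, Sequence


def add_in_batches(
    collection: Any,
    ids: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    metadatas: Sequence[Dict[str, Any]],
    batch_size: int = 5000,  # assumed value, tune for your data
) -> List[float]:
    """Hypothetical helper: add precomputed embeddings in slices and
    return the wall-clock seconds spent in each collection.add call."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not (len(ids) == len(embeddings) == len(metadatas)):
        raise ValueError("ids, embeddings and metadatas must have equal length")

    durations: List[float] = []
    for start in range(0, len(ids), batch_size):
        end = start + batch_size  # slicing past the end is safe in Python
        t0 = time.perf_counter()
        collection.add(
            ids=list(ids[start:end]),
            embeddings=list(embeddings[start:end]),
            metadatas=list(metadatas[start:end]),
        )
        durations.append(time.perf_counter() - t0)
    return durations
```

With the variables from the report, this could be called as `add_in_batches(txt_collection, ids, embed_split, metadatas)`. If the returned durations climb steadily from the first slice to the last, that is consistent with the index update cost growing with graph size rather than with the embedding step.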