[Bug]:

Open Liuziyu77 opened this issue 1 year ago • 1 comments

What happened?

When I try to add many chunks to Chroma, the txt_collection.add function is really slow; it takes almost 10 minutes to process one batch of my files.

  txt_collection.add(
      # documents = current_chunk_contents,
      embeddings = embed_split,
      ids = ids,
      metadatas = metadatas,
  )

The whole code is:

txt_collection = client.create_collection(name=database_name, embedding_function=text_emb_fn, metadata={"hnsw:space": "cosine"})
txt_collection = client.get_collection(name=database_name, embedding_function=text_emb_fn)
batch_size = 40000  
for i in range(0, len(data), batch_size):
    # Build the current chunk
    logging.info("Processing index:" + str(i))
    current_chunk = data[i:i + batch_size]
    current_chunk_ids = [item['id'] for item in current_chunk]
    current_chunk_contents = [item['contents'] for item in current_chunk]

    # One ID per item in this chunk (the last chunk may be shorter than batch_size)
    ids = [str(num) for num in range(i, i + len(current_chunk))]
    
    # embeddings
    logging.info("Begin embedding docs.")
    embed_split = embeddings.embed_documents(current_chunk_contents)
    logging.info("End embedding docs.")
    
    # add
    metadatas = [{"ID":num} for num in current_chunk_ids]
    logging.info("Begin adding documents.")
    txt_collection.add(
        # documents = current_chunk_contents,
        embeddings = embed_split,
        ids = ids,
        metadatas = metadatas,
    )
    logging.info("Done adding documents.")
    
logging.info("All done!")

The logs are shown below; each add call takes more than ten minutes: 30086c58debc5f33274c139428ca735

Versions

0.4.24

Relevant log output

No response

Liuziyu77, Apr 15 '24 03:04

Hey @Liuziyu77, that time is what Chroma takes to add 40k embeddings to an existing HNSW graph, and it gets slower the larger your graph is. We have a PR that adds batch ingest, #1668. The PR will not make updating the HNSW graph faster, but it will let your clients add data to Chroma faster while the graph updates happen in the background from the ingested data (the data goes into the Write-Ahead Log and is then slowly added to the HNSW index). In particular, read through this proposal: https://github.com/chroma-core/chroma/blob/56b88dae96eb8d7082134844fefbf4dbe4bc5297/docs/cip/CIP-01212024-Batch_Ingestion.md

Let me know if this might be useful, and we can try to merge the PR as soon as possible.

tazarov, Apr 15 '24 11:04
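
To see the slowdown described in the reply above (add time growing with the size of the HNSW graph), one option is to time each add call on smaller slices of the 40k batch. The sketch below is only an illustration, not part of Chroma: the helper name add_in_sub_batches and the sub_batch_size default are hypothetical, and it assumes only that the collection object exposes add(embeddings=..., ids=..., metadatas=...) as used in the issue. It does not make indexing faster; it only makes the per-call cost visible.

```python
import time
from typing import Any, Dict, List, Sequence


def add_in_sub_batches(
    collection: Any,
    embeddings: Sequence[Sequence[float]],
    ids: Sequence[str],
    metadatas: Sequence[Dict[str, Any]],
    sub_batch_size: int = 5000,  # hypothetical default, tune for your data
) -> List[float]:
    """Call collection.add() on consecutive slices; return seconds spent per call."""
    if not (len(embeddings) == len(ids) == len(metadatas)):
        raise ValueError("embeddings, ids and metadatas must have the same length")
    if sub_batch_size <= 0:
        raise ValueError("sub_batch_size must be positive")

    timings: List[float] = []
    for start in range(0, len(ids), sub_batch_size):
        end = start + sub_batch_size  # slicing clamps at the end of the data
        t0 = time.perf_counter()
        collection.add(
            embeddings=list(embeddings[start:end]),
            ids=list(ids[start:end]),
            metadatas=list(metadatas[start:end]),
        )
        timings.append(time.perf_counter() - t0)
    return timings
```

In the loop from the issue, the txt_collection.add(...) call could be swapped for timings = add_in_sub_batches(txt_collection, embed_split, ids, metadatas), and the timings logged. If later entries are consistently larger than earlier ones, that matches the explanation that inserts slow down as the graph grows.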