[Bug]: add_documents gets slower with each call
What happened?
I have 2 million articles that are being chunked into roughly 12 million documents using LangChain. I want to run a search over these documents, so ideally I would like to have them all in one Chroma DB. Would the quickest way to insert millions of documents into a Chroma DB be to insert all of them when the DB is created, or to use db.add_documents()? Right now I'm calling db.add_documents() in chunks of 100,000, but each call seems to take longer and longer. Should I just try inserting all 12 million chunks when I create the DB? I have a GPU and plenty of storage. It used to take about 30 minutes per 100k documents, but now a single add_documents call for 100k documents takes a little over an hour.
Versions
Running CUDA on a VM, 1 GPU
Relevant log output
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import SentenceTransformerEmbeddings
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Local sentence-transformers model, run on the GPU
model_path = "./multi-qa-MiniLM-L6-cos-v1/"
model_kwargs = {"device": "cuda"}
embeddings = SentenceTransformerEmbeddings(model_name=model_path, model_kwargs=model_kwargs)
# First batch: split the first 100k articles into 500-character chunks
documents_array = documents[0:100000]
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)
docs = text_splitter.create_documents(documents_array)

# Create the persistent Chroma DB from the first batch
persist_directory = "chroma_db"
vectordb = Chroma.from_documents(
    documents=docs, embedding=embeddings, persist_directory=persist_directory
)
vectordb.persist()
vectordb._collection.count()

# Later batches: split the next slice of articles and add them to the same DB
docs = text_splitter.create_documents(documents[500000:600000])
def batch_process(documents_arr, batch_size, process_function):
    # Process the documents in fixed-size batches
    for i in range(0, len(documents_arr), batch_size):
        batch = documents_arr[i:i + batch_size]
        process_function(batch)

def add_to_chroma_database(batch):
    # Each call embeds the batch and adds it to the existing collection
    vectordb.add_documents(documents=batch)

batch_size = 41000
batch_process(docs, batch_size, add_to_chroma_database)
@tazarov would this be a good place to clean up the WAL? That's probably part of why it's getting slower.
WAL is definitely one thing. Yet, I feel there is something odd here. I've personally created collections of 10M+ embeddings, and the runtime for adding each additional 1M embeddings does go up as the HNSW binary index grows. I also need to check LangChain's implementation, but it might be slower than adding the docs with the Chroma persistent client directly.
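For reference, a minimal sketch of what adding documents through the persistent client directly could look like, assuming the Chroma Python client and the same local sentence-transformers model (the collection name "articles" and the ID scheme are made up for illustration):

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./multi-qa-MiniLM-L6-cos-v1/", device="cuda")
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(name="articles")

def add_chunk_batch(texts, start_id):
    # Embed the whole batch on the GPU in one call, then hand Chroma the precomputed vectors
    vectors = model.encode(texts, batch_size=256).tolist()
    collection.add(
        ids=[f"chunk-{start_id + i}" for i in range(len(texts))],
        documents=texts,
        embeddings=vectors,
    )

This skips the LangChain wrapper entirely and lets you control the embedding batch size and IDs yourself.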
@saachishenoy, when you create your collection, you can specify hnsw:batch_size and hnsw:sync_threshold. The batch size controls the size of the in-memory (aka brute-force) buffer, whereas the threshold controls how frequently Chroma dumps the binary index to disk. The rule of thumb is batch size < threshold. That said, try bumping the batch size to 10k (or more) and the threshold to 20-50k.
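A minimal sketch of passing those settings at collection-creation time, assuming they are supplied as collection metadata (the exact keys and accepted values depend on your Chroma version, and the collection name is illustrative):

import chromadb

client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(
    name="articles",
    metadata={
        "hnsw:batch_size": 10000,      # in-memory brute-force buffer size
        "hnsw:sync_threshold": 50000,  # how often the HNSW index is flushed to disk
    },
)

If you stay on the LangChain wrapper, the same metadata can likely be passed via its collection_metadata argument, but check the version you are running.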
Still, the slowest part of Chroma is adding the vectors to the HNSW index; generally, that cannot be sped up too much. This brings me to my next question, @saachishenoy: What CPU arch are you running Chroma on? If it is Intel, then there is a good chance that rebuilding the HNSW lib for AVX support will boost performance.
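A quick way to check the CPU, assuming a Linux VM (this reads the advertised CPU flags from /proc/cpuinfo; it only tells you what the CPU supports, not whether the installed HNSW build was compiled to use it):

# Check whether the CPU advertises AVX / AVX2 / AVX-512 support (Linux only)
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
print({name: name in flags for name in ("avx", "avx2", "avx512f")})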
Any progress on this? I'm trying to build a collection of a similar size and the inserts are getting very slow.
Closing, as this has been inactive for some time. If you are still running into this issue in a newer version of Chroma (0.6.0 or later), please feel free to open a new issue!