chroma
Very slow when using chroma with a 4096-dimension embedding model after adding more than 10000 records
I’m using Chroma with a 4096-dimensional embedding model (sgpt-bloom7b). When I call the add method, adding one vector takes less than 0.1 seconds at first. However, by the time I have added around 50,000 vectors, adding a single vector takes more than 1.5 seconds. What could be the reason for this, and is there any way to improve it?
That’s odd - can you share your data if possible?
Sorry, the data cannot be shared due to confidentiality concerns. But I’m curious: does the collection.add() method become slower as the amount of data in the collection grows? And is there a recommended batch size for a single add? I tried adding more than 10k records at once and it caused a memory overflow.
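One common workaround for the memory overflow is to split a large ingest into fixed-size batches rather than a single huge add() call. A minimal sketch, assuming a standard chromadb collection with `add(ids=..., embeddings=..., documents=...)`; the helper itself is plain Python and the batch size of 1000 is just an illustrative starting point:

```python
def batched(ids, embeddings, documents, batch_size=1000):
    """Yield aligned slices of the three lists so that no single
    collection.add() call has to hold tens of thousands of records."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], embeddings[start:end], documents[start:end]

# Usage sketch (assumes `collection` is an existing chromadb collection):
# for id_chunk, emb_chunk, doc_chunk in batched(all_ids, all_embs, all_docs):
#     collection.add(ids=id_chunk, embeddings=emb_chunk, documents=doc_chunk)
```

Smaller batches trade a little call overhead for a much flatter memory profile, and they also make it easy to log per-batch timings to see where the slowdown kicks in.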
We are experiencing a similar issue. We are embedding academic abstract data, and after ~500k records, ingestion has slowed down by more than a factor of 2. This is going to become a really big problem as we ingest all of our data, which will be on the order of 40 million records.
It seems to me that much of this slowdown is likely caused by the constant pickling and unpickling of data.
Experiencing a similar slowdown.
I have a use case where I will index approximately 100k documents (roughly 1,500 tokens each), and about 10% of them will be updated daily. So I need a database that remains performant for both ingestion and querying at that scale. Could you please advise how we can ensure decent performance on a large amount of data using Chroma?
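For the daily-refresh part of a workload like this, one option is to diff the corpus first and only re-embed and rewrite the documents that actually changed. A minimal sketch, assuming the corpus snapshots are plain `id -> text` dicts; `collection.upsert` is available in recent chromadb releases and overwrites records by id, and the `embed` call in the comment is a hypothetical placeholder for whatever embedding function is in use:

```python
def changed_ids(old_docs, new_docs):
    """Return the ids whose text differs between two corpus snapshots
    (old_docs / new_docs are dicts mapping doc id -> document text).
    New ids absent from old_docs count as changed."""
    return sorted(
        doc_id
        for doc_id, text in new_docs.items()
        if old_docs.get(doc_id) != text
    )

# Usage sketch (assumes a chromadb collection and an embed() helper):
# ids = changed_ids(yesterday, today)
# collection.upsert(
#     ids=ids,
#     documents=[today[i] for i in ids],
#     embeddings=embed([today[i] for i in ids]),
# )
```

This keeps the daily write volume near the 10% that actually changed instead of re-ingesting all 100k documents.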
@HammadB @jeffchuber
Hi everyone - we are releasing an update on Monday (with a seamless migration path) that will move from batch to incremental writes. This will massively improve performance for this kind of use case. Performance in general is a big focus for the core team over the coming months, especially with the work on distributed Chroma.
Completed with Chroma 0.4 - we enable incremental writes persisting to disk.
With chromadb 0.4, our results are as below. Sometimes it is still very slow:
I am working with the latest version of Chroma, but the issue of indexing slowing down as data grows still persists. I indexed the first 900k+ embeddings in almost an hour, but for the second 900k, indexing only 50% of the data took more than an hour.
Is there any workaround for this?
Thank you.