chroma
Very slow when using chroma with a 4096-dimension embedding model after adding more than 10000 records
I’m using Chroma with a 4096-dimensional embedding model (sgpt-bloom7b). When I call the add method, adding one vector takes less than 0.1 seconds at first. However, by the time I have added around 50,000 vectors, adding a single vector takes more than 1.5 seconds. What could be the reason for this, and is there any way to improve it?
That’s odd - can you share your data if possible?
Sorry, the data cannot be shared due to confidentiality concerns. But I’m curious: does the collection.add() method become slower as the amount of data in the collection grows? And is there a recommended batch size for a single add? I tried adding more than 10k records at once and it caused a memory overflow.
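One common workaround for the memory overflow is to split a large ingest into fixed-size batches rather than a single huge add() call. A minimal sketch, assuming a standard chromadb collection with `add(ids=..., embeddings=..., documents=...)`; the helper itself is plain Python and the batch size of 1000 is just an illustrative starting point:

```python
def batched(ids, embeddings, documents, batch_size=1000):
    """Yield aligned slices of the three lists so that no single
    collection.add() call has to hold tens of thousands of records."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], embeddings[start:end], documents[start:end]

# Usage sketch (assumes `collection` is an existing chromadb collection):
# for id_chunk, emb_chunk, doc_chunk in batched(all_ids, all_embs, all_docs):
#     collection.add(ids=id_chunk, embeddings=emb_chunk, documents=doc_chunk)
```

Smaller batches trade a little call overhead for a much flatter memory profile, and they also make it easy to log per-batch timings to see where the slowdown kicks in.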
We are experiencing a similar issue. We are embedding academic abstract data, and after ~500k records, ingestion has slowed down by more than a factor of 2. This is going to become a really big problem as we ingest all of our data, which will be on the order of 40 million records.
It seems to me that much of this slowdown is likely caused by the constant pickling and unpickling of data.
Experiencing a similar slowdown.
I have a use case where I will index approximately 100k documents (roughly 1,500 tokens each), and about 10% of them will be updated daily. So I need a database that remains performant for both ingestion and querying at that scale. Could you please advise how we can ensure decent performance on a large amount of data using Chroma?
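For the daily-refresh part of a workload like this, one option is to diff the corpus first and only re-embed and rewrite the documents that actually changed. A minimal sketch, assuming the corpus snapshots are plain `id -> text` dicts; `collection.upsert` is available in recent chromadb releases and overwrites records by id, and the `embed` call in the comment is a hypothetical placeholder for whatever embedding function is in use:

```python
def changed_ids(old_docs, new_docs):
    """Return the ids whose text differs between two corpus snapshots
    (old_docs / new_docs are dicts mapping doc id -> document text).
    New ids absent from old_docs count as changed."""
    return sorted(
        doc_id
        for doc_id, text in new_docs.items()
        if old_docs.get(doc_id) != text
    )

# Usage sketch (assumes a chromadb collection and an embed() helper):
# ids = changed_ids(yesterday, today)
# collection.upsert(
#     ids=ids,
#     documents=[today[i] for i in ids],
#     embeddings=embed([today[i] for i in ids]),
# )
```

This keeps the daily write volume near the 10% that actually changed instead of re-ingesting all 100k documents.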
@HammadB @jeffchuber
Hi everyone - we are releasing an update on Monday (with a seamless migration path) that will move from batch to incremental writes. This will massively improve performance for this kind of use case. Performance in general is a big focus for the core team over the coming months, especially with the work on distributed Chroma.
Completed with Chroma 0.4 - we enable incremental writes persisting to disk.
With chromadb 0.4, our results are as below. Sometimes it is still very slow:
I am working with the latest version of Chroma, but the issue of indexing slowing down as data grows still persists. I indexed the first 900k+ embeddings in almost an hour, but for the second 900k, indexing only 50% of the data took more than an hour.
Is there any workaround for this?
Thank you.