
The Chroma database may have a limit around 300,000 chunks; it causes the database to crash.

Open zxjason opened this issue 2 years ago • 5 comments


name: Bug report
about: Chroma Database
title: ''
labels: bug
assignees: ''


I have tried several times to ingest data into Chroma. If the total is more than about 300,000 chunks, the program gets killed, and you cannot ingest anything after that. You have to delete the whole database and start again, and it keeps getting killed once it reaches roughly 300,000 chunks.

I suspect this is a limit in the design of the Chroma database.

Environment (please complete the following information):

  • OS / hardware: [ubuntu / Alienware 18]
  • Python version: [3.10.11]

zxjason avatar Jun 04 '23 23:06 zxjason
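
In case it helps narrow this down: below is a minimal sketch (not the actual ingest.py code path) that adds synthetic chunks to a persistent Chroma collection in batches and prints the process RSS, so you can check whether memory usage, rather than a hard chunk limit, is what gets the process killed around 300,000 chunks. It assumes a chromadb version that exposes the PersistentClient API; the db_test path, the collection name, the dummy embeddings, and the use of psutil are all just test scaffolding, not anything from privateGPT.

# Minimal sketch: add synthetic chunks to a persistent Chroma collection in
# batches and watch process memory, to see whether the ~300k-chunk failure is
# an out-of-memory kill rather than a hard Chroma limit.
import os
import psutil
import chromadb

client = chromadb.PersistentClient(path="db_test")   # throwaway test directory
collection = client.get_or_create_collection("oom_check")
proc = psutil.Process(os.getpid())

BATCH = 1_000
for start in range(0, 400_000, BATCH):
    ids = [f"chunk-{i}" for i in range(start, start + BATCH)]
    docs = ["some placeholder chunk text"] * BATCH
    # dummy 384-dim embeddings so the test measures the DB itself,
    # not the embedding model
    embeddings = [[0.0] * 384] * BATCH
    collection.add(ids=ids, documents=docs, embeddings=embeddings)
    rss_mb = proc.memory_info().rss / 1e6
    print(f"{start + BATCH:>7} chunks ingested, RSS {rss_mb:.0f} MB")

If the RSS climbs steadily until the kernel's OOM killer ends the process, then batching the ingestion (or adding RAM/swap) is the likely fix rather than anything Chroma-specific.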

No, I just ingested a file with 760k chunks. [screenshot: IMG_1012]

ProxyAyush avatar Jun 06 '23 00:06 ProxyAyush

No, I just ingested a file with 760k chunks. [screenshot: IMG_1012]

Is it about the memory usage? I don't know why I can't get past 300,000 chunks.

zxjason avatar Jun 06 '23 00:06 zxjason

Is it about the memory usage? I don't know why I can't get past 300,000 chunks.

I had a similar problem. I dropped 3,000 PDF documents into the source_documents folder and ingest.py failed to complete.

Try ingesting smaller document batches. I wrote a bash for loop that, on each iteration, copies about 30 PDFs into source_documents, runs ingest.py, and then removes them with rm -f *.pdf. It worked fine.

abalib avatar Jun 08 '23 14:06 abalib

Is it about the memory usage? I don't know why I can't get past 300,000 chunks.

I had a similar problem. I dropped 3,000 PDF documents into the source_documents folder and ingest.py failed to complete.

Try ingesting smaller document batches. I wrote a bash for loop that, on each iteration, copies about 30 PDFs into source_documents, runs ingest.py, and then removes them with rm -f *.pdf. It worked fine.

Can you please share the bash script?

riturajm avatar Jun 15 '23 01:06 riturajm

My PDF names were mostly numerical, e.g. abc12278.pdf and so on. The main idea is to split them into batches, in this case 100 of them; the nested for loops do the batching. Change the logic and paths to fit your situation. @riturajm

for a in `seq 0 1 9`
do
  for b in `seq 0 1 9`
  do
    # start each batch from an empty source_documents folder
    rm -f ./source_documents/*.pdf
    # copy only the PDFs whose names end in the digits $a$b
    cp *$a$b.pdf ./source_documents/
    # ingest this batch into the existing Chroma store
    python ingest.py
  done
done

abalib avatar Jun 15 '23 02:06 abalib

I used the same process, i.e. I passed the data in batches. It worked fine!

AnnemSony avatar Jan 12 '24 07:01 AnnemSony
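
If you would rather batch inside Python than shuffle files with a shell loop, the same idea looks roughly like the sketch below. It is not the privateGPT ingest.py: the chromadb PersistentClient API and the pypdf-based load_chunks_for() helper are assumptions standing in for whatever loader and splitter you actually use. The point is only that each add() call stays small and the persisted collection grows incrementally.

# Sketch of batched ingestion: process the corpus in small slices so the
# amount of text held in memory at any one time stays bounded.
from pathlib import Path
import chromadb
from pypdf import PdfReader

def load_chunks_for(pdf_path: Path, chunk_size: int = 500) -> list[str]:
    # naive stand-in for a real loader/splitter: extract the text of one PDF
    # and cut it into fixed-size character chunks
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection("documents")

pdfs = sorted(Path("source_documents").glob("*.pdf"))
BATCH_FILES = 30  # roughly the batch size used in the shell-loop workaround above

for i in range(0, len(pdfs), BATCH_FILES):
    for pdf in pdfs[i:i + BATCH_FILES]:
        chunks = load_chunks_for(pdf)
        if not chunks:
            continue
        collection.add(
            ids=[f"{pdf.name}-{n}" for n in range(len(chunks))],
            documents=chunks,
            metadatas=[{"source": pdf.name}] * len(chunks),
        )
    print(f"ingested up to file {min(i + BATCH_FILES, len(pdfs))} of {len(pdfs)}")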

No, I just ingested a file with 760k chunks. [screenshot: IMG_1012]

Is it about the memory usage? I don't know why I can't get past 300,000 chunks.

What is the chunk size with respect to the character count in the text?

preetham003 avatar Apr 08 '24 08:04 preetham003
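
Regarding the chunk size question: if the ingest script uses LangChain's RecursiveCharacterTextSplitter (which, as far as I know, the original privateGPT ingest.py did, with chunk_size=500 and chunk_overlap=50), then chunk_size is measured in characters, not tokens, because the default length_function is len. A quick way to sanity-check that is below; depending on your LangChain version the import may live in langchain_text_splitters instead.

# chunk_size here counts characters (the default length_function is len);
# 500/50 mirrors what the original privateGPT ingest.py used, as far as I recall
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
text = "lorem ipsum dolor sit amet " * 1_000   # ~27,000 characters of filler

chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; longest chunk is {max(len(c) for c in chunks)} characters")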