
The Chroma database may have a limit around 300,000 chunks; it causes the database to crash.

Open zxjason opened this issue 2 years ago • 5 comments


name: Bug report
about: Chroma Database
title: ''
labels: bug
assignees: ''


I have tried several times to ingest data into Chroma. If the total is more than about 300,000 chunks, the program gets killed, and you cannot ingest anything after that. You have to delete the whole database and start again, and it keeps getting killed once it reaches roughly 300,000 chunks.

I suspect this is a limit in the design of the Chroma database.

Environment (please complete the following information):

  • OS / hardware: [ubuntu / Alienware 18]
  • Python version: [3.10.11]

zxjason avatar Jun 04 '23 23:06 zxjason
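
In case it helps narrow this down: below is a minimal sketch (not the actual ingest.py code path) that adds synthetic chunks to a persistent Chroma collection in batches and prints the process RSS, so you can check whether memory usage, rather than a hard chunk limit, is what gets the process killed around 300,000 chunks. It assumes a chromadb version that exposes the PersistentClient API; the db_test path, the collection name, the dummy embeddings, and the use of psutil are all just test scaffolding, not anything from privateGPT.

# Minimal sketch: add synthetic chunks to a persistent Chroma collection in
# batches and watch process memory, to see whether the ~300k-chunk failure is
# an out-of-memory kill rather than a hard Chroma limit.
import os
import psutil
import chromadb

client = chromadb.PersistentClient(path="db_test")   # throwaway test directory
collection = client.get_or_create_collection("oom_check")
proc = psutil.Process(os.getpid())

BATCH = 1_000
for start in range(0, 400_000, BATCH):
    ids = [f"chunk-{i}" for i in range(start, start + BATCH)]
    docs = ["some placeholder chunk text"] * BATCH
    # dummy 384-dim embeddings so the test measures the DB itself,
    # not the embedding model
    embeddings = [[0.0] * 384] * BATCH
    collection.add(ids=ids, documents=docs, embeddings=embeddings)
    rss_mb = proc.memory_info().rss / 1e6
    print(f"{start + BATCH:>7} chunks ingested, RSS {rss_mb:.0f} MB")

If the RSS climbs steadily until the kernel's OOM killer ends the process, then batching the ingestion (or adding RAM/swap) is the likely fix rather than anything Chroma-specific.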

No, I just ingested a file with 760k chunks. [screenshot: IMG_1012]

ProxyAyush avatar Jun 06 '23 00:06 ProxyAyush

No, I just ingested a file with 760k chunks. [screenshot: IMG_1012]

Is it about the memory usage? I don't know why I can't get past 300,000 chunks.

zxjason avatar Jun 06 '23 00:06 zxjason

Is it about the memory usage? I don't know why I can't get past 300,000 chunks.

I had a similar problem. I dropped 3,000 PDF documents into the source_documents folder and ingest.py failed to complete.

Try ingesting smaller document batches. I wrote a bash for loop that, on each iteration, copies about 30 PDFs into source_documents, runs ingest.py, and then removes them with rm -f *.pdf. It worked fine.

abalib avatar Jun 08 '23 14:06 abalib

Is it about the memory usage? I don't know why I can't get past 300,000 chunks.

I had a similar problem. I dropped 3,000 PDF documents into the source_documents folder and ingest.py failed to complete.

Try ingesting smaller document batches. I wrote a bash for loop that, on each iteration, copies about 30 PDFs into source_documents, runs ingest.py, and then removes them with rm -f *.pdf. It worked fine.

Can you please share the bash script?

riturajm avatar Jun 15 '23 01:06 riturajm

My PDF names were mostly numerical, e.g. abc12278.pdf and so on. The main idea is to split them into batches, in this case 100 of them; the nested for loops do the batching. Change the logic and paths to fit your situation. @riturajm

for a in `seq 0 1 9`
do
  for b in `seq 0 1 9`
  do
    # start each batch from an empty source_documents folder
    rm -f ./source_documents/*.pdf
    # copy only the PDFs whose names end in the digits $a$b
    cp *$a$b.pdf ./source_documents/
    # ingest this batch into the existing Chroma store
    python ingest.py
  done
done

abalib avatar Jun 15 '23 02:06 abalib

I used the same process, i.e. I passed the data in batches. It worked fine!

AnnemSony avatar Jan 12 '24 07:01 AnnemSony
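
If you would rather batch inside Python than shuffle files with a shell loop, the same idea looks roughly like the sketch below. It is not the privateGPT ingest.py: the chromadb PersistentClient API and the pypdf-based load_chunks_for() helper are assumptions standing in for whatever loader and splitter you actually use. The point is only that each add() call stays small and the persisted collection grows incrementally.

# Sketch of batched ingestion: process the corpus in small slices so the
# amount of text held in memory at any one time stays bounded.
from pathlib import Path
import chromadb
from pypdf import PdfReader

def load_chunks_for(pdf_path: Path, chunk_size: int = 500) -> list[str]:
    # naive stand-in for a real loader/splitter: extract the text of one PDF
    # and cut it into fixed-size character chunks
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection("documents")

pdfs = sorted(Path("source_documents").glob("*.pdf"))
BATCH_FILES = 30  # roughly the batch size used in the shell-loop workaround above

for i in range(0, len(pdfs), BATCH_FILES):
    for pdf in pdfs[i:i + BATCH_FILES]:
        chunks = load_chunks_for(pdf)
        if not chunks:
            continue
        collection.add(
            ids=[f"{pdf.name}-{n}" for n in range(len(chunks))],
            documents=chunks,
            metadatas=[{"source": pdf.name}] * len(chunks),
        )
    print(f"ingested up to file {min(i + BATCH_FILES, len(pdfs))} of {len(pdfs)}")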

No, I just ingested a file with 760k chunks. [screenshot: IMG_1012]

Is it about the memory usage? I don't know why I can't get past 300,000 chunks.

What is the chunk size with respect to the character count in the text?

preetham003 avatar Apr 08 '24 08:04 preetham003
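
Regarding the chunk size question: if the ingest script uses LangChain's RecursiveCharacterTextSplitter (which, as far as I know, the original privateGPT ingest.py did, with chunk_size=500 and chunk_overlap=50), then chunk_size is measured in characters, not tokens, because the default length_function is len. A quick way to sanity-check that is below; depending on your LangChain version the import may live in langchain_text_splitters instead.

# chunk_size here counts characters (the default length_function is len);
# 500/50 mirrors what the original privateGPT ingest.py used, as far as I recall
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
text = "lorem ipsum dolor sit amet " * 1_000   # ~27,000 characters of filler

chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; longest chunk is {max(len(c) for c in chunks)} characters")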