Add new documents to an existing chroma collection

Open RonquilloAeon opened this issue 1 year ago • 6 comments

Addresses #124. With this new functionality, you can now add new documents to an existing chroma collection.

RonquilloAeon avatar May 16 '23 15:05 RonquilloAeon

There is an interesting idea in #201, which uses the metadata already stored in the Chroma collection to avoid re-ingesting documents that are already there. That would make "deleting the docs after ingestion" unnecessary. I honestly don't like the idea of deleting documents. What do you think?

https://github.com/imartinez/privateGPT/pull/201/files#diff-abd34d324dcce966837adc13a98b86edc8f868d80e3ae601152f81a59c59734cR52
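For illustration only (this is not the code from #201), the metadata check could look roughly like the sketch below. It assumes each stored chunk carries a `source` key in its metadata holding the original file path, and that the list of metadata dicts has already been fetched from the collection (for example via something like `collection.get(include=["metadatas"])["metadatas"]`). The helper names `already_ingested_sources` and `files_to_ingest` are hypothetical.

```python
from typing import Iterable, List, Mapping, Optional, Set


def already_ingested_sources(
    metadatas: Iterable[Optional[Mapping[str, object]]],
) -> Set[str]:
    """Collect the distinct 'source' values found in stored chunk metadata.

    Assumption: every ingested chunk records its originating file path
    under the 'source' key. Entries without that key are ignored.
    """
    return {str(m["source"]) for m in metadatas if m and "source" in m}


def files_to_ingest(
    candidates: Iterable[str],
    metadatas: Iterable[Optional[Mapping[str, object]]],
) -> List[str]:
    """Return only the candidate paths that are not in the collection yet.

    Paths are compared as plain strings, so candidates must be written in
    the same form (relative vs. absolute) as the stored 'source' values.
    """
    known = already_ingested_sources(metadatas)
    return [path for path in candidates if path not in known]
```

With a check like this, nothing in the source directory has to be deleted: files whose path already appears in the stored metadata are simply skipped on the next run.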

imartinez avatar May 16 '23 17:05 imartinez

Once adding new documents without having to reload everything works reliably, periodic persistence of the db would become an effective way to avoid a massive loss of effort when a bug or resource under-allocation causes the run to be aborted. For example, too few tokens may be available from the MODEL_N_CTX value in the .env; an Arabic or Chinese passage or document can cause that failure many hours into the process. See "Cannot ingest Chinese text file #19". Batching sub-collections of files into the db would seem a rather easy solution.
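A rough sketch of the batching idea, under stated assumptions: `ingest_batch` and `persist` are hypothetical callables standing in for whatever embeds a group of files and whatever flushes the db to disk; neither is an existing privateGPT function.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def ingest_in_batches(
    files: Sequence[str],
    batch_size: int,
    ingest_batch: Callable[[List[str]], None],  # hypothetical: embeds one batch
    persist: Callable[[], None],                # hypothetical: flushes db to disk
) -> int:
    """Ingest files batch by batch, persisting after every successful batch.

    If `ingest_batch` raises, the exception propagates, but every batch
    completed before the failure has already been persisted.
    """
    done = 0
    for batch in batched(files, batch_size):
        ingest_batch(batch)
        persist()
        done += len(batch)
    return done
```

If a file in a late batch triggers the failure, only that batch is lost; combined with skipping already-ingested files, a rerun could pick up where the aborted one stopped.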

johnbrisbin avatar May 16 '23 17:05 johnbrisbin

> Once adding new documents without having to reload everything works reliably, periodic persistence of the db would become an effective way to avoid a massive loss of effort when a bug or resource under-allocation causes the run to be aborted. For example, too few tokens may be available from the MODEL_N_CTX value in the .env; an Arabic or Chinese passage or document can cause that failure many hours into the process. See "Cannot ingest Chinese text file #19". Batching sub-collections of files into the db would seem a rather easy solution.

Definitely a good point. I think it belongs in a separate feature request, though. Feel free to create it.

imartinez avatar May 16 '23 18:05 imartinez

> I honestly don't like the idea of deleting documents

What if only symlinks landed in the source docs dir? I feel no unease deleting symlinks.

jonarmani avatar May 16 '23 20:05 jonarmani

> > I honestly don't like the idea of deleting documents
>
> What if only symlinks landed in the source docs dir? I feel no unease deleting symlinks.

That's a good point. But I still think the solution proposed in #201 (checking stored files through Chroma metadata) is cleaner. No need to delete anything.

imartinez avatar May 16 '23 21:05 imartinez

> > > I honestly don't like the idea of deleting documents
> >
> > What if only symlinks landed in the source docs dir? I feel no unease deleting symlinks.
>
> That's a good point. But I still think the solution proposed in #201 (checking stored files through Chroma metadata) is cleaner. No need to delete anything.

Ultimately, I think it'd be good to let the user choose if source documents are deleted or not. After I put up this PR, I noticed #201. So if we want to go w/ that option, sounds good.

RonquilloAeon avatar May 17 '23 03:05 RonquilloAeon

Closing in favor of #287

imartinez avatar May 18 '23 19:05 imartinez