private-gpt
Add new documents to an existing chroma collection
Addresses #124. With this new functionality, you can now add new documents to an existing chroma collection.
There is an interesting idea in #201 that uses the metadata already stored in the Chroma collection to avoid re-ingesting documents that are already there. That would make "deleting the docs after ingestion" unnecessary. I honestly don't like the idea of deleting documents. What do you think?
https://github.com/imartinez/privateGPT/pull/201/files#diff-abd34d324dcce966837adc13a98b86edc8f868d80e3ae601152f81a59c59734cR52
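A minimal sketch of the metadata-based check described above, not the code from #201. It assumes each stored chunk's metadata carries a `source` key holding the file path, and that the metadata list was fetched beforehand (for example with chromadb's `collection.get(include=["metadatas"])["metadatas"]`). The helper name `filter_new_files` is hypothetical.

```python
import os
from typing import Iterable, List, Mapping, Optional


def filter_new_files(
    candidate_paths: Iterable[str],
    stored_metadatas: Iterable[Optional[Mapping[str, object]]],
) -> List[str]:
    """Return the candidate paths whose source is not already in the collection."""
    # Assumption: each chunk's metadata has a "source" key with the file path.
    # Entries that are None or lack "source" are simply ignored.
    ingested = {
        os.path.normpath(str(meta["source"]))
        for meta in stored_metadatas
        if meta and "source" in meta
    }
    # Normalize both sides so "./docs/a.txt" and "docs/a.txt" compare equal.
    return [p for p in candidate_paths if os.path.normpath(p) not in ingested]
```

Only the files returned here would need to be loaded, split, and embedded, so nothing in the source directory has to be deleted.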
Once adding new documents without reloading everything works reliably, periodically persisting the db would be an effective way to avoid massive loss of effort when a bug or resource under-allocation aborts the run. For example, too few tokens may be available from the MODEL_N_CTX value in the .env; an Arabic or Chinese passage or document can cause that failure many hours into the process. See "Cannot ingest Chinese text file #19". Batching sub-collections of files into the db would seem a rather easy solution.
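A rough sketch of the batching idea, under stated assumptions: `ingest_batch` and `persist` are hypothetical callbacks standing in for whatever embeds a group of files and flushes the db to disk. The point is only that a failure in a later batch leaves the earlier batches persisted.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def ingest_in_batches(
    files: Sequence[str],
    batch_size: int,
    ingest_batch: Callable[[List[str]], None],  # hypothetical: embeds one batch
    persist: Callable[[], None],                # hypothetical: writes db to disk
) -> int:
    """Ingest files batch by batch, persisting after each successful batch."""
    completed = 0
    for batch in batched(files, batch_size):
        ingest_batch(batch)  # if this raises, earlier batches are already saved
        persist()
        completed += 1
    return completed
```

Combined with a check for already-ingested files, a rerun after an abort would only need to process the batches that had not been persisted.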
Definitely a good point. I think it belongs to a different Feature Request though. Feel free to create it
I honestly don't like the idea of deleting documents
What if only symlinks landed in the source docs dir? I feel no unease deleting symlinks.
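To make the symlink suggestion concrete, here is a small sketch using only the standard library. The function names and the directory layout are hypothetical, and it assumes a platform where creating symlinks is permitted. Cleanup removes only symlinks, so the originals and any regular files stay untouched.

```python
import os
from typing import List


def link_into(source_dir: str, originals: List[str]) -> List[str]:
    """Create a symlink in `source_dir` for each original file."""
    os.makedirs(source_dir, exist_ok=True)
    links = []
    for path in originals:
        link = os.path.join(source_dir, os.path.basename(path))
        # Point at the absolute path so the link works from any directory.
        os.symlink(os.path.abspath(path), link)
        links.append(link)
    return links


def remove_symlinks(source_dir: str) -> List[str]:
    """Delete only the symlinks in `source_dir`; return the removed names."""
    removed = []
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        if os.path.islink(path):
            os.unlink(path)  # removes the link itself, never its target
            removed.append(name)
    return removed
```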
That's a good point. But I still think the solution proposed in #201 (checking stored files through Chroma metadata) is cleaner. No need to delete anything.
Ultimately, I think it'd be good to let the user choose if source documents are deleted or not. After I put up this PR, I noticed #201. So if we want to go w/ that option, sounds good.
Closing in favor of #287