private-gpt
Incremental training (ingesting) of the data
Is your feature request related to a problem? Please describe. This is not a problem but an additional feature.
Describe the solution you'd like One may constantly be adding more training documents to the source folder. When one runs ingest.py, would it be possible to speed it up by ingesting only the newly added documents? It would also be nice if it could detect that some old documents have been removed and update the database accordingly.
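To make the request concrete, here is a minimal sketch of how newly added, changed, and removed documents could be detected between runs. It is not privateGPT code: `file_hash`, `diff_sources`, and the idea of keeping a path-to-hash manifest from the previous run are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in blocks so large documents are not loaded into memory at once.
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def diff_sources(
    folder: Path, manifest: dict[str, str]
) -> tuple[list[str], list[str], list[str], dict[str, str]]:
    """Compare the folder with the manifest saved by the previous run.

    `manifest` (hypothetical) maps relative path -> content hash.
    Returns (added, changed, removed, new_manifest).
    """
    current = {
        p.relative_to(folder).as_posix(): file_hash(p)
        for p in folder.rglob("*")
        if p.is_file()
    }
    added = sorted(k for k in current if k not in manifest)
    changed = sorted(
        k for k in current if k in manifest and current[k] != manifest[k]
    )
    removed = sorted(k for k in manifest if k not in current)
    return added, changed, removed, current
```

Only the files in `added` and `changed` would need to be ingested again; `removed` lists the documents whose entries could be dropped from the database. The returned manifest would have to be saved somewhere (for example as JSON) for the next run.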
Thanks a lot~!
Thank you so much~!
That was the wrong commit, I will update it soon!
Yes, this is a good feature. You could add it to privateGPT the way h2oGPT does it, which also allows updating from the CLI or from a running instance in the UI.
https://github.com/h2oai/h2ogpt/blob/main/gpt_langchain.py#L122-L154
Note that it's important not just to add new docs but also to delete the prior instances of those docs. That block shows how to do that.
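As a rough illustration of "delete the prior instances, then add", here is a sketch that uses a plain Python list as a stand-in for the vector database. It is not the h2oGPT code linked above; `replace_document` and the `source`/`text` record layout are hypothetical.

```python
def replace_document(store: list[dict], source: str, chunks: list[str]) -> int:
    """Replace all stored chunks of `source` with `chunks`.

    `store` is a toy in-memory stand-in for the db (an assumption for this
    sketch). Returns the number of stale records that were deleted.
    """
    before = len(store)
    # Step 1: drop every record that came from an earlier version of the file.
    store[:] = [rec for rec in store if rec["source"] != source]
    deleted = before - len(store)
    # Step 2: add the chunks from the current version of the file.
    store.extend({"source": source, "text": text} for text in chunks)
    return deleted
```

Without step 1, re-ingesting an edited file would leave both the old and the new chunks in the store, so queries could still return outdated text.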
In h2oGPT's UI, one just adds a new file to the originally specified folder and clicks "refresh sources" to update the db. I use a hash to detect file changes and add new files. If files are deleted, however, I don't remove anything from the db. I think that's the most commonly desired behavior: