
Incremental training (ingesting) of the data

Open chenle02 opened this issue 1 year ago • 3 comments

Is your feature request related to a problem? Please describe. This is not a problem but an additional feature.

Describe the solution you'd like. One may constantly be adding more training documents to the source folder. When one runs ingest.py, would it be possible to speed it up by adding only the newly added documents? It would also be nice if it could detect when some old documents have been removed and update the database accordingly.

Thanks a lot~!

chenle02 avatar Jun 05 '23 14:06 chenle02

Thank you so much~!

chenle02 avatar Jun 05 '23 15:06 chenle02

That was the wrong commit, I will update it soon!

armoliss avatar Jun 05 '23 15:06 armoliss

Yes, this is a good feature. privateGPT could do it the way h2oGPT does, which also allows updating from the CLI or, while running, from the UI.

https://github.com/h2oai/h2ogpt/blob/main/gpt_langchain.py#L122-L154

Note that it's important not just to add new docs but also to delete their prior instances. That block shows how to do that.

In h2oGPT's UI, one just adds a new file to the originally specified folder and clicks "refresh sources" to update the db. I use a hash to detect file changes and add new files. If files are deleted, however, I don't remove anything from the db. I think that's the most commonly desired behavior.

pseudotensor avatar Jun 05 '23 16:06 pseudotensor
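
The hash-based change detection discussed in this thread could be sketched roughly as follows. This is not the code from privateGPT or h2oGPT; it is a minimal standard-library illustration, and the function names (hash_file, scan_folder, diff_manifests, load_manifest, save_manifest) and the JSON manifest file are assumptions made for the example. It only works out which files are new, changed, or removed since the last run; feeding those lists into the vector database is left to the caller.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List, Tuple


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def scan_folder(folder: Path) -> Dict[str, str]:
    """Map each file path (relative to folder) to its content hash."""
    return {
        p.relative_to(folder).as_posix(): hash_file(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def diff_manifests(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Compare two manifests and return (added, changed, removed) paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, changed, removed


def load_manifest(path: Path) -> Dict[str, str]:
    """Load the manifest from the previous run, or {} on the first run."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_manifest(path: Path, manifest: Dict[str, str]) -> None:
    """Persist the manifest; keep it outside the scanned source folder."""
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

On each run, one would load the old manifest, scan the source folder, and diff the two. Files in "added" get ingested; files in "changed" have their prior chunks deleted before being re-ingested; files in "removed" can either be purged from the db or left alone, depending on which behavior is wanted. The new manifest is saved only after the database update succeeds.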