onyx
Automatically remove documents that are not present in the source any more
I am wondering whether there is any built-in way of automatically removing documents from Danswer that have been deleted in the source (say for example Google Drive). Looking at your source code, I don't see such a mechanism yet, even though document deletion from Vespa generally seems to be supported (when removing connectors).
If not, do you have any idea how you would go about this? If it's not too much work, I am considering contributing it.
From what I could find out in a couple of minutes, the indexing pipeline is called by _run_indexing on batches of documents, so the indexing pipeline never sees the full list of documents that belong to a given connector at once. Maybe a separate function could be implemented that cleans up all documents which are not returned by LoadConnector.load_from_state?
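To make the idea concrete, here is a rough sketch of what I have in mind. It is not based on the actual Danswer code: `find_stale_document_ids`, `cleanup_stale_documents`, and the `delete_fn` callback are hypothetical names, and I am assuming that the IDs currently in the index and the batches of IDs coming from the connector can both be obtained somehow. The cleanup starts from everything that is indexed and subtracts each batch the source still returns, so the full document list never has to be held in one piece.

```python
from typing import Callable, Iterable


def find_stale_document_ids(
    indexed_ids: Iterable[str],
    source_batches: Iterable[Iterable[str]],
) -> set[str]:
    """Return IDs that are in the index but were not returned by the source.

    `source_batches` stands in for the batches a connector yields
    (hypothetical; the real connector yields documents, not bare IDs).
    """
    stale = set(indexed_ids)
    for batch in source_batches:
        # Anything the source still returns is not stale.
        stale.difference_update(batch)
    return stale


def cleanup_stale_documents(
    indexed_ids: Iterable[str],
    source_batches: Iterable[Iterable[str]],
    delete_fn: Callable[[str], None],
) -> set[str]:
    """Delete every stale document via `delete_fn` (a hypothetical callback)."""
    # All batches are consumed before the first deletion, so an exception
    # raised while reading the source aborts the cleanup without deleting.
    stale = find_stale_document_ids(indexed_ids, source_batches)
    for doc_id in sorted(stale):
        delete_fn(doc_id)
    return stale
```

The important part is that nothing gets deleted unless the connector finished a complete pass. Otherwise a failed or partial run would make every document it did not reach look deleted.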
I have a similar question that might be related. As far as I can see, it's not possible (at least via the GUI) to remove files at the document set/connector level (and thereby trigger reindexing). This is a problem because users might change the content of the documents, or even remove existing ones and replace them with new ones.
Updating documents should work without any problem, but removing documents that have been deleted from the source is currently not implemented. However, today I created pull request #1086, which aims to fix exactly that.
@scriptator thanks a bunch for this change. Just merged it in!