Automatically remove documents that are not present in the source any more

Open scriptator opened this issue 1 year ago • 2 comments

I am wondering whether there is a built-in way to automatically remove documents from Danswer once they have been deleted in the source (for example, Google Drive). Looking at your source code, I don't see such a mechanism yet, even though deleting documents from Vespa is generally supported (it happens when connectors are removed).

If not, do you have any idea how you would go about this? If it's not too much work, I am considering contributing.

From what I could find out in a couple of minutes, _run_indexing calls the indexing pipeline on batches of documents, so the pipeline never sees the full list of documents belonging to a given connector at once. Maybe a separate function could be implemented that cleans up all documents that are not returned by LoadConnector.load_from_state?
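A rough sketch of that cleanup idea, assuming the IDs currently in the index can be listed and the connector yields the IDs it still sees in batches. The names `find_stale_document_ids`, `prune_deleted_documents`, and `delete_fn` are hypothetical and are not part of Danswer's API; the sketch only shows the set-difference logic, not how it would plug into Vespa or the connector interface.

```python
from typing import Callable, Iterable, List, Set


def find_stale_document_ids(
    indexed_ids: Iterable[str],
    source_batches: Iterable[List[str]],
) -> Set[str]:
    """Return IDs that are in the index but no longer yielded by the source.

    Batches are consumed one at a time, so the full source listing never
    has to be held in memory; only the indexed IDs are.
    """
    stale = set(indexed_ids)
    for batch in source_batches:
        stale.difference_update(batch)
    return stale


def prune_deleted_documents(
    indexed_ids: Iterable[str],
    source_batches: Iterable[List[str]],
    delete_fn: Callable[[str], None],
) -> Set[str]:
    """Call delete_fn (a hypothetical index-deletion hook) for each stale ID."""
    stale = find_stale_document_ids(indexed_ids, source_batches)
    for doc_id in sorted(stale):
        delete_fn(doc_id)
    return stale
```

For example, with indexed IDs `{"a", "b", "c"}` and source batches `[["a"], ["c"]]`, only `"b"` would be passed to `delete_fn`. A real implementation would also need to make sure the source listing is complete before deleting anything, since a partial or failed listing would otherwise look like mass deletion.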

scriptator avatar Jan 12 '24 11:01 scriptator

I have a similar question that might be related. As far as I can see, it's not possible (at least via the GUI) to remove files at the document set or connector level and thereby trigger reindexing. This is a problem, as users might change the content of documents, or even remove existing ones and replace them with new ones.

aagirre92 avatar Jan 17 '24 15:01 aagirre92

Updating documents should work without any problem, but removing documents that have been deleted from the source is currently not implemented. However, today I created pull request #1086, which aims to fix exactly that.

scriptator avatar Feb 16 '24 14:02 scriptator

@scriptator thanks a bunch for this change. Just merged it in!

Weves avatar Feb 22 '24 00:02 Weves