[Feature Request]: Files deleted from a data source (S3 or S3-compatible) still exist in the RAGFlow dataset
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
Describe your problem
Version: v0.22.1 (Docker, CPU)
Files deleted from the data source are not removed from the RAGFlow dataset. Is this a bug, or out of scope?
Steps to Reproduce:
- Add an S3 (or S3-compatible) bucket as a data source.
- Upload several files to the bucket, say `a.pdf`, `b.pdf`, `c.pdf`.
- Wait for RAGFlow's sync job to complete.
- Confirm that the files appear in the RAGFlow dataset (sync works as expected).
- Delete `b.pdf` from the data source.
- Wait for the next RAGFlow sync job.
- Observe that `b.pdf` still appears in the RAGFlow dataset.
Expected Behavior:
Files deleted from the data source should also be removed from the RAGFlow dataset after the sync job.
Actual Behavior:
The deleted file (`b.pdf`) remains in the RAGFlow dataset even after multiple sync cycles.
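Conceptually, the missing piece is a diff between the current bucket listing and what the dataset still holds. A minimal sketch of that idea (function and variable names are hypothetical, not RAGFlow's actual API):

```python
# Hypothetical sketch of deletion propagation during a sync pass.
# keys_to_remove() returns documents still in the dataset whose
# source object no longer exists in the bucket.

def keys_to_remove(source_keys, dataset_keys):
    return sorted(set(dataset_keys) - set(source_keys))

# After b.pdf is deleted from the bucket:
source = ["a.pdf", "c.pdf"]            # current S3 listing
dataset = ["a.pdf", "b.pdf", "c.pdf"]  # documents RAGFlow still holds
print(keys_to_remove(source, dataset))  # ['b.pdf']
```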
During incremental synchronization, it's not easy to delete removed files. We're working on it.
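For context on why this is hard: an incremental pass typically only considers objects modified since the last sync, and a deleted object simply never appears in the listing, which looks identical to "unchanged". A toy illustration (hypothetical names, not the actual sync code):

```python
from datetime import datetime, timezone

# Toy illustration: an incremental pass filters the listing by timestamp,
# so a deleted key (b.pdf) is simply absent and produces no signal at all.

def incremental_changes(listing, last_sync):
    """listing: {key: last_modified}; returns only new/updated keys."""
    return sorted(k for k, ts in listing.items() if ts > last_sync)

last_sync = datetime(2024, 1, 1, tzinfo=timezone.utc)
listing = {
    "a.pdf": datetime(2023, 12, 1, tzinfo=timezone.utc),  # unchanged
    "c.pdf": datetime(2024, 2, 1, tzinfo=timezone.utc),   # updated
}  # b.pdf was deleted, so it does not appear in the listing at all
print(incremental_changes(listing, last_sync))  # ['c.pdf'] (no deletion signal)
```

Detecting deletions therefore requires comparing a full listing against the previous state, which is why it is more work than ingesting new objects.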
Hi! I'm the product manager. Thanks for this feedback. Why do you want this feature? Could you share your use cases? That will help us fully understand your problem.
> Hi! I'm the product manager. Thanks for this feedback. Why do you want this feature? Could you share your use cases? That will help us fully understand your problem.

@ZhenhangTung
Hello. We have some documents whose content changes frequently. And sometimes we want to delete files from the data source because these documents are no longer needed.
> During incremental synchronization, it's not easy to delete removed files. We're working on it.

@KevinHuSh
Thanks for your work. By the way, I'm going to review these changes and test them in about 3-4 hours. PR11468
I can't build PR11468 as a Docker image on my MacBook (MBP M2, Sequoia 15.6), and I don't have a Windows machine. My test cases will be these:
Case 1:
- Upload a PDF (let's say `a.pdf`) to S3
- Sync with RAGFlow
- Overwrite `a.pdf` (or delete it and upload a file with the same name to the same bucket path)
- Expectation: in RAGFlow, `a.pdf` is the overwritten version. When I ask a question in RAGFlow Chat, it gives me the new information from the overwritten (deleted & re-uploaded) `a.pdf` version.
Case 2:
- Upload a PDF file to S3 (let's say `b.pdf`)
- Sync with RAGFlow
- Delete `b.pdf` from S3
- Expectation: in RAGFlow, `b.pdf` is deleted, and when I ask a question in RAGFlow Chat, it does not provide any information from `b.pdf`.
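Both cases could in principle be detected mechanically by snapshotting object ETags between syncs and diffing the snapshots; a hypothetical sketch (helper names and ETag values are illustrative):

```python
# Hypothetical sketch: compare the ETag snapshot from the previous sync
# with the current listing to detect overwrites (Case 1) and deletions (Case 2).

def classify(prev, curr):
    """prev/curr map object key -> ETag. Returns (updated, deleted)."""
    updated = sorted(k for k in curr if k in prev and curr[k] != prev[k])
    deleted = sorted(k for k in prev if k not in curr)
    return updated, deleted

prev = {"a.pdf": "etag-1", "b.pdf": "etag-2"}
curr = {"a.pdf": "etag-9"}  # a.pdf overwritten (Case 1), b.pdf gone (Case 2)
print(classify(prev, curr))  # (['a.pdf'], ['b.pdf'])
```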
Could you run these tests and share the results with us, please? @Woody-Hu
I actually had a requirements meeting with Siemens today, and they brought up the same requirement: any updates or deletions in S3 or other data sources should also be reflected in RAGFlow. My current thought is to make this behavior configurable on the frontend, so developers can choose whether to enable it.
imo incremental synchronization is not a good fit for the data source sync case. But I'm not a contributor, so your opinions matter, not mine :)
> imo incremental synchronization is not a good fit for the data source sync case. But I'm not a contributor, so your opinions matter, not mine :)
I don’t think so 🙂 As a product manager, understanding everyone’s perspective is one of the most important parts of my job. So your opinions absolutely carry weight for me.
@ZhenhangTung @Furkan-Demir @KevinHuSh I have carefully reviewed your conversation and consulted with other users. I believe the best approach is to offer users two options for S3 file transfers: "incremental sync" and "full sync." However, implementing this would increase the workload for developers. I hope my explanation is clear. Wish you all the best.
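If the two options were exposed as a per-data-source setting, the switch could be as simple as the following sketch (the option name and values are hypothetical, not an actual RAGFlow config key):

```python
from enum import Enum

# Hypothetical per-data-source sync setting: "incremental" only ingests
# new/updated objects; "full" additionally removes documents whose source
# object has been deleted.

class SyncMode(Enum):
    INCREMENTAL = "incremental"
    FULL = "full"

def should_remove_missing(mode: SyncMode) -> bool:
    return mode is SyncMode.FULL

print(should_remove_missing(SyncMode.FULL))         # True
print(should_remove_missing(SyncMode.INCREMENTAL))  # False
```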
@goldengolden7981 Yep. We need to offer an option so developers can choose the one that best fits their use case.