
[Feature Request]: Files deleted from a data source (S3 or S3-compatible) still exist in the Ragflow dataset

Open Furkan-Demir opened this issue 1 month ago • 9 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

Describe your problem

Version: v0.22.1 (Docker, CPU)

Files deleted from the data source are not removed from the Ragflow dataset. Is this a bug, or is it out of scope?

Steps to Reproduce:

  1. Add an S3 (or S3-compatible) bucket as a data source.
  2. Upload several files to the bucket, let's say a.pdf, b.pdf, c.pdf.
  3. Wait for Ragflow’s sync job to complete.
  4. Confirm that the files appear in the Ragflow dataset (sync works as expected).
  5. Delete b.pdf from the data source.
  6. Wait for the next Ragflow sync job.
  7. Observe that b.pdf still appears in the Ragflow dataset.

Expected Behavior:

Deleted files in the data source should also be removed from the Ragflow dataset after the sync job.

Actual Behavior:

The deleted file (b.pdf) remains in the Ragflow dataset even after multiple sync cycles.

Furkan-Demir avatar Nov 21 '25 16:11 Furkan-Demir

During incremental synchronization, it's not easy to delete removed files. We're working on it.

KevinHuSh avatar Nov 24 '25 02:11 KevinHuSh

Hi! I'm the product manager. Thanks for this feedback. Why do you want this feature? Could you share your use cases? That will help us fully understand your problem.

ZhenhangTung avatar Nov 24 '25 09:11 ZhenhangTung

Hi! I'm the product manager. Thanks for this feedback. Why do you want this feature? Could you share your use cases? That will help us fully understand your problem. @ZhenhangTung

Hello. We have some documents whose content changes frequently. Sometimes we also want to delete files from the data source because we no longer need those documents.

During incremental synchronization, it's not easy to delete removed files. We're working on it. @KevinHuSh

Thanks for your work. By the way, I'm going to review these changes and test them in about 3-4 hours. PR11468

Furkan-Demir avatar Nov 24 '25 11:11 Furkan-Demir

I can't build PR11468 as a Docker image on my MacBook (MBP M2, Sequoia 15.6), and I don't have a Windows machine. My test cases would be these:

Case 1:

  • Upload a pdf (let's say a.pdf) to S3
  • Sync with Ragflow
  • Overwrite a.pdf (or delete it and upload a file with the same name to the same bucket path)
  • Expectation: in Ragflow, a.pdf is the overwritten version. When I ask Ragflow Chat a question, it answers with the new information from the overwritten (deleted and re-uploaded) a.pdf

Case 2:

  • Upload a pdf file to S3. (let's say b.pdf)
  • Sync with Ragflow
  • Delete b.pdf from S3
  • Expectation: in Ragflow, b.pdf is deleted and when I ask a question to Ragflow Chat, it does not provide any information from b.pdf
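
Both cases come down to comparing a listing of the bucket with what is already indexed. Below is a minimal sketch of that comparison, assuming each object carries a content fingerprint such as an S3 ETag. `SyncPlan` and `plan_sync` are hypothetical names for illustration only; they are not part of Ragflow or PR11468.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    """Hypothetical result of comparing the bucket with the dataset."""
    to_add: list[str] = field(default_factory=list)
    to_update: list[str] = field(default_factory=list)
    to_delete: list[str] = field(default_factory=list)


def plan_sync(source: dict[str, str], indexed: dict[str, str]) -> SyncPlan:
    """Compare two maps of object key -> fingerprint (assumed: an ETag).

    `source` is the current bucket listing, `indexed` is what the
    dataset recorded at the previous sync.
    """
    plan = SyncPlan()
    for key, fingerprint in sorted(source.items()):
        if key not in indexed:
            plan.to_add.append(key)       # new object in the bucket
        elif indexed[key] != fingerprint:
            plan.to_update.append(key)    # same path, different content
    # Anything indexed but no longer listed was deleted at the source.
    plan.to_delete = sorted(k for k in indexed if k not in source)
    return plan


# Case 1: a.pdf was overwritten, so its fingerprint changed.
case1 = plan_sync({"a.pdf": "etag-v2"}, {"a.pdf": "etag-v1"})

# Case 2: b.pdf was deleted from the bucket.
case2 = plan_sync({}, {"b.pdf": "etag-v1"})
```

In Case 1 the plan should list a.pdf under `to_update`; in Case 2 it should list b.pdf under `to_delete`. The chat-level expectation (no answers drawn from b.pdf) would still have to be checked against a running instance.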

Could you run these tests and share the results with us, please? @Woody-Hu

Furkan-Demir avatar Nov 25 '25 09:11 Furkan-Demir

I actually had a requirements meeting with Siemens today, and they brought up the same requirement. Any updates or deletions in S3 or other data sources should also be reflected in RAGFlow. My current thought is to make this behavior configurable on the frontend, so developers can choose whether to enable it.

ZhenhangTung avatar Nov 25 '25 09:11 ZhenhangTung

IMO incremental synchronization is not a good fit for the data source sync case. But I'm not a contributor, so it's your opinions that matter, not mine :)

Furkan-Demir avatar Nov 25 '25 11:11 Furkan-Demir

IMO incremental synchronization is not a good fit for the data source sync case. But I'm not a contributor, so it's your opinions that matter, not mine :)

I don’t think so 🙂 As a product manager, understanding everyone’s perspective is one of the most important parts of my job. So your opinions absolutely carry weight for me.

ZhenhangTung avatar Nov 25 '25 11:11 ZhenhangTung

@ZhenhangTung @Furkan-Demir @KevinHuSh I have carefully reviewed your conversation and consulted with other users. I believe the best approach is to offer users two options for S3 file transfers: “incremental sync” and “full sync.” However, implementing this would increase the workload for developers. I hope my explanation is clear. Wish you all the best.
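
To illustrate the difference between the two options, here is a rough sketch under my own assumptions, not a description of how Ragflow actually syncs. It assumes an incremental pass only visits objects modified since the last checkpoint, while a full pass compares the whole bucket listing with the dataset. `run_sync` and its parameters are hypothetical names.

```python
from datetime import datetime, timezone


def run_sync(mode, source, indexed, last_sync):
    """Return (keys_to_ingest, keys_to_remove) for one sync cycle.

    source:    dict of object key -> last-modified time (bucket listing)
    indexed:   set of keys already present in the dataset
    last_sync: time of the previous successful sync
    """
    if mode == "incremental":
        # Only objects modified after the checkpoint are visited. A deleted
        # object never shows up here, so nothing is ever removed.
        changed = sorted(k for k, ts in source.items() if ts > last_sync)
        return changed, []
    if mode == "full":
        # The whole listing is compared with the dataset, so keys that
        # are indexed but missing from the bucket can be detected.
        changed = sorted(
            k for k, ts in source.items() if k not in indexed or ts > last_sync
        )
        removed = sorted(indexed - set(source))
        return changed, removed
    raise ValueError(f"unknown sync mode: {mode}")


last_sync = datetime(2025, 11, 24, tzinfo=timezone.utc)
source = {
    "a.pdf": datetime(2025, 11, 25, tzinfo=timezone.utc),  # overwritten
    "c.pdf": datetime(2025, 11, 20, tzinfo=timezone.utc),  # unchanged
}
indexed = {"a.pdf", "b.pdf", "c.pdf"}  # b.pdf was deleted from the bucket

incremental_result = run_sync("incremental", source, indexed, last_sync)
full_result = run_sync("full", source, indexed, last_sync)
```

Under these assumptions the incremental pass re-ingests a.pdf but leaves b.pdf in place, while the full pass also reports b.pdf for removal, at the cost of listing the entire bucket on every cycle.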

goldengolden7981 avatar Dec 05 '25 02:12 goldengolden7981

@goldengolden7981 Yep. We need to offer an option so developers can choose the one that best fits their use case.

ZhenhangTung avatar Dec 05 '25 03:12 ZhenhangTung