Full Doc mode document can't be retrieved after unarchive
Self Checks
- [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
1.1.0
Cloud or Self Hosted
Self Hosted (Source)
Steps to reproduce
1. Create a knowledge base, let's say: KB1
2. Upload a document file (attached example file FullDoc_UnArchive_Retrieve_bug_demo.md), click Next to Chunk Settings page.
3. Select Parent-child - Full Doc - High quality - Hybrid Search, click Save & Process to Documents page
4. Go to Retrieval Testing page, test "Introduce dify", the uploaded md file is retrieved in result section.
5. Back to Documents page, select the document and click Archive, then 'UnArchive'
6. Repeat step 4, no file retrieved.
✔️ Expected Behavior
After UnArchive, the Markdown file should be retrieved.
❌ Actual Behavior
After UnArchive, the Markdown file isn't retrieved.
Hello, @efreykongcn! I'm here to help you with bug fixes, questions, and contributions. Let's work on resolving the issue you've encountered.
The issue you're experiencing with the "Full Doc" mode not retrieving documents after unarchiving might be related to how Dify handles document states. When a document is unarchived, it should become eligible for retrieval as long as it is not archived and is enabled [1]. However, there might be a specific issue with the "Full Doc" mode or the unarchiving process in version 1.1.0 that is preventing the document from being retrieved.
Unfortunately, I couldn't find specific information about known issues with the "Full Doc" mode in the current version. It might be helpful to check if there are any logs or error messages generated during the unarchiving process that could provide more insight into why the document is not being retrieved. Additionally, reviewing the retrieval settings and ensuring that the document is correctly processed after unarchiving might help resolve the issue.
According to my investigation, the child_chunks property in dify/api/models/dataset.py (line 710) only loads child chunks when rules.parent_mode != ParentMode.FULL_DOC. Because of this restriction, Dify doesn't load child_chunks for Full Doc mode documents, so the child chunks can't be re-indexed when the document is unarchived. If I remove this restriction, the retrieval test works for the unarchived Full Doc file. I'm wondering what this restriction is designed for.
child_chunks definition in dataset.py:
    @property
    def child_chunks(self):
        process_rule = self.document.dataset_process_rule
        if process_rule.mode == "hierarchical":
            rules = Rule(**process_rule.rules_dict)
            if rules.parent_mode and rules.parent_mode != ParentMode.FULL_DOC:
                child_chunks = (
                    db.session.query(ChildChunk)
                    .filter(ChildChunk.segment_id == self.id)
                    .order_by(ChildChunk.position.asc())
                    .all()
                )
                return child_chunks or []
            else:
                return []
        else:
            return []
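
To make the effect of that condition easier to see, here is a minimal standalone sketch of the same gate with no database involved. FakeSegment, the skip_full_doc flag, and the enum string values are hypothetical stand-ins invented for illustration, not Dify's actual classes or values; only the branching mirrors the property quoted above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ParentMode(str, Enum):
    # Stand-in for Dify's ParentMode; the string values are assumptions.
    FULL_DOC = "full-doc"
    PARAGRAPH = "paragraph"


@dataclass
class FakeSegment:
    # Hypothetical stand-in for a document segment; chunks live in a plain list.
    mode: str
    parent_mode: Optional[ParentMode]
    stored_child_chunks: List[str] = field(default_factory=list)

    def child_chunks(self, skip_full_doc: bool = True) -> List[str]:
        """Mirror the gate in the quoted property.

        skip_full_doc=True reproduces the quoted condition.
        skip_full_doc=False models the local change of removing the restriction.
        """
        if self.mode != "hierarchical" or not self.parent_mode:
            return []
        if skip_full_doc and self.parent_mode == ParentMode.FULL_DOC:
            # This is the branch that hides child chunks for Full Doc mode.
            return []
        return list(self.stored_child_chunks)


full_doc = FakeSegment("hierarchical", ParentMode.FULL_DOC, ["c1", "c2"])
paragraph = FakeSegment("hierarchical", ParentMode.PARAGRAPH, ["c1", "c2"])

print(full_doc.child_chunks())                      # gate applied to Full Doc
print(full_doc.child_chunks(skip_full_doc=False))   # restriction removed
print(paragraph.child_chunks())                     # paragraph mode, gate applied
```

With the gate in place, the Full Doc segment reports no child chunks even though two are stored, which would be consistent with nothing being re-indexed on unarchive. Whether removing the check is safe for other callers of child_chunks is something this sketch doesn't answer.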