feat: Add page_number attribute to document segments and update related retrieval logic
Summary
This feature was first introduced in #7749 but was then reverted because of a bug (see #8211). Since then, a couple of issues have asked for the feature to be reintroduced.
The problem with the original feature in #7749 is that it did not consider the case where not all documents have page number info, such as txt and md files. It also stored the page number attribute alongside the embedding, which is not the most natural place for it.
In this pull request, the page number is added to the Document Segment, which requires a change to the database schema. The page number is retrieved as part of the metadata.
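For reviewers, here is a minimal sketch of the idea described above, not the actual code in this PR. `SegmentSketch` and `build_metadata` are hypothetical names used only for illustration, and representing a missing page number as `None` is an assumption; the real model and the value used for page-less files may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentSketch:
    """Hypothetical stand-in for a document segment row (not the real model)."""
    content: str
    # Assumption: None means the source format has no page info (e.g. txt, md).
    page_number: Optional[int] = None


def build_metadata(segment: SegmentSketch, document_name: str) -> dict:
    """Assemble retrieval metadata, adding page_number only when it is known."""
    metadata = {"document_name": document_name}
    if segment.page_number is not None:
        metadata["page_number"] = segment.page_number
    return metadata


pdf_meta = build_metadata(SegmentSketch("intro text", page_number=3), "report.pdf")
txt_meta = build_metadata(SegmentSketch("plain notes"), "notes.txt")
```

Keeping the page number on the segment rather than next to the embedding means files without pages simply carry no value, and the retrieval layer can decide how to surface that.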
Resolves #8502 Resolves #11891
Screenshots
| Before | After |
|---|---|
Checklist
> [!IMPORTANT]
> Please review the checklist below before submitting your pull request.
- [ ] This change requires a documentation update, included: Dify Document
- [x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
- [x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
- [x] I've updated the documentation accordingly.
- [x] I ran `dev/reformat` (backend) and `cd web && npx lint-staged` (frontend) to appease the lint gods
Hi, do you have any updates about this feature being merged? Thank you.
Not sure, the maintainer does not respond at all.
@cpwan Could you resolve the conflicts? I will let @JohnJyong take a look at this.
I will try.
@crazywoola @JohnJyong Ready for review
Multi-page
pdf:
docx
The docx format may not provide page number info, so the page number is always 0.
Single page
txt
docx
I will take a look at this later