feat: Add page_number attribute to document segments and update related retrieval logic

Open cpwan opened this issue 10 months ago • 5 comments

Summary

This feature was first introduced in #7749 but was then reverted because of a bug (see #8211). Since then, several issues have asked for the feature to be reintroduced.

The problem with the original feature in #7749 is that it did not consider that not all documents have page number information, for example txt and md files. It also stored the page number attribute alongside the embedding, which is not the most natural place for it.

In this pull request, the page number is added to Document Segment, which requires a database schema change. The page number is retrieved as part of the metadata.
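As a rough illustration of the idea only (not the code in this PR), the sketch below shows a segment carrying an optional page number that is surfaced through retrieval metadata. The `Segment` class, the `build_metadata` helper, and the choice of `None` for formats without pages are assumptions made for this example and do not reflect Dify's actual models or schema.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Segment:
    """Hypothetical stand-in for a document segment row (not Dify's real model)."""

    content: str
    position: int
    # Assumption: None means the source format carries no page info (e.g. txt, md).
    page_number: Optional[int] = None


def build_metadata(segment: Segment) -> dict[str, Any]:
    """Build retrieval metadata, adding page_number only when it is known."""
    metadata: dict[str, Any] = {"position": segment.position}
    if segment.page_number is not None:
        metadata["page_number"] = segment.page_number
    return metadata


# Usage: a PDF segment has a page number, a txt segment does not.
pdf_segment = Segment(content="Quarterly results", position=1, page_number=3)
txt_segment = Segment(content="Plain notes", position=2)
pdf_metadata = build_metadata(pdf_segment)
txt_metadata = build_metadata(txt_segment)
```

Keeping the page number on the segment rather than with the embedding means formats without pages simply leave the field empty, and the retrieval side only has to check whether the key is present.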

Resolves #8502 Resolves #11891

Screenshots

Before After
image image

Checklist

[!IMPORTANT]
Please review the checklist below before submitting your pull request.

  • [ ] This change requires a documentation update, included: Dify Document
  • [x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • [x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • [x] I've updated the documentation accordingly.
  • [x] I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

cpwan avatar Feb 14 '25 09:02 cpwan

Hi, do you have any updates about this feature being merged? Thank you.

neyec avatar May 01 '25 13:05 neyec

Not sure, the maintainer does not respond at all.

cpwan avatar May 01 '25 23:05 cpwan

@cpwan Could you resolve the conflicts? I will let @JohnJyong take a look at this.

crazywoola avatar May 14 '25 07:05 crazywoola

i will try

cpwan avatar May 14 '25 08:05 cpwan

@crazywoola @JohnJyong Ready for review

Multi-page

pdf:

image

docx

The docx format may not provide page number information, so the page number is always 0.

image

Single page

txt

image

pdf

image

docx

image

cpwan avatar May 18 '25 08:05 cpwan

I will take a look at this later

crazywoola avatar May 21 '25 07:05 crazywoola