PDF page number is absent in knowledge retrieval
Self Checks
- [X] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [X] Please do not modify this template :) and fill in all the required fields.
Dify version
0.8.2
Cloud or Self Hosted
Cloud
Steps to reproduce
When I create a knowledge base from PDF files, I expect to find the page number in the metadata, since #7749 has been merged. However, I can't find "page" in the metadata, as shown below. Am I missing something needed to enable that feature?
{
"metadata": {
"_source": "knowledge",
"position": 1,
"dataset_id": "f392217b-73ae-421f-8fea-207cafa9c926",
"dataset_name": "Krav med raadstext...",
"document_id": "0d7ba751-0ffd-4c81-8efa-be1d1c8e14ba",
"document_name": "Krav med raadstexter TRVINFRA-00304 Plankorsningar v15_0.pdf",
"document_data_source_type": "upload_file",
"segment_id": "621a7608-9da0-4dac-8473-f644e9db8f35",
"retriever_from": "workflow",
"score": 0.32195564148658923,
"segment_hit_count": 2,
"segment_word_count": 2408,
"segment_position": 60,
"segment_index_node_hash": "4f1a192d11f56fa99fd1cb5f9a0b50ca25db417dccd32e035a0499d8aadb3da3"
},
✔️ Expected Behavior
I expect to find the page number in the metadata returned by retrieval.
❌ Actual Behavior
The "page" key is missing from the metadata.
Hey, @fniu! I'm here to help you with bugs, questions, and becoming a contributor. Let's work on solving this issue together!
I found a similar unsolved discussion that might be relevant to your issue: Get the page number of a PDF when fetching document segments [1].
Regarding your issue, the function responsible for extracting and adding the page number to the metadata is parse within the PdfExtractor class. This function is called within the load method of the PdfExtractor class. The page number is accessed and added to the metadata via metadata = {"source": blob.source, "page": page_number} [2].
Here is the relevant code snippet:
def parse(self, blob: Blob) -> Iterator[Document]:
    """Lazily parse the blob."""
    import pypdfium2

    with blob.as_bytes_io() as file_path:
        pdf_reader = pypdfium2.PdfDocument(file_path, autoclose=True)
        try:
            for page_number, page in enumerate(pdf_reader):
                text_page = page.get_textpage()
                content = text_page.get_text_range()
                text_page.close()
                page.close()
                metadata = {"source": blob.source, "page": page_number}
                yield Document(page_content=content, metadata=metadata)
        finally:
            pdf_reader.close()
If the page number is not appearing in the metadata, it might be due to a configuration issue or a bug. Ensure that the PdfExtractor class is being used correctly and that the metadata is being processed and stored as expected.
If you have verified that the PdfExtractor is correctly implemented and the issue persists, it might be worth checking if there are any specific configurations or flags that need to be enabled to include the page number in the metadata. Unfortunately, I couldn't find specific information on this within the repository [2].
If the problem continues, you might want to follow up on the related discussion or create a new issue for further investigation.
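For illustration, the per-page tagging described above can be reproduced without a PDF library. This is a minimal sketch, not the Dify implementation: the Document dataclass and parse_pages function are hypothetical stand-ins, and plain strings replace the pages that pypdfium2 would supply. It also shows that enumerate produces zero-based page numbers, so the first page is stored as 0.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Document:
    """Hypothetical stand-in for the Document type yielded by the extractor."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def parse_pages(source: str, pages: list[str]) -> Iterator[Document]:
    """Yield one Document per page, tagged with its zero-based page index."""
    for page_number, content in enumerate(pages):
        metadata = {"source": source, "page": page_number}
        yield Document(page_content=content, metadata=metadata)


# Two fake pages stand in for the text extracted from a real PDF.
docs = list(parse_pages("example.pdf", ["first page text", "second page text"]))
for doc in docs:
    print(doc.metadata)
```

If the page number is meant for display to readers, adding 1 at presentation time would match the numbering most PDF viewers show.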
it's reverted.
Hi, @fniu. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.
Issue Summary
- You reported missing page number metadata after creating knowledge from PDF files.
- The issue might be linked to the PdfExtractor class and its parse function.
- @soulzzz mentioned that the feature was reverted, possibly explaining the missing metadata.
Next Steps
- Please confirm if this issue is still relevant to the latest version of the Dify repository. If so, you can keep the discussion open by commenting here.
- If there is no further activity, this issue will be automatically closed in 15 days.
Thank you for your understanding and contribution!
This issue is still relevant.
It seems that the page number was removed in https://github.com/langgenius/dify/issues/8211 . I wonder if there is any chance to bring it back? @JohnJyong
For example, the page would not need to be included in https://github.com/langgenius/dify/blob/ee342063d8aaaadbc8ccf2c3db8898709b0a0641/api/core/rag/datasource/vdb/vector_factory.py#L31 , so that other parts are not broken.
The page info could still be included in the metadata whenever it exists, as https://github.com/langgenius/dify/pull/7749 did.
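A rough sketch of that idea, under stated assumptions: REQUIRED_ATTRIBUTES is a guess at the kind of fixed attribute list kept in vector_factory.py (the actual names at that commit may differ), and build_metadata is a hypothetical helper, not a Dify function. The fixed list stays unchanged, while "page" is copied through only when the extractor supplied it.

```python
from __future__ import annotations

# Assumed attribute names; the real list in vector_factory.py may differ.
REQUIRED_ATTRIBUTES = ["doc_id", "dataset_id", "document_id", "doc_hash"]

# Keys that are passed through only when present (hypothetical).
OPTIONAL_ATTRIBUTES = ["page"]


def build_metadata(raw: dict) -> dict:
    """Copy the required attributes, then add optional ones if they exist."""
    # A missing required attribute raises KeyError, as a fixed list would.
    metadata = {key: raw[key] for key in REQUIRED_ATTRIBUTES}
    for key in OPTIONAL_ATTRIBUTES:
        if raw.get(key) is not None:
            metadata[key] = raw[key]
    return metadata


base = {"doc_id": "d1", "dataset_id": "ds1", "document_id": "doc1", "doc_hash": "h1"}
with_page = build_metadata({**base, "page": 3})
without_page = build_metadata(base)
print(with_page)
print(without_page)
```

Documents from sources without pages (for example plain text or Markdown) would pass through unchanged, which is the "won't break other parts" property asked for above.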
Hi, @fniu. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.
Issue Summary
- Bug reported in Dify version 0.8.2: missing page number in metadata when creating knowledge from PDFs.
- I explained the PdfExtractor class and suggested a configuration issue or bug.
- @soulzzz mentioned the feature was reverted, possibly explaining the missing metadata.
- @cpwan suggested a method to reintroduce the page number without affecting other system parts.
- You confirmed the issue remains relevant.
Next Steps
- Please confirm if this issue is still relevant to the latest version of Dify. If so, you can keep the discussion open by commenting here.
- If there is no further activity, this issue will be automatically closed in 15 days.
Thank you for your understanding and contribution!
@dosu
This issue should not be closed. Requests for a page number attribute keep coming up in new issues:
https://github.com/langgenius/dify/issues/11891
The removal of the 'page' field from the metadata in the Dify project, as discussed in issue #8211, seems to have been a point of contention. The PdfExtractor class originally included the 'page' field in the metadata, but this was later removed, possibly due to changes in the vector_factory.py file or other considerations. The request to reinstate the 'page' information without affecting existing functionality is ongoing, as seen in the discussions and related issues.
To address this, you might consider proposing a solution that allows the 'page' information to be included conditionally or as an optional feature. This way, it can be included in the metadata when needed without disrupting other parts of the system. Reviewing the implementation in pull request #7749, which successfully included the 'page' metadata, could provide insights into how this can be achieved without breaking changes.
This is a tough nut to crack, could really use your help on this, @takatost.
@dosu the original implementation considers only the semantic embedding case, yet there is now also keyword search in the economical index method.
The page number should not be stored with the embedding vector. Instead, it should be stored along with the chunk, just like the keywords.
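To make the suggestion concrete, here is a minimal sketch that assumes a simplified in-memory store. Segment, to_result, keyword_search, and hydrate are all hypothetical names, not Dify code. The point is that the page lives on the chunk record, so any index (keyword or vector) that returns segment ids can surface it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Segment:
    """Hypothetical chunk record; the page sits next to the keywords."""

    segment_id: str
    content: str
    keywords: list[str] = field(default_factory=list)
    page: Optional[int] = None  # None for sources that have no pages


def to_result(segment: Segment) -> dict:
    """Build a retrieval result, including the page only when it is known."""
    metadata = {"segment_id": segment.segment_id}
    if segment.page is not None:
        metadata["page"] = segment.page
    return {"content": segment.content, "metadata": metadata}


def keyword_search(query: str, segments: dict[str, Segment]) -> list[dict]:
    """Economical-style lookup: match query terms against stored keywords."""
    terms = set(query.lower().split())
    hits = [s for s in segments.values() if terms & {k.lower() for k in s.keywords}]
    return [to_result(s) for s in hits]


def hydrate(segment_ids: list[str], segments: dict[str, Segment]) -> list[dict]:
    """Turn ids returned by any index (for example a vector store) into results."""
    return [to_result(segments[sid]) for sid in segment_ids if sid in segments]


store = {
    "seg-1": Segment("seg-1", "Rules for level crossings.", ["crossing", "rules"], page=12),
    "seg-2": Segment("seg-2", "Notes from a text file.", ["notes"]),
}
keyword_hits = keyword_search("crossing", store)
vector_hits = hydrate(["seg-1", "seg-2"], store)
print(keyword_hits)
print(vector_hits)
```

With this layout the vector index only needs to return segment ids, so no vector store schema has to change for the page to reach the retrieval metadata.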
@dosu please locate which part of the code should be looked at.