dify PDF page number is absent in knowledge retrieval

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.8.2

Cloud or Self Hosted

Cloud

Steps to reproduce

When I create a knowledge from PDF files, I expect to find the page number in the metadata, since #7749 has been merged. However, I can't find "page" in the metadata, as shown below. Am I missing anything needed to "enable" that feature?

    {
      "metadata": {
        "_source": "knowledge",
        "position": 1,
        "dataset_id": "f392217b-73ae-421f-8fea-207cafa9c926",
        "dataset_name": "Krav med raadstext...",
        "document_id": "0d7ba751-0ffd-4c81-8efa-be1d1c8e14ba",
        "document_name": "Krav med raadstexter TRVINFRA-00304 Plankorsningar v15_0.pdf",
        "document_data_source_type": "upload_file",
        "segment_id": "621a7608-9da0-4dac-8473-f644e9db8f35",
        "retriever_from": "workflow",
        "score": 0.32195564148658923,
        "segment_hit_count": 2,
        "segment_word_count": 2408,
        "segment_position": 60,
        "segment_index_node_hash": "4f1a192d11f56fa99fd1cb5f9a0b50ca25db417dccd32e035a0499d8aadb3da3"
      },

✔️ Expected Behavior

I expect to find the page number in the metadata returned by retrieval.

❌ Actual Behavior

The "page" key is missing from the metadata.
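
A quick way to confirm the symptom on a retrieval result is to look the key up directly. This is only a minimal sketch: `get_page` is a hypothetical helper, not part of Dify, and the sample is a trimmed copy of the metadata shown in the report.

```python
import json

# Trimmed copy of the metadata from the report; only a few fields are kept.
retrieval_result = json.loads("""
{
  "metadata": {
    "_source": "knowledge",
    "position": 1,
    "document_name": "Krav med raadstexter TRVINFRA-00304 Plankorsningar v15_0.pdf",
    "segment_position": 60
  }
}
""")


def get_page(result: dict):
    """Return the page number from a retrieval result, or None if it is absent.

    Hypothetical helper for illustration; not a Dify API.
    """
    return result.get("metadata", {}).get("page")


page = get_page(retrieval_result)
print("page:", page)
```

With the metadata above, the lookup finds no "page" entry, which matches the reported behavior.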

Sep 17 '24 12:09 fniu

Hey, @fniu! I'm here to help you with bugs, questions, and becoming a contributor. Let's work on solving this issue together!

I found a similar unsolved discussion that might be relevant to your issue: Get the page number of a PDF when fetching document segments ^[1].

Regarding your issue, the function responsible for extracting and adding the page number to the metadata is parse within the PdfExtractor class. This function is called within the load method of the PdfExtractor class. The page number is accessed and added to the metadata via metadata = {"source": blob.source, "page": page_number} ^[2].

Here is the relevant code snippet:

    def parse(self, blob: Blob) -> Iterator[Document]:
        """Lazily parse the blob."""
        import pypdfium2

        with blob.as_bytes_io() as file_path:
            pdf_reader = pypdfium2.PdfDocument(file_path, autoclose=True)
            try:
                for page_number, page in enumerate(pdf_reader):
                    text_page = page.get_textpage()
                    content = text_page.get_text_range()
                    text_page.close()
                    page.close()
                    metadata = {"source": blob.source, "page": page_number}
                    yield Document(page_content=content, metadata=metadata)
            finally:
                pdf_reader.close()
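
One detail worth noting in the loop above: `enumerate` starts at 0, so the "page" value it produces is zero-based. The sketch below mirrors only the metadata construction, using plain strings in place of PDF pages so that it runs without pypdfium2; `page_metadata` is a hypothetical stand-in, not Dify code.

```python
from typing import Iterator


def page_metadata(pages: list[str], source: str) -> Iterator[dict]:
    """Build the same metadata dict as the loop above, one per page.

    Hypothetical stand-in: `pages` is a list of page texts, not a PdfDocument.
    """
    for page_number, _content in enumerate(pages):
        yield {"source": source, "page": page_number}


metadata = list(page_metadata(["first page text", "second page text"], "example.pdf"))

# Convert to the 1-based number a reader sees in a PDF viewer.
first_printed_page = metadata[0]["page"] + 1
```

Anything that displays the page to users would need the `+ 1` adjustment, assuming the stored value stays zero-based.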

If the page number is not appearing in the metadata, it might be due to a configuration issue or a bug. Ensure that the PdfExtractor class is being used correctly and that the metadata is being processed and stored as expected.

If you have verified that the PdfExtractor is correctly implemented and the issue persists, it might be worth checking if there are any specific configurations or flags that need to be enabled to include the page number in the metadata. Unfortunately, I couldn't find specific information on this within the repository ^[2].

If the problem continues, you might want to follow up on the related discussion or create a new issue for further investigation.

To continue talking to Dosu, mention @dosu.

Sep 17 '24 12:09 dosubot[bot]

it's reverted.

Sep 18 '24 05:09 soulzzz

Hi, @fniu. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary

You reported missing page number metadata after creating knowledge from PDF files.
The issue might be linked to the PdfExtractor class and its parse function.
@soulzzz mentioned that the feature was reverted, possibly explaining the missing metadata.

Next Steps

Please confirm if this issue is still relevant to the latest version of the Dify repository. If so, you can keep the discussion open by commenting here.
If there is no further activity, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

Nov 25 '24 16:11 dosubot[bot]

This issue is still relevant.

Nov 28 '24 16:11 fniu

It seems that the page number was removed in https://github.com/langgenius/dify/issues/8211. I wonder if there is any chance of bringing it back? @JohnJyong

For example, there is no need to include the page in https://github.com/langgenius/dify/blob/ee342063d8aaaadbc8ccf2c3db8898709b0a0641/api/core/rag/datasource/vdb/vector_factory.py#L31, so that it won't break other parts. The page info could still be included in the metadata when it exists, like what https://github.com/langgenius/dify/pull/7749 did.
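
A rough sketch of that idea, under stated assumptions: `INDEX_ATTRIBUTES`, `build_metadata`, and `index_payload` are hypothetical names, and the attribute list is assumed for illustration rather than copied from `vector_factory.py`.

```python
from typing import Any, Optional

# Assumed for illustration: a fixed attribute list for the vector index.
# The real list lives in vector_factory.py and may differ.
INDEX_ATTRIBUTES = ["doc_id", "dataset_id", "document_id", "doc_hash"]


def build_metadata(base: dict[str, Any], page: Optional[int] = None) -> dict[str, Any]:
    """Copy the base metadata and attach "page" only when the extractor supplied it."""
    metadata = dict(base)
    if page is not None:
        metadata["page"] = page
    return metadata


def index_payload(metadata: dict[str, Any]) -> dict[str, Any]:
    """Keep only the fixed index attributes, so "page" never reaches the index schema."""
    return {key: metadata[key] for key in INDEX_ATTRIBUTES if key in metadata}


base = {"doc_id": "d1", "dataset_id": "ds1", "document_id": "doc1", "doc_hash": "abc"}
pdf_metadata = build_metadata(base, page=4)   # PDF chunk: page is known
text_metadata = build_metadata(base)          # non-PDF chunk: no page key at all
```

The point of the split is that the index attribute list stays unchanged, while the page travels in the metadata only for documents that have one.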

Dec 12 '24 14:12 cpwan

Hi, @fniu. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary

Bug reported in Dify version 0.8.2: missing page number in metadata when creating knowledge from PDFs.
I explained the PdfExtractor class and suggested a configuration issue or bug.
@soulzzz mentioned the feature was reverted, possibly explaining the missing metadata.
@cpwan suggested a method to reintroduce the page number without affecting other system parts.
You confirmed the issue remains relevant.

Next Steps

Please confirm if this issue is still relevant to the latest version of Dify. If so, you can keep the discussion open by commenting here.
If there is no further activity, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

Jan 12 '25 16:01 dosubot[bot]

@dosu

This issue should not be closed. Requests for a page number attribute keep coming up in other issues.

https://github.com/langgenius/dify/issues/11891

Feb 03 '25 06:02 cpwan

The removal of the 'page' field from the metadata in the Dify project, as discussed in issue #8211, seems to have been a point of contention. The PdfExtractor class originally included the 'page' field in the metadata, but this was later removed, possibly due to changes in the vector_factory.py file or other considerations. The request to reinstate the 'page' information without affecting existing functionality is ongoing, as seen in the discussions and related issues.

To address this, you might consider proposing a solution that allows the 'page' information to be included conditionally or as an optional feature. This way, it can be included in the metadata when needed without disrupting other parts of the system. Reviewing the implementation in pull request #7749, which successfully included the 'page' metadata, could provide insights into how this can be achieved without breaking changes.

This is a tough nut to crack, could really use your help on this, @takatost.

Feb 03 '25 06:02 dosubot[bot]

@dosu the original implementation considers only the semantic embedding case, yet there is now also keyword search in the economical index method.

The page number information should not be stored with the embedding vector. Instead, it should be stored along with the chunk, just like the keywords.
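
To illustrate what I mean, here is a minimal sketch, not Dify's actual data model: `Chunk`, `keyword_search`, and `vector_search` are hypothetical names, and the vectors are toy 2-d values. The page lives on the chunk record, so both retrieval paths can return it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical chunk record; not Dify's actual segment model."""
    chunk_id: str
    content: str
    keywords: list[str] = field(default_factory=list)
    page: Optional[int] = None  # stored with the chunk, not with the vector


chunks = {
    "c1": Chunk("c1", "Level crossings must have warning signs.", ["crossing", "signs"], page=12),
    "c2": Chunk("c2", "Barriers close before the train arrives.", ["barriers", "train"], page=13),
}

# The vector store holds only chunk ids and embeddings (toy 2-d vectors here).
vectors = {"c1": [1.0, 0.0], "c2": [0.0, 1.0]}


def keyword_search(term: str) -> list[Chunk]:
    """Economical index path: match on the keywords stored with each chunk."""
    return [chunk for chunk in chunks.values() if term in chunk.keywords]


def vector_search(query: list[float]) -> Chunk:
    """Semantic path: pick the id with the highest dot product, then load the chunk."""
    best_id = max(vectors, key=lambda cid: sum(a * b for a, b in zip(vectors[cid], query)))
    return chunks[best_id]
```

Either way the caller gets a `Chunk` back and reads `.page` from it, so the vector store schema does not need a page field.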

@dosu please locate which part of the code should be looked at.

Feb 03 '25 07:02 cpwan