azure-search-openai-demo
Using indexer ingestion / Integrated Vectorization does not apply page numbers
Please provide us with the following information:
This issue is for a: (mark with an x)
- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
When using indexers to ingest from a storage account, the source page is not added, unlike when using prepdocs.ps1. Is there a way to add the source page when ingesting PDFs via indexers?
Any log messages given by the failure
Expected/desired behavior
OS and Version?
Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)
azd version?
Run `azd version` and copy-paste the output here.
Versions
Mention any other details that might be useful
Thanks! We'll be in touch soon.
cc @srbalakr @mattgotteiner I think I saw this as well: I noticed my citation filenames were missing page numbers and wondered where they went.
Also seeing this. It looks to me like it comes from a difference in how the skillset chunks documents.

Integrated vectorization maps sourcepage to the blob filename:
```python
index_projections = SearchIndexerIndexProjections(
    selectors=[
        SearchIndexerIndexProjectionSelector(
            target_index_name=index_name,
            parent_key_field_name="parent_id",
            source_context="/document/pages/*",
            mappings=[
                InputFieldMappingEntry(name="content", source="/document/pages/*"),
                InputFieldMappingEntry(name="embedding", source="/document/pages/*/vector"),
                InputFieldMappingEntry(name="sourcepage", source="/document/metadata_storage_name"),
            ],
        ),
    ],
)
```
The original (non-integrated) path maps sourcepage to the exact chunk within the sourcefile:
```python
"sourcepage": (
    BlobManager.blob_image_name_from_file_page(
        filename=section.content.filename(),
        page=section.split_page.page_num,
    )
    if image_embeddings
    else BlobManager.sourcepage_from_file_page(
        filename=section.content.filename(),
        page=section.split_page.page_num,
    )
),
```
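For context, here is a minimal sketch of what a helper like `BlobManager.sourcepage_from_file_page` plausibly does, based on the "file-4.pdf" example in this thread. This is an assumption for illustration, not the repo's actual implementation:

```python
import os


def sourcepage_from_file_page(filename: str, page: int = 0) -> str:
    """Sketch: build a per-page citation name like 'file-4.pdf'.

    Assumed behavior, mirroring the 'file-4.pdf' example in this
    thread: for PDFs, embed the page number before the extension;
    for non-paginated formats, return the filename unchanged.
    """
    base, ext = os.path.splitext(filename)
    if ext.lower() == ".pdf":
        return f"{base}-{page}{ext}"
    return filename


print(sourcepage_from_file_page("file.pdf", 4))  # file-4.pdf
print(sourcepage_from_file_page("notes.txt"))    # notes.txt
```

The key point is that this mapping runs per chunk at ingestion time, which is exactly what the integrated-vectorization projection above cannot do when it maps sourcepage straight from `/document/metadata_storage_name`.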
In the index, integrated vectorization leaves documents looking like this:

```json
"sourcepage": "file.pdf",
"sourcefile": null,
```

whereas the non-integrated searchmanager.py path leaves them looking like this:

```json
"sourcepage": "file-4.pdf",
"sourcefile": "file.pdf"
```
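To make the difference concrete: with the non-integrated convention, the page number can be recovered from the chunk name, while the integrated output carries no page information at all. A hypothetical parser (the function name is mine, not from the repo) for the "file-4.pdf" naming shown above:

```python
import re
from typing import Optional


def page_from_sourcepage(sourcepage: str) -> Optional[int]:
    """Hypothetical: recover the page number from a chunk name
    like 'file-4.pdf'. Returns None when no page suffix exists,
    as in the integrated-vectorization output above."""
    m = re.match(r"^.+-(?P<page>\d+)\.pdf$", sourcepage)
    return int(m.group("page")) if m else None


print(page_from_sourcepage("file-4.pdf"))  # 4
print(page_from_sourcepage("file.pdf"))    # None
```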
Hi there @pamelafox @mattgotteiner, I'm looking for a solution to this issue. Is there a way to get the chunk's page using integrated vectorization? Specifically, I'm trying to ensure that source page numbers are included in the index.
Any guidance or suggestions on how to achieve this with the integrated vectorization approach would be greatly appreciated.
Found these possible solutions, but both feel suboptimal to me.
I asked the AI Search team about this and got a few suggestions:
- Through AI Document Intelligence, using a custom skill: they would just need to modify this sample to split the docs with Document Intelligence instead of the split skill, and it can provide the page number as required: azure-search-vector-samples/demo-python/code/indexers/document-intelligence-custom-skill/document-intelligence-custom-skill.ipynb at main · Azure/azure-search-vector-samples (github.com)
- Through their own custom skill, using whatever library/method they prefer to extract the data: they could use GPT-4o as the extraction method in a custom skill and change the prompt to retrieve the page too. This sample code can be used for the extraction, adapting its outputs for a custom skill: liamca/GPT4oContentExtraction: Using Azure OpenAI GPT 4o to extract information such as text, tables and charts from Documents to Markdown (github.com)
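Both suggestions boil down to a Web API custom skill that emits per-page chunks. A rough sketch of the response shape such a skill would return, following Azure AI Search's custom skill interface (`values` / `recordId` / `data`); the extraction step itself (Document Intelligence or GPT-4o in the linked samples) is stubbed out here, and the `chunks` / `pageNumber` output names are my own choice, not fixed by the interface:

```python
import json


def build_skill_response(record_id: str, pages: list[str]) -> dict:
    """Shape a Web API custom skill response that emits chunks
    carrying page numbers, so an index projection could map a path
    like /document/chunks/*/pageNumber into sourcepage.

    `pages` stands in for text already extracted per page by
    whatever extraction method the skill uses.
    """
    return {
        "values": [
            {
                "recordId": record_id,
                "data": {
                    "chunks": [
                        {"text": text, "pageNumber": i + 1}
                        for i, text in enumerate(pages)
                    ]
                },
                "errors": [],
                "warnings": [],
            }
        ]
    }


resp = build_skill_response("0", ["page one text", "page two text"])
print(json.dumps(resp, indent=2))
```

The indexer would then use this skill's output as the `source_context` for the index projection, instead of `/document/pages/*` from the split skill.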
Are you planning on implementing it?