azure-search-openai-demo

Using indexers ingestion / Integrated Vectorization will not apply page number

Open daptatea opened this issue 1 year ago • 17 comments

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

When indexers are used to ingest from a storage account, the source page is not added, unlike when using prepdocs.ps1. Is there a way to add the source page when ingesting PDFs through indexers?


daptatea • Mar 06 '24 21:03

cc @srbalakr @mattgotteiner I think I saw this as well: I noticed my citation filenames were missing page numbers and wondered where they went.

pamelafox • Mar 06 '24 22:03

Also seeing this. It looks to me like it comes from the difference between the skillset document chunker and the original (non-integrated) chunking code.

Integrated vectorization maps sourcepage to the filename of the blob:

        index_projections = SearchIndexerIndexProjections(
            selectors=[
                SearchIndexerIndexProjectionSelector(
                    target_index_name=index_name,
                    parent_key_field_name="parent_id",
                    source_context="/document/pages/*",
                    mappings=[
                        InputFieldMappingEntry(name="content", source="/document/pages/*"),
                        InputFieldMappingEntry(name="embedding", source="/document/pages/*/vector"),
                        InputFieldMappingEntry(name="sourcepage", source="/document/metadata_storage_name"),
                    ],
                ),
            ],

The original (non-integrated) path maps sourcepage to the exact page of the chunk within the source file:

                        "sourcepage": (
                            BlobManager.blob_image_name_from_file_page(
                                filename=section.content.filename(),
                                page=section.split_page.page_num,
                            )
                            if image_embeddings
                            else BlobManager.sourcepage_from_file_page(
                                filename=section.content.filename(),
                                page=section.split_page.page_num,
                            )
                        ),

In the index, integrated leaves it looking like this:

      "sourcepage": "file.pdf",
      "sourcefile": null,

Whereas the non-integrated path in searchmanager.py leaves it looking like this:

      "sourcepage": "file-4.pdf",
      "sourcefile": "file.pdf"
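
To make the difference concrete, here is a minimal sketch of how a per-page name like the one above could be built, and how one might spot index documents that lack it. Both functions are hypothetical stand-ins written for illustration: `sourcepage_from_file_page` here only mimics the `file-4.pdf` format shown in this thread and is not the repo's actual `BlobManager` implementation, and `has_page_number` is an assumed heuristic.

```python
import os
import re


def sourcepage_from_file_page(filename: str, page: int) -> str:
    """Hypothetical helper: build a per-page name such as 'file-4.pdf'.

    Assumes the page value is inserted as-is between the base name
    and the extension, matching the index sample in this thread.
    """
    base, ext = os.path.splitext(os.path.basename(filename))
    return f"{base}-{page}{ext}"


def has_page_number(doc: dict) -> bool:
    """Heuristic check: does 'sourcepage' end in '-<digits>' before the extension?

    Caveat: a file whose own name ends in '-<digits>' (e.g. 'report-2024.pdf')
    would be reported as having a page number even if it does not.
    """
    sourcepage = doc.get("sourcepage") or ""
    base, _ext = os.path.splitext(sourcepage)
    return re.search(r"-\d+$", base) is not None


# The two index shapes quoted in this thread.
integrated_doc = {"sourcepage": "file.pdf", "sourcefile": None}
original_doc = {"sourcepage": sourcepage_from_file_page("file.pdf", 4), "sourcefile": "file.pdf"}
```

Running `has_page_number` over the documents returned from the index would flag the ones written by the integrated vectorization path, since their `sourcepage` is only the blob name.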

jakebowles99 • Apr 09 '24 10:04

Hi there @pamelafox @mattgotteiner, I'm looking for a solution to this issue. Is there a way to get the chunk page using integrated vectorization? Specifically, I'm trying to ensure that source page numbers are included in the index.

Any guidance or suggestions on how to achieve this with the integrated vectorization approach would be greatly appreciated.

luixlacrux • Jul 25 '24 02:07

Found these possible solutions, but both feel suboptimal to me

CICDamen • Aug 21 '24 14:08

I asked the AI Search team about this and got a few suggestions:

pamelafox • Sep 06 '24 18:09

Are you planning on implementing it?

ogimgio • Sep 23 '24 15:09