azure-search-openai-demo
Using indexer ingestion / Integrated Vectorization does not apply page numbers
Please provide us with the following information:
This issue is for a: (mark with an x)
- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
When using indexers to ingest from a storage account, the source page is not added, unlike when using prepdocs.ps1. Is there a way to add the source page when ingesting PDFs via indexers?
Any log messages given by the failure
Expected/desired behavior
OS and Version?
Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)
azd version?
Run `azd version` and copy-paste the output here.
Versions
Mention any other details that might be useful
Thanks! We'll be in touch soon.
cc @srbalakr @mattgotteiner I think I saw this as well: I noticed my citation filenames were missing page numbers and wondered where they went.
Also seeing this. It looks to me like it comes from a difference in how the skillset chunks documents.

Integrated vectorization maps sourcepage to the blob filename:
```python
index_projections = SearchIndexerIndexProjections(
    selectors=[
        SearchIndexerIndexProjectionSelector(
            target_index_name=index_name,
            parent_key_field_name="parent_id",
            source_context="/document/pages/*",
            mappings=[
                InputFieldMappingEntry(name="content", source="/document/pages/*"),
                InputFieldMappingEntry(name="embedding", source="/document/pages/*/vector"),
                InputFieldMappingEntry(name="sourcepage", source="/document/metadata_storage_name"),
            ],
        ),
    ],
)
```
The original (non-integrated) path maps sourcepage to the exact chunk within the sourcefile:
```python
"sourcepage": (
    BlobManager.blob_image_name_from_file_page(
        filename=section.content.filename(),
        page=section.split_page.page_num,
    )
    if image_embeddings
    else BlobManager.sourcepage_from_file_page(
        filename=section.content.filename(),
        page=section.split_page.page_num,
    )
),
```
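For context, here is a minimal sketch of what a helper like `BlobManager.sourcepage_from_file_page` plausibly does, based on the "file-4.pdf" example in this thread. This is an assumption for illustration, not the repo's actual implementation:

```python
import os


def sourcepage_from_file_page(filename: str, page: int = 0) -> str:
    """Sketch: build a per-page citation name like 'file-4.pdf'.

    Assumed behavior, mirroring the 'file-4.pdf' example in this
    thread: for PDFs, embed the page number before the extension;
    for non-paginated formats, return the filename unchanged.
    """
    base, ext = os.path.splitext(filename)
    if ext.lower() == ".pdf":
        return f"{base}-{page}{ext}"
    return filename


print(sourcepage_from_file_page("file.pdf", 4))  # file-4.pdf
print(sourcepage_from_file_page("notes.txt"))    # notes.txt
```

The key point is that this mapping runs per chunk at ingestion time, which is exactly what the integrated-vectorization projection above cannot do when it maps sourcepage straight from `/document/metadata_storage_name`.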
In the index, integrated vectorization leaves documents looking like this:

```json
"sourcepage": "file.pdf",
"sourcefile": null,
```

whereas the non-integrated searchmanager.py path leaves them looking like this:

```json
"sourcepage": "file-4.pdf",
"sourcefile": "file.pdf"
```
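To make the difference concrete: with the non-integrated convention, the page number can be recovered from the chunk name, while the integrated output carries no page information at all. A hypothetical parser (the function name is mine, not from the repo) for the "file-4.pdf" naming shown above:

```python
import re
from typing import Optional


def page_from_sourcepage(sourcepage: str) -> Optional[int]:
    """Hypothetical: recover the page number from a chunk name
    like 'file-4.pdf'. Returns None when no page suffix exists,
    as in the integrated-vectorization output above."""
    m = re.match(r"^.+-(?P<page>\d+)\.pdf$", sourcepage)
    return int(m.group("page")) if m else None


print(page_from_sourcepage("file-4.pdf"))  # 4
print(page_from_sourcepage("file.pdf"))    # None
```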
Hi there @pamelafox @mattgotteiner, I'm looking for a solution to this issue. Is there a way to get the chunk's page using integrated vectorization? Specifically, I'm trying to ensure that source page numbers are included in the index.
Any guidance or suggestions on how to achieve this with the integrated vectorization approach would be greatly appreciated.
Found these possible solutions, but both feel suboptimal to me.
I asked the AI Search team about this and got a few suggestions:
- Through AI Document Intelligence, using a custom skill: they would just need to modify this sample to split the docs with Document Intelligence instead of the split skill, and it can provide the page number as required: azure-search-vector-samples/demo-python/code/indexers/document-intelligence-custom-skill/document-intelligence-custom-skill.ipynb at main · Azure/azure-search-vector-samples (github.com)
- Through their own custom skill, using whatever library/method they prefer to extract the data: they could use GPT-4o as the extraction method in a custom skill and change the prompt to retrieve the page too. This sample code can be used for the extraction, adapting its outputs for a custom skill: liamca/GPT4oContentExtraction: Using Azure OpenAI GPT 4o to extract information such as text, tables and charts from Documents to Markdown (github.com)
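Both suggestions boil down to a Web API custom skill that emits per-page chunks. A rough sketch of the response shape such a skill would return, following Azure AI Search's custom skill interface (`values` / `recordId` / `data`); the extraction step itself (Document Intelligence or GPT-4o in the linked samples) is stubbed out here, and the `chunks` / `pageNumber` output names are my own choice, not fixed by the interface:

```python
import json


def build_skill_response(record_id: str, pages: list[str]) -> dict:
    """Shape a Web API custom skill response that emits chunks
    carrying page numbers, so an index projection could map a path
    like /document/chunks/*/pageNumber into sourcepage.

    `pages` stands in for text already extracted per page by
    whatever extraction method the skill uses.
    """
    return {
        "values": [
            {
                "recordId": record_id,
                "data": {
                    "chunks": [
                        {"text": text, "pageNumber": i + 1}
                        for i, text in enumerate(pages)
                    ]
                },
                "errors": [],
                "warnings": [],
            }
        ]
    }


resp = build_skill_response("0", ["page one text", "page two text"])
print(json.dumps(resp, indent=2))
```

The indexer would then use this skill's output as the `source_context` for the index projection, instead of `/document/pages/*` from the split skill.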
Are you planning on implementing it?