azure-search-vector-samples
Embedded Images Treatment
I was able to set up and run the Azure OpenAI text embedding function. I also used the ingestion sample to create indexes, indexers, custom skillsets, and data sources.
However, my PDF documents might have embedded images, so I was wondering what happens in those cases.
My discoveries:
- Every page of each PDF is generated as an image and stored in the knowledge store.
- When I check the content field on the index, I see many references to jpg files.
- I see a warning on all documents for images:
Can you please explain why this is happening and how to fix it?
- I also have warnings in some of the file chunks:
Can you please explain why this is happening and how to fix it?
Have you configured your indexer to generate normalized images per page? Wanted to drop a pointer for this: https://learn.microsoft.com/en-us/azure/search/cognitive-search-concept-image-scenarios if it's useful...
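For reference, here is a minimal sketch of the indexer setting being asked about, written as a Python dict that serializes to the indexer's JSON definition. `imageAction` and `dataToExtract` are the documented indexer configuration parameters; the indexer, data source, index, and skillset names are placeholders, not values from this thread.

```python
import json


def build_indexer_definition(name, data_source, index, skillset):
    """Return an indexer definition that renders one normalized image per PDF page."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "parameters": {
            "configuration": {
                # Extract both text content and metadata from each blob.
                "dataToExtract": "contentAndMetadata",
                # For PDFs, render each page as one image instead of
                # extracting the embedded images individually.
                "imageAction": "generateNormalizedImagePerPage",
            }
        },
    }


# Placeholder names, for illustration only.
indexer = build_indexer_definition(
    "my-indexer", "my-datasource", "my-index", "my-skillset"
)
print(json.dumps(indexer, indent=2))
```

If `imageAction` is set to `generateNormalizedImagePerPage`, that would explain why every PDF page shows up as an image. Setting it to `none` turns image extraction off, and `generateNormalizedImages` extracts the embedded images instead of rendering whole pages.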
You can try running a debug session in the portal (https://learn.microsoft.com/en-us/azure/search/cognitive-search-how-to-debug-skillset) to see the output produced and make sure it actually conforms to the expected array type.
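If the warnings come from a custom Web API skill, the same check can be done offline on a captured response. The sketch below assumes the documented custom skill response shape (a top-level `values` array whose records carry `recordId` and `data`) and assumes the skill is supposed to return an array; the output field name `embedding` is hypothetical and should be replaced with the name your skillset actually maps.

```python
def check_custom_skill_response(response, output_field):
    """Return a list of problems found in a custom Web API skill response."""
    values = response.get("values") if isinstance(response, dict) else None
    if not isinstance(values, list):
        return ["top-level 'values' must be an array"]

    problems = []
    for i, record in enumerate(values):
        if not isinstance(record, dict):
            problems.append(f"values[{i}] must be an object")
            continue
        if not isinstance(record.get("recordId"), str):
            problems.append(f"values[{i}] is missing a string 'recordId'")
        data = record.get("data")
        if not isinstance(data, dict):
            problems.append(f"values[{i}].data must be an object")
        elif not isinstance(data.get(output_field), list):
            # The skill output is expected to be an array, not a string or scalar.
            problems.append(f"values[{i}].data.{output_field} must be an array")
    return problems


# "embedding" is a hypothetical output name used for illustration.
good = {"values": [{"recordId": "0", "data": {"embedding": [0.1, 0.2]}}]}
bad = {"values": [{"recordId": "0", "data": {"embedding": "0.1,0.2"}}]}

print(check_custom_skill_response(good, "embedding"))
print(check_custom_skill_response(bad, "embedding"))
```

An empty list means the response has the expected shape; any messages point at the record and field to inspect in the debug session.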