
Embedded Images Treatment

Open • levalencia opened this issue 2 years ago • 1 comment

I was able to set up and run the Azure OpenAI text embedding function. I also used the ingestion sample to create the indexes, indexers, custom skillsets and data sources.

However, my PDF documents might contain embedded images, so I was wondering what happens in those cases.

My discoveries:

  1. Every page of each PDF is rendered as an image and stored in the knowledge store. [screenshot]

  2. When I check the content field on the index, I see many references to jpg files.

  3. I see a warning on all documents for images:

Can you please explain why this is happening and how to fix it?

[screenshot of the warning]

  4. I also see warnings on some of the file chunks:

Can you please explain why this is happening and how to fix it?

[screenshot of the warning]

levalencia · Aug 08 '23 13:08

Have you configured your indexer to generate normalized images per page? In case it's useful, here's a pointer: https://learn.microsoft.com/en-us/azure/search/cognitive-search-concept-image-scenarios
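For reference, here is a rough sketch of where that setting lives in an indexer definition, written as a Python dict that mirrors the REST payload. The resource names (`my-indexer`, `my-datasource`, `my-index`, `my-skillset`) and the two helper functions are made up for illustration. As far as I understand the linked page, `imageAction` accepts `none`, `generateNormalizedImages` and `generateNormalizedImagePerPage`, and the per-page option renders each PDF page as an image instead of extracting the embedded images, which would match the behavior described in point 1.

```python
# Sketch of an indexer definition; all resource names are hypothetical.
indexer_definition = {
    "name": "my-indexer",
    "dataSourceName": "my-datasource",
    "targetIndexName": "my-index",
    "skillsetName": "my-skillset",
    "parameters": {
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            # "generateNormalizedImages" extracts embedded images;
            # "generateNormalizedImagePerPage" renders each PDF page instead.
            "imageAction": "generateNormalizedImages",
        }
    },
}


def image_action(indexer: dict) -> str:
    """Return the configured imageAction (hypothetical helper).

    Falls back to "none" when the setting is absent, which is assumed
    here to be the service default.
    """
    configuration = indexer.get("parameters", {}).get("configuration", {})
    return configuration.get("imageAction", "none")


def renders_pdf_pages_as_images(indexer: dict) -> bool:
    """True when the indexer is set to render every PDF page as an image."""
    return image_action(indexer) == "generateNormalizedImagePerPage"
```

If `renders_pdf_pages_as_images` returns `True` for your current definition, switching to `generateNormalizedImages` may be what you want when only the embedded images matter.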

You can try running a debug session in the portal (https://learn.microsoft.com/en-us/azure/search/cognitive-search-how-to-debug-skillset) to see the output produced and make sure it actually conforms to the expected array-type output.
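If it helps to check this outside the portal, here is a minimal sketch that inspects a custom skill response for shape problems. It assumes the standard custom skill envelope (`values`, each record carrying `recordId` and `data`). The output field name `embedding` is a guess, since the actual warning text is only in the screenshots, and `find_shape_problems` is a hypothetical helper, not part of any SDK.

```python
from numbers import Real
from typing import List


def find_shape_problems(response: dict, field: str = "embedding") -> List[str]:
    """List the ways a custom skill response deviates from the expected shape.

    `field` is the output expected to be an array of numbers; the default
    name "embedding" is an assumption for illustration.
    """
    values = response.get("values")
    if not isinstance(values, list):
        return ["top-level 'values' must be an array"]

    problems = []
    for index, record in enumerate(values):
        if not isinstance(record, dict):
            problems.append(f"record {index}: must be an object")
            continue
        record_id = record.get("recordId", f"record {index}")
        data = record.get("data")
        if not isinstance(data, dict):
            problems.append(f"{record_id}: 'data' must be an object")
            continue
        output = data.get(field)
        if not isinstance(output, list):
            problems.append(f"{record_id}: '{field}' must be an array")
        elif not all(
            isinstance(item, Real) and not isinstance(item, bool)
            for item in output
        ):
            problems.append(f"{record_id}: '{field}' must contain only numbers")
    return problems
```

For example, a record whose `embedding` comes back as a string rather than an array would be reported, which is the kind of mismatch a debug session should also surface.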

arv100kri · Aug 16 '23 17:08