azure-search-openai-demo
How to use an indexer to create an index that also creates sections of an individual PDF page using Azure Blob Storage
Please provide us with the following information:
This issue is for a: (mark with an x)
- [ ] bug report -> please search issues before submitting
- [ X ] feature request
- [ X ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
Use an indexer to create/update the index instead of pushing content via REST APIs. We need a document reference to create chunks using the indexer.
OS and Version?
Windows 11
azd version?
azd version 1.1.0.
Thanks! We'll be in touch soon.
Does anyone have any idea whether an Azure indexer can be used to push data from Blob Storage to a Cognitive Search index?
I tried creating an indexer mapped to Azure Blob Storage and was able to transfer data (content) to the index. However, I don't know how to create multiple sections of a blob (a single page of a PDF file) and attach this chunking process to the indexer, so the indexer does all the heavy lifting: whenever a blob changes, it automatically creates chunks of that blob and updates them in the index.
I can create multiple sections (~1,000 characters each) of a page and store those sections in the index via the REST API, but I want to do it using the indexer, so please let me know if you have any ideas on this.
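For context, the manual approach described above (sections of ~1,000 characters each, pushed via the REST API) can be sketched as a simple splitter. This is a minimal, self-contained sketch with hypothetical names; the demo's actual chunking lives in prepdocs.py and is more sophisticated (sentence-aware splitting, overlap tuning):

```python
def split_page_into_sections(page_text: str, max_chars: int = 1000, overlap: int = 100):
    """Split one PDF page's text into sections of at most max_chars,
    with a small overlap between consecutive sections."""
    sections = []
    start = 0
    while start < len(page_text):
        end = min(start + max_chars, len(page_text))
        sections.append(page_text[start:end])
        if end == len(page_text):
            break
        start = end - overlap  # back up so adjacent sections share context
    return sections

# Each section then becomes one document pushed to the search index, e.g.:
# documents = [{"id": f"{blob_name}-{i}", "content": s}
#              for i, s in enumerate(sections)]
```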
@pamelafox, do you have any suggestions on this?
@sandeeppatidar30 at the moment, indexers don't support chunking. An indexer can only go from one input document to one entry in your Cognitive Search index. To overcome this current limitation, you could use one indexer that chunks your data and stores the chunks in blob storage via the knowledge store capability. A second indexer then indexes this new blob storage to index your chunks.
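A hedged sketch of what the first indexer's skillset might look like in the REST API. The skill type `#Microsoft.Skills.Text.SplitSkill` is real; the skillset name, connection string, container name, and source paths below are placeholders you would adapt, and the knowledge store projection shape should be verified against the current service documentation:

```json
{
  "name": "chunking-skillset",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "textSplitMode": "pages",
      "maximumPageLength": 1000,
      "inputs": [ { "name": "text", "source": "/document/content" } ],
      "outputs": [ { "name": "textItems", "targetName": "chunks" } ]
    }
  ],
  "knowledgeStore": {
    "storageConnectionString": "<storage-connection-string>",
    "projections": [
      {
        "objects": [
          { "storageContainer": "chunked-docs", "source": "/document/chunks/*" }
        ]
      }
    ]
  }
}
```

The second indexer would then point at the `chunked-docs` container and map each projected object to one entry in the search index.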
See https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb.
@iMicknl - Conceptually, does MS consider document chunking (and experimentation with chunk sizes) a feature that needs to be added to Cognitive Search? Or is chunking as a whole just a workaround for the current LLM context-window limit, an issue that won't need solving once much larger context limits are available? Any speculation on the feature? @pamelafox. And thanks for your contributions, BTW.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.
Hi @sandeeppatidar30
I know it's a late response. I believe what you're looking for can be achieved through integrated vectorization. A (relatively) recent PR has incorporated this feature into the demo. See here.
My understanding is that it handles extraction, chunking, embedding, and indexing of your data in blob storage, i.e., from storage to index seamlessly.
GPT Vision is still not supported with this approach, as far as I know.
Thanks for opening this. Since this was filed, the sample added Integrated Vectorization, which enables indexing directly from Azure Blob Storage with built‑in extraction, chunking, embedding, and indexing (no manual prepdocs). Please see the Integrated Vectorization section in docs/data_ingestion.md and setup guidance in docs/deploy_features.md. Azure AI Search has since added a built-in skill with multimodal support, but that isn't yet used by our integrated vectorization setup.
To support all of the features of this repository fully, we are going to make an Azure Function that runs prepdocs.py and set that up as a custom skill for an indexer. Stay tuned to the repo to find out when that will be available.
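If you want to try integrated vectorization in the demo today, the deployment docs describe enabling it via an azd environment flag before provisioning. Flag name as documented in docs/deploy_features.md at the time of writing; verify against the current docs:

```
azd env set USE_FEATURE_INT_VECTORIZATION true
azd up
```

With this flag set, the deployment provisions the indexer, skillset, and data source so that chunking and embedding happen inside Azure AI Search rather than in prepdocs.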