azure-search-openai-demo

Issue while trying to enable integrated vectorization.

Open shivam10u opened this issue 10 months ago • 5 comments

Hi @pamelafox, I have been trying to use integrated vectorization, but after deployment only the search index gets created, even after running "azd env set USE_FEATURE_INT_VECTORIZATION true". Please help. I can see that the code supports it, but I still hit this issue.

PFA: [two images attached]

My aim is very simple -

  1. Upload files directly to the blob container, with no need to run prepdocs.py.
  2. Support documents in multiple formats.

shivam10u avatar Apr 08 '24 09:04 shivam10u

It says that the index has 2,161 documents in it, so it did index something. Or was that from running prepdocs.py before? You should see logs from prepdocs.py that describe the process of setting up the integrated vectorization, please share those as well.
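As a rough way to check whether the integrated vectorization pipeline was set up at all, the sketch below looks for the three Azure AI Search objects that integrated vectorization generally relies on besides the index: a data source connection, a skillset, and an indexer. It assumes the `azure-search-documents` Python SDK and its `SearchIndexerClient`; the helper name `missing_pipeline_parts` is hypothetical and is not part of this repo.

```python
from typing import List


def missing_pipeline_parts(client) -> List[str]:
    """Return the kinds of pipeline objects that the search service lacks.

    `client` is expected to behave like
    azure.search.documents.indexes.SearchIndexerClient (an assumption of this
    sketch); any object exposing the same three methods works.
    """
    found = {
        "data source": list(client.get_data_source_connection_names()),
        "skillset": list(client.get_skillset_names()),
        "indexer": list(client.get_indexer_names()),
    }
    # Keep only the kinds for which the service returned no names.
    return [kind for kind, names in found.items() if not names]


# Usage against a real service (placeholders must be filled in):
# from azure.core.credentials import AzureKeyCredential
# from azure.search.documents.indexes import SearchIndexerClient
# client = SearchIndexerClient(
#     "https://<service-name>.search.windows.net",
#     AzureKeyCredential("<admin-key>"),
# )
# print(missing_pipeline_parts(client))
```

If all three kinds come back as missing, the setup step most likely never ran with the flag enabled, which the prepdocs.py logs should confirm.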

pamelafox avatar Apr 11 '24 12:04 pamelafox

@pamelafox - I have a question on this topic. I was not sure whether to raise a separate issue for something I only have questions about, so I am using this open thread. Please suggest if this is not

So I ran for a few days with all these options enabled, until I discovered the Integrated Vectorization option.

So I followed the documentation and enabled it. What difference in result quality can I expect with Integrated Vectorization enabled, compared with leaving it off and using the options below?

image

chetan2309 avatar Apr 11 '24 19:04 chetan2309

There are some differences between local prepdocs ingestion and integrated vectorization, specifically:

  • Azure AI Search doesn't use Document Intelligence for cracking. (It may use similar technology behind the scenes, but it may also differ.)
  • Azure AI Search may have a slightly different text splitting algorithm. It currently doesn't take tokens into account; it just splits based on character count and sentence boundaries. It should be functionally the same for English text, but I wouldn't recommend it for CJK languages at this time.
  • Azure AI Search doesn't currently record the page number, according to issues filed here.
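To make the splitting difference concrete, here is a minimal sketch of a splitter that budgets chunks by character count and breaks only at sentence boundaries. It is an illustration only, not the actual Azure AI Search or prepdocs.py algorithm; the function name `split_by_chars` and the `max_chars` parameter are hypothetical.

```python
import re
from typing import List


def split_by_chars(text: str, max_chars: int = 200) -> List[str]:
    """Group sentences into chunks of at most max_chars characters.

    Size is measured with len(), i.e. in characters, not tokens. A single
    sentence longer than max_chars is kept whole as an oversized chunk.
    """
    # A sentence ends at Latin or CJK terminal punctuation (or end of text);
    # trailing whitespace stays attached so the original text is preserved.
    sentences = re.findall(r".+?(?:[.!?。！？]+\s*|$)", text, flags=re.S)
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks


print(split_by_chars("One two. Three four. Five six.", max_chars=25))
# ['One two. Three four.', 'Five six.']
```

Because the budget is in characters, the same `max_chars` can correspond to a noticeably different number of tokens for CJK text than for English, which is one plausible reason the two ingestion paths could behave differently there. A token-aware splitter would measure each candidate chunk with the embedding model's tokenizer instead of `len()`.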

If you do see lower quality due to the cracking or splitting algorithm, please write up your findings so that the search team may make improvements as necessary. Thanks!

pamelafox avatar Apr 11 '24 20:04 pamelafox

Hi @pamelafox, I currently have integrated vectorization enabled in my code and it is running fine, but as you mentioned, I am not able to see the page number in the index. Are you planning to implement it in the future by any chance? Also, how is this integrated vectorization pipeline approach better than the previous approach?

dchandu320 avatar May 15 '24 13:05 dchandu320

That feature would need to be implemented in the Azure AI Search internal code, not in this repo itself. The Azure AI Search team does not have a public ETA for the feature, but is aware of the need for it.

pamelafox avatar May 15 '24 23:05 pamelafox