azure-search-openai-demo
Indexing slows down by a factor of 10 after the first few runs
This issue is for a: (mark with an x)
- [X] documentation issue or request
Minimal steps to reproduce
Create this project with a large document set made up of a mix of Office files (docx, xlsx, pptx). Our blob storage contains 7500 documents totaling 25 GB. Upload the files directly to the blob container without the script. Then run the prepdocs.ps1 script to start indexing, and create a schedule to restart indexing after the 120 min timeout.
Any log messages given by the failure
Here you can see that indexing started with a bang, with about 1000 docs for the first two 2h runs. Then it gradually slowed down to fewer than 100 docs per run. Any idea why?
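For scale, here is a back-of-the-envelope estimate of what the slowdown means for the whole backlog. It is only a sketch: it reads the figures above as roughly 1000 docs per run at the start and 100 docs per run afterwards, counts 2h per run, and ignores any pause between runs. The function names are made up for illustration.

```python
import math

TOTAL_DOCS = 7500          # documents in the blob container
RUN_HOURS = 2              # each indexer run is cut off after 120 minutes
FAST_DOCS_PER_RUN = 1000   # assumed reading of "about 1000 docs" per early run
SLOW_DOCS_PER_RUN = 100    # upper bound on the later, slower runs


def runs_needed(remaining_docs: int, docs_per_run: int) -> int:
    """Number of whole runs needed to index remaining_docs at a fixed rate."""
    return math.ceil(remaining_docs / docs_per_run)


def hours_needed(remaining_docs: int, docs_per_run: int) -> int:
    """Indexing hours needed, ignoring pauses between runs."""
    return runs_needed(remaining_docs, docs_per_run) * RUN_HOURS


# Documents left after the two fast runs.
remaining = TOTAL_DOCS - 2 * FAST_DOCS_PER_RUN

print("remaining docs:", remaining)
print("runs at the initial rate:", runs_needed(remaining, FAST_DOCS_PER_RUN))
print("runs at the slow rate:", runs_needed(remaining, SLOW_DOCS_PER_RUN))
print("hours at the initial rate:", hours_needed(remaining, FAST_DOCS_PER_RUN))
print("hours at the slow rate:", hours_needed(remaining, SLOW_DOCS_PER_RUN))
```

Under these assumptions the remaining 5500 docs would take 6 runs (12 hours) at the initial rate, but at least 55 runs (110 hours) at the slow rate.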
Expected/desired behavior
Constant indexing speed.
OS and Version?
Windows 11
azd version?
azd version 1.6.1 (commit eba2c978b5443fdb002c95add4011d9e63c2e76f)
Hm, I assume you've read through https://learn.microsoft.com/en-us/azure/search/search-limits-quotas-capacity#indexer-limits since you mention doing 120-minute runs. Do you have any pause between the 120-minute runs?
Yes, I have a 10 min pause. The scheduler runs every 130 minutes.
Now that I think about it, perhaps I did not select the proper options in the scheduler settings? I'm not quite sure why I even had to select anything. Is there such a thing as "run with the same options as when using the prepdocs script"?
This is with integrated vectorization, I assume? The prepdocs script sets up the indexer with certain properties, but it doesn't set up a schedule. Perhaps a pause of 10 minutes is not sufficient. @srbalakr Can you advise?
With integrated vectorization, correct. So the manual schedule settings are likely different from the settings that are used with prepdocs? Perhaps it's safer for now to set up a manual process on my PC to call prepdocs repeatedly, every 150 min or so?
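A minimal sketch of what such a manual process could look like, written in Python since that is the project's language. The script path `./scripts/prepdocs.ps1`, the use of `pwsh`, and the helper names are assumptions for illustration, not something confirmed in this thread. The interval is measured from the start of one call to the start of the next.

```python
import subprocess
import time
from typing import Callable, List

# Assumed location of the script and assumed PowerShell executable.
PREPDOCS_COMMAND = ["pwsh", "-File", "./scripts/prepdocs.ps1"]


def run_prepdocs() -> int:
    """Run prepdocs once and return its exit code (hypothetical helper)."""
    return subprocess.run(PREPDOCS_COMMAND, check=False).returncode


def run_repeatedly(
    run_once: Callable[[], int],
    interval_minutes: float = 150,
    max_runs: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Call run_once max_runs times, starting one call every interval_minutes."""
    exit_codes: List[int] = []
    for attempt in range(max_runs):
        started = time.monotonic()
        exit_codes.append(run_once())
        if attempt < max_runs - 1:
            # Sleep only for what is left of the interval after this call.
            elapsed = time.monotonic() - started
            sleep(max(0.0, interval_minutes * 60 - elapsed))
    return exit_codes


# Example (not executed here): run_repeatedly(run_prepdocs, interval_minutes=150, max_runs=10)
```

Passing the runner and the sleep function as parameters keeps the loop easy to dry-run without actually invoking the script or waiting for hours.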
Would it be simpler to add an optional feature to prepdocs to enable the scheduler? I might look into that.
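As a rough idea of what such an option would need to produce: to my understanding, an Azure AI Search indexer definition accepts a `schedule` object whose `interval` is an ISO 8601 duration, with a documented minimum of 5 minutes and maximum of one day. The sketch below only builds that fragment; the function names are hypothetical, and how prepdocs would actually wire it into the indexer is left open.

```python
import json
from datetime import timedelta

# Limits as I understand them from the indexer schedule documentation (assumption).
MIN_INTERVAL = timedelta(minutes=5)
MAX_INTERVAL = timedelta(days=1)


def iso8601_duration(delta: timedelta) -> str:
    """Format a whole-minute timedelta as an ISO 8601 duration such as PT2H10M."""
    total_minutes, seconds = divmod(int(delta.total_seconds()), 60)
    if seconds or delta.microseconds:
        raise ValueError("interval must be a whole number of minutes")
    if not MIN_INTERVAL <= delta <= MAX_INTERVAL:
        raise ValueError("interval must be between 5 minutes and 1 day")
    hours, minutes = divmod(total_minutes, 60)
    text = "PT"
    if hours:
        text += f"{hours}H"
    if minutes:
        text += f"{minutes}M"
    return text


def schedule_fragment(interval: timedelta) -> dict:
    """Build the schedule part of an indexer definition (hypothetical helper)."""
    return {"schedule": {"interval": iso8601_duration(interval)}}


# The 130-minute schedule discussed above: a 120-minute run plus a 10-minute pause.
print(json.dumps(schedule_fragment(timedelta(minutes=130))))
```

For the 130-minute schedule this yields an interval of `PT2H10M`, and a 150-minute one would be `PT2H30M`.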
So I tried creating a new index from scratch to avoid potential issues with the portal config for the scheduler. Instead, I decided to create a PowerShell script that calls prepdocs.ps1 every 150 minutes. This time I also increased the limit of the embedding model to 240K TPM. The first indexing run did go much faster, processing 3700 documents in 2 hours. But all the subsequent runs failed. There is no error on the command line from the script execution, but there are some errors in the portal. I'm attaching a few screenshots. Any idea what is going on?
First failed run (the other ones have the same messages)
I have neither the error nor the warning in the first successful run.
After looking at the warning again, I'm wondering if the fact that I have enabled "hierarchical namespace" on the storage account could be the issue? I had to create the storage account manually because a security policy doesn't allow public network access on storage, which made the Bicep provisioning fail.