azure-search-openai-demo

Indexing slows down by a factor of 10 after the first few runs

Open DuboisABB opened this issue 11 months ago • 6 comments

This issue is for a: (mark with an x)

- [X] documentation issue or request

Minimal steps to reproduce

Create this project with a large document database of a mix of Office files (docx, xlsx, pptx). Our blob storage contains 7500 documents and is 25 GB. Upload files directly to the blob container without the script. Then run the prepdocs.ps1 script to start indexing and create a schedule to restart indexing after the 120 min timeout.

Any log messages given by the failure

Here you can see that indexing started with a bang, with about 1000 docs for the first two 2h runs. Then it gradually slowed down to fewer than 100 docs per run. Any idea why? (screenshot: slow_indexing)

Expected/desired behavior

Constant indexing speed.

OS and Version?

Windows 11

azd version?

azd version 1.6.1 (commit eba2c978b5443fdb002c95add4011d9e63c2e76f)

DuboisABB avatar Mar 07 '24 13:03 DuboisABB

Hm, I assume you've read through https://learn.microsoft.com/en-us/azure/search/search-limits-quotas-capacity#indexer-limits since you mention doing 120-minute runs. Do you have any pause between the 120-minute runs?

pamelafox avatar Mar 07 '24 17:03 pamelafox

Yes, I have a 10 min pause. The scheduler runs every 130 minutes.

DuboisABB avatar Mar 07 '24 18:03 DuboisABB
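
For a sense of scale, here is a rough Python sketch of what the reported slowdown means for total indexing time. It uses only the figures quoted in this thread (7500 documents, about 1000 docs per run for the first two runs, at most 100 docs per run afterwards, one run every 130 minutes). The function name and the simplification that each run processes a fixed number of documents are assumptions made for illustration, so treat the result as an order-of-magnitude estimate.

```python
import math


def estimate_total_hours(total_docs, fast_runs, fast_rate, slow_rate, interval_min):
    """Estimate how many scheduled runs (and hours) a full indexing pass takes.

    Assumption: the first `fast_runs` runs each process `fast_rate` docs and
    every later run processes `slow_rate` docs, with one run started every
    `interval_min` minutes.
    """
    done_fast = min(total_docs, fast_runs * fast_rate)
    fast_used = math.ceil(done_fast / fast_rate)
    remaining = total_docs - done_fast
    slow_runs = math.ceil(remaining / slow_rate)
    total_runs = fast_used + slow_runs
    return total_runs, total_runs * interval_min / 60


# Observed behaviour: two fast runs, then roughly 100 docs per run.
print(estimate_total_hours(7500, 2, 1000, 100, 130))
# Expected behaviour: a constant 1000 docs per run.
print(estimate_total_hours(7500, 8, 1000, 100, 130))
```

Under these assumptions the slowed-down schedule needs 57 runs (about 123.5 hours), whereas a constant rate of 1000 docs per run would need 8 runs (a little over 17 hours).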

Now that I think about it, perhaps I did not select the proper options in the scheduler settings? I'm not quite sure why I even had to select anything. Is there such a thing as "run with the same options as when using the prepdocs script"? (screenshot)

DuboisABB avatar Mar 07 '24 18:03 DuboisABB

This is with integrated vectorization, I assume? The prepdocs script sets up the indexer with certain properties, but it doesn't set up a schedule. Perhaps a pause of 10 minutes is not sufficient. @srbalakr Can you advise?

pamelafox avatar Mar 07 '24 19:03 pamelafox
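
As a point of reference for the schedule discussion, an Azure AI Search indexer definition can carry a `schedule` object whose `interval` is an ISO 8601 duration. The Python sketch below only builds that fragment for a given number of minutes; it does not call any Azure API. The helper names are hypothetical, and the 5-minute and 24-hour bounds are the limits as I understand them from the service documentation, so verify them before relying on this.

```python
def minutes_to_iso8601(minutes):
    """Convert whole minutes to an ISO 8601 duration such as 'PT2H10M'.

    Assumption: the indexer schedule accepts intervals between 5 minutes
    and 1 day (1440 minutes).
    """
    if not 5 <= minutes <= 1440:
        raise ValueError("interval must be between 5 and 1440 minutes")
    days, rem = divmod(minutes, 1440)
    hours, mins = divmod(rem, 60)
    out = "P"
    if days:
        out += f"{days}D"
    if hours or mins:
        out += "T"
        if hours:
            out += f"{hours}H"
        if mins:
            out += f"{mins}M"
    return out


def schedule_fragment(minutes):
    """Build the `schedule` part of an indexer definition (hypothetical helper)."""
    return {"schedule": {"interval": minutes_to_iso8601(minutes)}}


# The 130-minute cadence mentioned in this thread.
print(schedule_fragment(130))
```

Setting the schedule in the indexer definition itself, rather than through separate portal options, would keep the other indexer properties that prepdocs configures untouched.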

With integrated vectorization, correct. So the manual schedule settings are likely different from the settings used by prepdocs? Perhaps it's safer for now to set up a manual process on my PC that calls prepdocs repeatedly, every 150 min or so?

Would it be simpler to add an optional feature to prepdocs to enable the scheduler? I might look into that.

DuboisABB avatar Mar 07 '24 19:03 DuboisABB
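
The "manual process" idea above could look roughly like the following Python sketch, which runs a command repeatedly with a fixed pause and records each exit code. Everything specific here is an assumption for illustration: the function name, the `pwsh` executable, the script path, and the choice to measure the pause from the end of one call to the start of the next.

```python
import subprocess
import time


def run_repeatedly(command, interval_seconds, max_runs,
                   runner=subprocess.run, sleep=time.sleep):
    """Run `command` up to `max_runs` times, pausing between runs.

    The pause is measured from the end of one call to the start of the next.
    `runner` and `sleep` are injectable so the loop can be tested without
    launching a process or actually waiting.
    """
    exit_codes = []
    for i in range(max_runs):
        result = runner(command)
        exit_codes.append(result.returncode)
        if i < max_runs - 1:
            sleep(interval_seconds)
    return exit_codes


# Hypothetical usage (assumes PowerShell 7 is on PATH as `pwsh` and that the
# script lives at ./scripts/prepdocs.ps1 relative to the working directory):
# run_repeatedly(["pwsh", "-File", "./scripts/prepdocs.ps1"], 150 * 60, max_runs=10)
```

Note that an exit code of 0 only says the script itself finished. Failures inside the indexer run still have to be checked in the portal.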

So I tried creating a new index from scratch to avoid potential issues with the portal configuration of the scheduler. Instead, I created a PowerShell script that calls prepdocs.ps1 every 150 minutes. This time I also increased the limit of the embedding model to 240K TPM. The first indexing run indeed went much faster, processing 3700 documents in 2 hours, but all subsequent runs failed. There is no error on the command line from the script execution, but there are some errors in the portal. I'm attaching a few screenshots. Any idea what is going on? (screenshot) First failed run (the other ones have the same messages): (screenshot) (screenshot)

I have neither the error nor the warning in the first successful run.

After looking at the warning again, I'm wondering whether the fact that I enabled "hierarchical namespace" on the storage account could be the issue. I had to create the storage manually because of a security policy that doesn't allow public network access on storage, which caused the Bicep provisioning to fail.

DuboisABB avatar Mar 08 '24 20:03 DuboisABB