
429 too many requests

Open jmandivarapu opened this issue 1 year ago • 3 comments

Hi Team,

I currently have the Phi-3.5 model deployed both as a serverless endpoint and on custom compute. But even when I send slightly more requests, it throws a 429 error. When I contacted support, they informed me that I need to change max_concurrent_requests: https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-deployment-managed-online?view=azureml-api-2#requestsettings

But unfortunately I can't find an option to edit that config anywhere, since the model is hosted by MSFT. Any help would be greatly appreciated.

jmandivarapu avatar Dec 12 '24 17:12 jmandivarapu

Hi, have you deployed the model via Azure AI Foundry (https://ai.azure.com) or Azure ML Studio (https://ml.azure.com)?

It sounds like you're running Azure Machine Learning endpoints. The max_concurrent_requests setting is indeed crucial for handling higher request loads. Here are a few steps you can take:

Check Quotas and Limits: Ensure that your subscription has the necessary quotas and limits for the number of concurrent requests. You can view and request quota increases through the Azure portal.

Azure Support Case: Since the model is hosted by Microsoft, you might need to open a support case with Azure to request an increase in the max_concurrent_requests setting. This is often the recommended approach when you can't directly modify the configuration. https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-quotas?view=azureml-api-2

The Azure documentation on endpoint quota increases provides additional guidance: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-quotas?view=azureml-api-2#endpoint-quota-increases
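For reference, when you do control the deployment (as in your custom-compute case), the concurrency setting lives in the managed online deployment YAML described in the reference you linked. A minimal sketch, with the endpoint, model, and instance names as placeholders:

```yaml
# deployment.yml — managed online deployment (names are placeholders)
name: phi35-deployment
endpoint_name: my-endpoint
model: azureml:my-phi35-model:1
instance_type: Standard_NC24ads_A100_v4
instance_count: 1
request_settings:
  max_concurrent_requests_per_instance: 4  # default is 1
  request_timeout_ms: 90000
```

Requests beyond `max_concurrent_requests_per_instance` × `instance_count` are what come back as 429s, so raising it (or adding instances) is the lever here.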

leestott avatar Dec 12 '24 20:12 leestott

Hi @leestott I made the deployment through both the Azure ML Studio and Azure AI Foundry.

Case 1: In Azure AI Foundry / Azure AI Studio I can only deploy the serverless version of the Phi family. Under Customization I only see Model Version and Resource Location, with no option for max_concurrent_requests. I am facing the same 429 issue with the serverless endpoints.

Case 2: Yes, for the model deployed using AML I did raise a quota increase request. But unfortunately, after long delays, the AML quota team still wanted me to explore the max_concurrent_requests option.

I am not sure how to proceed from here. The only (painful) option is to host the model in a custom container by downloading the weights. That would mean creating an environment and a score file, alongside a deployment where I can specify parameters like max_concurrent_requests. But the big downside of this approach is that whenever Microsoft releases a new version of the weights, I need to redeploy.
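For reference, the score-file route follows AML's `init`/`run` contract. A minimal sketch with model loading elided (the concurrency setting itself goes in the deployment YAML, not the score file):

```python
# score.py — sketch of a scoring script for a custom AML deployment
# (assumption: Phi weights are registered as an AML model; inference is
# replaced with an echo so only the contract is shown)
import json
import os

model = None


def init():
    """Called once when the container starts: load the model here."""
    global model
    # AZUREML_MODEL_DIR is set by AML to the registered model's path
    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    model = model_dir  # placeholder: load the Phi-3.5 weights from model_dir


def run(raw_data):
    """Called once per request; raw_data is the JSON request body as a string."""
    payload = json.loads(raw_data)
    prompt = payload.get("prompt", "")
    # placeholder: a real deployment would run inference with `model` here
    return json.dumps({"echo": prompt})
```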

Any assistance, suggestion, or feedback is greatly appreciated.

jmandivarapu avatar Dec 13 '24 16:12 jmandivarapu

@jmandivarapu have you explored the autoscale feature? Azure AI Foundry offers an autoscale feature that can automatically adjust rate limits based on real-time usage metrics. This might help alleviate the 429 errors by increasing the rate limit dynamically. You can enable it through the Azure portal. See https://learn.microsoft.com/en-us/azure/ai-studio/how-to/autoscale

Re Custom Container Deployment: While redeploying the model every time there's a new version of the weights is cumbersome, it does give you control over parameters like max_concurrent_requests. If you decide to go this route, consider automating the deployment process using CI/CD pipelines to reduce the manual effort involved.
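Whichever route you take, a client-side retry with exponential backoff on 429 responses is a common interim mitigation while server-side limits are fixed. A minimal sketch, where the `send` callable and its `(status_code, body)` return shape are assumptions rather than any Azure API:

```python
import random
import time


def call_with_retries(send, max_retries=5, base_delay=1.0):
    """Call send() (which returns (status_code, body)) and retry on
    HTTP 429 with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break  # out of retries; fall through with the final 429
        # wait base_delay * 2^attempt, plus jitter proportional to base_delay
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return status, body
```

If the endpoint returns a `Retry-After` header, honoring it instead of the computed delay is usually preferable.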

leestott avatar Jan 23 '25 09:01 leestott