
429 Too Many Requests when using Azure as remote state backend

Open paulblum00 opened this issue 9 months ago • 13 comments

Community note

[!TIP] 👋 Hi there, OpenTofu community! The OpenTofu team prioritizes issues based on upvotes. Please make sure to upvote this issue and describe how it affects you in detail in the comments to show your support.

OpenTofu Version

v1.10.3 on darwin_arm64

OpenTofu Configuration Files

terraform {
  backend "azurerm" {
    container_name       = "DEDUCTED"
    key                  = "*path relative to include*/terraform.tfstate"
    resource_group_name  = "DEDUCTED"
    storage_account_name = "DEDUCTED"
    subscription_id      = "DEDUCTED"
  }
}

Debug Output

  ╷
  │ Error: Failed to get existing workspaces: Error retrieving keys for Storage Account "REDACTED": storage.AccountsClient#ListKeys: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="TooManyRequests" Message="The request is being throttled as the limit has been reached for operation type - Read_ObservationWindow_00:05:00. For more information, see - https://aka.ms/srpthrottlinglimits"
  │ 
  │ 
  ╵

Expected Behavior

Tofu should make requests to Azure in a way that doesn't cause the execution to fail with 429 responses.

Actual Behavior

Tofu returns an error. This doesn't happen when using Terraform instead of Tofu with the exact same configuration.

Steps to Reproduce

It happens when using a big infrastructure in Terragrunt with an Azure storage account as the remote state backend. It seems that the number of requests made to Azure to fetch the .tfstate for every Tofu module leads to the 429. Terraform apparently has some form of protection against this that Tofu is missing.

Additional Context

We're using Terragrunt and configured ${path_relative_to_include()}/terraform.tfstate as the key for the remote state. This could be relevant, as it leads to Tofu requesting a different file for every execution. However, it might also be unrelated, because Tofu would probably make the same number of requests per instance even if Terragrunt always configured the same remote config.

It should also be mentioned that this doesn't happen all the time, but often enough to cause a lot of trouble. I tried running terragrunt init --all on an infrastructure with ~400-500 modules and it happens on maybe 30% of tries. In those cases, however, it completely derails the execution, sometimes causing almost all units to fail.

References

No response

paulblum00 avatar Jul 21 '25 07:07 paulblum00

Hi @paulblum00, thanks for your report!

Indeed, we introduced a change in how we fetch the blobs from Azure that are used to populate the backends.

This PR was created to solve this issue. So it seems you're getting rate limited by Azure at some point, and their rate limit is not that high: https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling

What's strange to me, though, is that we're querying by the workspace prefix: https://github.com/opentofu/opentofu/blob/329a6a6adb3efc03ac116e2e02c99c42872c9f93/internal/backend/remote-state/azure/backend_state.go#L169-L205

So we should be doing only one query if you have only 400-500 workspaces. I'm going to read more about Azure and see what would be a better suggestion here.
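For reference, here's a heavily hedged sketch of what a prefix-filtered, paginated blob listing looks like with the newer azblob SDK. The linked OpenTofu code uses a different Azure client, so treat this purely as an illustration of the one-request-per-page-of-workspaces idea; the account name, container name, and prefix below are placeholders.

```go
// Illustrative only: list blobs under a prefix, one request per page.
// Account, container, and prefix values are hypothetical placeholders.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob"
)

func main() {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}

	client, err := azblob.NewClient("https://EXAMPLE.blob.core.windows.net/", cred, nil)
	if err != nil {
		log.Fatal(err)
	}

	prefix := "env:/" // hypothetical workspace key prefix
	pager := client.NewListBlobsFlatPager("tfstate", &azblob.ListBlobsFlatOptions{Prefix: &prefix})

	page := 0
	for pager.More() {
		resp, err := pager.NextPage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("[TRACE] Getting page %d of blob results", page)
		for _, item := range resp.Segment.BlobItems {
			fmt.Println(*item.Name)
		}
		page++
	}
}
```

With 400-500 workspaces this should normally complete in a single page, hence the expectation of one listing request per backend.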

In the meantime, can you check for log lines like [TRACE] Getting page %d of blob results and send them to us?

diofeher avatar Jul 22 '25 15:07 diofeher

Hey @paulblum00, I took some time to look at our code and now have a better understanding of what's happening:

You're possibly being rate limited at one of these two points:

https://github.com/opentofu/opentofu/blob/dbe2111be71a21a3d155ebbcf2f641d092224bdb/internal/backend/remote-state/azure/arm_client.go#L168-L170

https://github.com/opentofu/opentofu/blob/dbe2111be71a21a3d155ebbcf2f641d092224bdb/internal/backend/remote-state/azure/arm_client.go#L215-L218

If you can give us more logs on this operation, it would be helpful to understand where the bottleneck is.

diofeher avatar Jul 22 '25 16:07 diofeher

@diofeher I have generated a log file with the log level set to TRACE for a run of terragrunt plan --all that ran into the 429 error. However, it is over 1 GB in size and I'm not sure whether it contains confidential information, so I would prefer to send it to you in private.

I couldn't get any results for grep "Getting page" though. Is it possible that the log entry has a different wording?

paulblum00 avatar Jul 24 '25 09:07 paulblum00

I think that grep is probably sufficient. If the pagination is not being activated, it's more likely an issue with all of the state requests from the multiple concurrent invocations.

I'm curious if there's something we can do in tofu init to remove the need for the workspace list call.

cam72cam avatar Jul 24 '25 11:07 cam72cam

Hmm that's strange, because we would see this log even if you're only doing one request: https://github.com/opentofu/opentofu/blob/329a6a6adb3efc03ac116e2e02c99c42872c9f93/internal/backend/remote-state/azure/backend_state.go#L176

You would see: "Getting page 0 of blob results"

diofeher avatar Jul 24 '25 11:07 diofeher

@diofeher I just realized this has nothing to do with object storage 🤦 It's failing before we even get to that step.

The call to ListKeys in getBlobClient and getContainersClient is what checks the access key. We don't cache the clients, so every call to a state function that uses either of them makes additional ListKeys requests!
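To illustrate the idea, here is a minimal sketch of caching the key lookup so ListKeys is resolved at most once per backend instance. The type and function names (armClient, getAccountKey, listKeysFromAzure) are hypothetical and do not match the actual OpenTofu code.

```go
// Hypothetical sketch: resolve the storage account key once and reuse it,
// instead of calling ListKeys on every state operation.
package azure

import (
	"context"
	"sync"
)

type armClient struct {
	resourceGroupName  string
	storageAccountName string

	keyOnce    sync.Once
	accountKey string
	keyErr     error
}

// getAccountKey calls ListKeys at most once and caches the result for the
// lifetime of the client, so repeated state operations don't each count
// against the ListKeys rate limit.
func (c *armClient) getAccountKey(ctx context.Context) (string, error) {
	c.keyOnce.Do(func() {
		c.accountKey, c.keyErr = c.listKeysFromAzure(ctx)
	})
	return c.accountKey, c.keyErr
}

func (c *armClient) listKeysFromAzure(ctx context.Context) (string, error) {
	// The actual call to the Storage Accounts ListKeys API would go here.
	return "", nil
}
```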

cam72cam avatar Jul 24 '25 11:07 cam72cam

@cam72cam Hey, sorry to interrupt, but I think this issue isn't done yet. There seems to be another 429 error. I'm not sure if I should open a new issue, as this is technically a different error with the same status code.

I ran terragrunt run-all init on the same big infrastructure with the updated version of tofu and it ran into the following error:

  ╷
  │ Error: Failed to get existing workspaces: Error retrieving keys for Storage Account "REDACTED": storage.AccountsClient#ListKeys: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="SubscriptionRequestsThrottled" Message="Number of requests for subscription 'REDACTED' and operation 'POST/SUBSCRIPTIONS/RESOURCEGROUPS/PROVIDERS/MICROSOFT.STORAGE/STORAGEACCOUNTS/LISTKEYS' exceeded the backend storage limit. Please try again after '6' seconds."
  │ 
  │ 
  ╵

This one doesn't seem to mess up the entire run as much, because apparently it suffices to wait a few seconds and retry. I can probably fix it on my end by adding it as a retryable error, but maybe it should be treated as one by default.

paulblum00 avatar Aug 13 '25 07:08 paulblum00

I should have added a 429 handler while I was in there. Let me discuss this with the core team during triage and get back to you.

cam72cam avatar Aug 18 '25 20:08 cam72cam

The core maintainers discussed this more during our triage meeting and concluded that it would not really be appropriate to respond to the "429 Too Many Requests" error code by making more requests unless the response also includes information about how long the client should wait before making another request (e.g. the Retry-After header field) and the client waits at least that long before retrying.

The documentation suggests that Azure Resource Manager does return Retry-After as part of a 429 error response:

When you reach the limit, you receive the HTTP status code 429 Too many requests. The response includes a Retry-After value, which specifies the number of seconds your application should wait before sending the next request. If you send a request before the retry value elapses, your request isn't processed and a new retry value is returned.

The documentation also currently links to a page apparently titled "Retry guidance for Azure services", but the target of that link no longer seems to exist and instead redirects to a generic page about service reliability.

With all of this said: it seems reasonable to add special treatment for the 429 error code when the response also includes Retry-After where the client will sleep until the time has elapsed and then retry the operation. If we do that then we should presumably have an upper limit on both the total number of requests we will attempt and the total elapsed time we'll keep retrying for, but we'll need to select those limits heuristically.
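As a rough illustration of that approach (not the actual OpenTofu implementation), a retry loop that honors Retry-After with heuristic caps could look like the sketch below. The function name, the cap values, and the assumption that the request has no body (or a replayable one) are all choices made for the example.

```go
// Rough sketch: retry a request on HTTP 429 only when the server supplies a
// Retry-After value, with caps on both attempts and total elapsed time.
package azure

import (
	"context"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

const (
	maxAttempts    = 5               // heuristic cap on total requests
	maxRetryBudget = 2 * time.Minute // heuristic cap on total elapsed time
)

func doWithRetryAfter(ctx context.Context, client *http.Client, req *http.Request) (*http.Response, error) {
	deadline := time.Now().Add(maxRetryBudget)
	for attempt := 1; ; attempt++ {
		resp, err := client.Do(req.Clone(ctx))
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil
		}
		resp.Body.Close()

		// Only retry when the server says how long to wait; retrying blindly
		// would just add to the throttling.
		secs, convErr := strconv.Atoi(resp.Header.Get("Retry-After"))
		wait := time.Duration(secs) * time.Second
		if convErr != nil || attempt >= maxAttempts || time.Now().Add(wait).After(deadline) {
			return nil, fmt.Errorf("throttled by server (HTTP 429); giving up after %d attempt(s)", attempt)
		}

		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(wait):
		}
	}
}
```

Giving up when Retry-After is absent matches the conclusion above: without guidance from the server, additional requests would only make the throttling worse.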

We don't expect to be able to improve this any further beyond retrying when the server directs us to.

apparentlymart avatar Aug 25 '25 17:08 apparentlymart

This morning I once again got the original error:

  ╷
  │ Error: Failed to get existing workspaces: Error retrieving keys for Storage Account "saopspspokewest001": storage.AccountsClient#ListKeys: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="TooManyRequests" Message="The request is being throttled as the limit has been reached for operation type - Read_ObservationWindow_00:05:00. For more information, see - https://aka.ms/srpthrottlinglimits"
  │ 
  │ 
  ╵

I guess this happens when different machines are making requests to the same remote state simultaneously, so reducing the number of local requests can only do so much to mitigate it.

I can totally see that it's difficult to really fix this. All I can say is that Terraform seems to be able to prevent this error: we have been using it with the same infra for years and never ran into it. We decided to make the final switch to Tofu a few days ago and have already had a pipeline fail because of this. I hope that is enough info for your team to decide on the priority of the issue.

Also, maybe it would be more efficient to fix this in cooperation with Terragrunt? I don't know how these programs work under the hood, but in our case it seems like a single Terragrunt call results in 400 requests to the same storage account because every Tofu instance fetches its respective state file. Shouldn't it be possible for Terragrunt to manage the download of state files in batches? Or am I vastly underestimating the complexity of that hypothetical feature?

paulblum00 avatar Aug 26 '25 07:08 paulblum00

@paulblum00 It should be noted that there is an ongoing refactor of the Azure backend that will likely resolve some of these issues: https://github.com/opentofu/opentofu/pull/3034

diofeher avatar Aug 26 '25 11:08 diofeher

To everyone interested in this thread: @lbordowitz has done great work refactoring the Azure backend, and it is now on main: https://github.com/opentofu/opentofu/pull/3034

Would you be willing to test with the upstream code to see if you still experience those issues?

diofeher avatar Sep 04 '25 20:09 diofeher

@paulblum00 OpenTofu 1.11-beta is out with a newly-rewritten azurerm backend. Could you give it a try and see if you're still facing this issue?

lbordowitz avatar Oct 28 '25 20:10 lbordowitz