
Terraform CLI: Option to specify maximum request rate

Open andersthorbeck opened this issue 3 years ago • 11 comments

Current Terraform Version

Terraform v1.1.9
on darwin_amd64
+ provider registry.terraform.io/hashicorp/azurerm v3.4.0

Use-cases

Successfully running terraform refresh and terraform apply on big states, even in the face of strict rate limits.

The issue I'm experiencing is that I have a state with a great number of resources of the same type (in this case > 500 private_dns_zone_virtual_network_link resources), and the remote API rate-limits requests pertaining to that resource type (in this case, Azure's limit is 500 DNS operations per 5 minutes). I am now consistently hitting the rate limit on this state, causing terraform refresh to fail, and thus my IaC continuous deployment workflow to consistently fail.

I am seeking a way to limit the request rate made by Terraform against a certain known (to me) rate limit in a provider, such that terraform refresh and terraform apply CLI commands can still succeed even if the states contain so many resources that they would otherwise (for normal fast network connections) hit the rate limit and fail.

Attempted Solutions

One already attempted and insufficient solution is to pass the option -parallelism=1, effectively disabling concurrent requests against the provider. This helps, but even with sequential requests I am hitting the rate limit.

Another solution is to split the terraform state into several smaller, but still cohesive, terraform states. This has (in my case) already been done, and any further partitioning of the current state (which at this point already contains resources of only a single type) will lead to a poor and arbitrary state partitioning strategy. Still there are too many requests being made for this one cohesive terraform state.

A third solution is to literally throttle the network bandwidth of the computer performing the terraform refresh and terraform apply, in my case using the macOS Network Link Conditioner (found here) on my local laptop (instead of on a cloud-based CI/CD pipeline, as would normally be the case). This works, but it's an extremely heavy-handed way of "solving" the issue: it slows down all software relying on internet connectivity (which severely impacts my productivity), and it's a manual, very time-consuming procedure which doesn't lend itself to a CI/CD pipeline.

A fourth "solution" is to actually remove resources. In my case, a number of these resources turned out to be superfluous and could be deleted, but in the general case this isn't really a solution per se. Even so, this solution alone is insufficient, because you still hit the rate limit when refreshing the state before you can even apply the changes to reduce the number of resources; I was forced to use the throttling solution in conjunction with this solution.

Proposal

I'm envisioning a solution where you could specify as an option to the terraform CLI commands (at least terraform refresh, terraform plan and terraform apply) the maximum request rate that the command should perform against the providers.

Something like

$ terraform refresh -max-request-rate-per-minute=100

which would ensure the request rate would never exceed 100 requests per minute.

Or potentially something which allows you to specify the rate limit as specified by the provider, something like

$ terraform refresh -max-request-rate=500/5m  # max 500 requests per 5 minutes
$ terraform refresh -max-request-rate=10/s  # max 10 requests per second
$ terraform refresh -max-request-rate=12000/h  # max 12000 requests per hour

which might give added flexibility to ensure that the request rate doesn't exceed e.g. 500 requests in a 5-minute timespan, but might conceivably still perform 400 requests in the first minute, if fewer than 500 requests are required.

References

See my corresponding StackOverflow question.

Issue https://github.com/hashicorp/terraform/issues/28152 seems related, although the focus here seems more on specifying parallelism limitations inline. That would not be a sufficient solution in this case, as I've already disabled parallelism entirely, and the rate limit throttling is caused by the sum total of all resources in the state, not just the resources defined in a single or a few (for_each-ed) resource blocks.

andersthorbeck avatar May 20 '22 13:05 andersthorbeck

Thanks for this thorough issue report!

crw avatar May 20 '22 22:05 crw

Thanks for sharing this use-case, @andersthorbeck!

I think one thing we'll want to research a bit before moving forward here is whether the sort of constraint Terraform Core itself can insert would be sufficient to meet the need.

Specifically, Terraform Core can only control when it makes the requests to the providers, and cannot control what providers do in response to those requests. A single request to a provider could potentially make multiple requests to the remote system, or make none at all, depending on how the provider is built and how the remote API is designed.

The most robust answer would be for a provider itself to perform the throttling, since it could therefore do so at the level of the real API calls that the rate limits are applying to and therefore throttle itself even when a particular provider action requires multiple API calls, and not throttle itself when it's only doing local work that isn't subject to rate limiting. A provider can also potentially bring awareness of the specific rate limit strategy of a particular remote API and therefore do something more dynamic, such as monitoring a response header that describes the rate limit status so it can take other concurrent operations (outside of Terraform) into account too.

We should also try to understand whether this need warrants a global throttle over all requests, a per-provider throttle, or something else in between. Making it more precise than just a global throttle could allow things to complete faster in situations where multiple providers are mixed in a single configuration, which often arises at least with the "utility" providers that just perform local computation.

apparentlymart avatar May 23 '22 14:05 apparentlymart

I know this issue is about the CLI implementation, but I feel the issue should be addressed via the configuration language. Many rate-limiting issues I have found involve a for_each:

resource "slack_usergroup" "groups" {
  for_each = local.groups
}

Having an instruction to sleep and relax may be the most flexible choice here.

resource "sleep" "sleep_and_relax" {
  duration = 250
}

resource "slack_usergroup" "groups" {
  for_each = local.groups
  for_each_sleep_duration = 100
}

The for_each_sleep_duration could later be integrated into the provider SDK as a hint.

nvcnvn avatar May 25 '22 09:05 nvcnvn

You raise good points, @apparentlymart. I had not considered that Terraform Core might not know how many API calls are triggered on the provider side by the various Terraform operations. I agree that providers can make more apt decisions about where and how to throttle API calls than Terraform Core alone can. From that point of view, it makes sense to delegate this functionality to the provider, as an ideal-world solution.

However, I suspect that for many providers it will realistically take a long time to implement awareness of their own rate limits across all their corresponding APIs, and that certain providers may never prioritize this. So, from a pragmatic point of view, I believe it is still worth considering at least a hybrid approach, where a global (i.e. Terraform Core) throttle option is available in case a provider has not implemented throttling matching its own APIs' rate limits. It would be more work overall for both Terraform Core and the individual providers to each implement their own support, but it seems to me the most flexible pragmatic solution: the global throttle becomes a last-ditch measure, but at least a possible one.

andersthorbeck avatar May 25 '22 14:05 andersthorbeck

Hi again @andersthorbeck. Thanks for confirming!

Since Terraform Core is already a pretty complicated codebase with many interacting features, our usual definition of "pragmatism" is unfortunately kinda the opposite of your framing here: everything should be implemented in providers unless there's a technical reason why it cannot be, or if it would behave better when implemented in Terraform Core. Anything in Terraform Core must necessarily be more complicated in order to meet many possible use-cases at once, whereas something in a provider can focus only on the needs of that particular target system. Rate limiting strategies vary between systems, so I expect it will be simpler to implement a throttle focused on one particular target system rather than a generic solution that works well for everything.

Given that your original framing here suggested that this was related to Azure in particular -- and, if I'm reading right, possibly for one specific Azure service that happens to have an aggressive rate limit -- I would suggest starting by sharing this use-case in the Azure provider repository. You might be right that the Azure provider team would find this less important to implement than other things in the provider, but that is a decision for that team to make, and isn't sufficient reason to implement something in Terraform Core where we also need to carefully prioritize our work to ensure we're implementing the most impactful things, and that we can continue to maintain all of the many core features efficiently in future.

If the Azure provider team does find that this would be more practical to implement with some help from Terraform Core features then we can have that discussion, but I'd like to let them take the lead on determining that. Within the bounds of the provider protocol today in principle a provider could:

  • If necessary, offer optional arguments in the provider configuration to control rate limits. (I expect that would be necessary only if these rate-limits vary significantly between customers and cannot be discovered automatically via the API.)
  • Either configure the existing platform SDK client to block and retry when a rate limit is exceeded, if the SDK offers such a feature, or wrap the SDK client with something new that does. From Terraform Core's perspective, this would just look like a particular request taking longer, which Terraform Core should tolerate just fine.
  • If it seems likely that rate limiting will sometimes cause a long delay, consider also supporting cancellation via the Stop message to immediately halt backoff/retry and return an error, so that Ctrl+C will work to interrupt the long-running operation.

For Azure in particular, since Azure already seems to have well-documented patterns for handling rate limits and throttling, hopefully the official clients offer mechanisms for implementing what's described on that page, but the Azure Go SDK is large and apparently spread over several codebases, so I'll have to leave the Azure provider team to determine whether this is something they can "just enable" in the existing SDK or something the provider itself would need to implement.

apparentlymart avatar May 25 '22 18:05 apparentlymart

I've just been looking at the same issue for managing Azure DNS (with a rate limit of 500 reads per 5 minutes).

The rate limit documentation at https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling seems to say the limit is per Principal ID, so a fifth option for the OP might be to create multiple azurerm providers in the same terraform project with different underlying principals and manually balance which resources are managed by which provider.

It's not ideal, but if it works it would avoid the need to break the resources up into separate projects.

It's probably something for the individual providers to implement, but if there were an option to specify multiple credentials for a single provider and have it transparently round-robin them it would be a simple way to work around rate limits :-).

mikeclayton avatar Mar 14 '23 14:03 mikeclayton

This should not be just at the provider level. We need a global rate limiter as well, which should also affect planning.

adubkov avatar May 15 '24 22:05 adubkov

Hi @adubkov! Thanks for that feedback.

Can you say more about why you need a global rate limit?

So far we've only seen examples where a particular platform or particular service has an unusually-onerous rate limit, in which case a global rate limit isn't needed. I assume you have some other situation in mind when you give this feedback, so I'd like to understand more about it.

Thanks!

apparentlymart avatar May 15 '24 22:05 apparentlymart

@apparentlymart Probably for the same reason the parallelism option already exists, which only partly addresses rate-limiting issues. If we have that, then why not extend it? Each provider may have different limits, but rate limiting is a common concern across all providers, so it's a global parameter. Why not have a global rate limit? We are talking not only about provider calls but also backend state-file calls, if I understand correctly. It would be nice if, instead of limiting with parallelism, we could just specify the rate limits according to our needs, with the necessary sleep time. This is more of a nice-to-have, but it would certainly speed up plan/apply operations in azurerm.

mtrin avatar May 22 '24 04:05 mtrin