
Terraform performance with a large number of resources

zeridon opened this issue 8 years ago · 14 comments

Hi Team,

I have the following configuration:

  • one VPC
  • 6 different networks inside the VPC
  • 2 NAT gateways
  • 36 OpsWorks stacks
  • 300 or so security groups
  • 1200 or so OpsWorks layers

In total about 1600 resources in my configs.
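
As a quick sanity check, that count can be confirmed from the state itself (a minimal sketch; assumes the workspace is already initialized against the remote state):

```shell
# List every resource tracked in state and count the lines
terraform state list | wc -l
```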

Creating this whole config in AWS took about 30 minutes; most of that time can probably be attributed to waiting and connection handling.

A plan with a simple change (e.g. adding 6 new layers) takes around 8 minutes. Most of that can probably be attributed to connection handling and waiting, but there is a minute or so after the remote state is retrieved where Terraform just stands there doing magic.

Memory usage is 2+ GB

The state file I have (stored remotely in S3) is 8.6 MB.

Some quick observations:

2017/10/17 16:22:51 [INFO] command: backend initialized: *s3.Backend
2017/10/17 16:22:51 [DEBUG] [aws-sdk-go] DEBUG: Response s3/GetObject Details:
2017/10/17 16:22:54 [INFO] terraform: building graph: GraphTypeInput
2017-10-17T16:23:01.016+0300 [DEBUG] plugin: starting plugin: path=/home/zeridon/work/repos/cloud-infrastructure/region-terraform/.terraform/plugins/linux_amd64/terraform-provider-aws_v1.0.0_x4 args=[/home/zeridon/work/repos/cloud-infrastructure/region-terraform/.terraform/plugins/linux_amd64/terraform-provider-aws_v1.0.0_x4]
2017-10-17T16:23:21.446+0300 [DEBUG] plugin.terraform-provider-aws_v1.0.0_x4: 2017/10/17 16:23:21 [INFO] Building AWS region structure

I also see the following sequence repeated a large number of times (even from the start, so not due to timeouts or exceeded limits).

2017-10-17T16:23:22.116+0300 [DEBUG] plugin.terraform-provider-aws_v1.0.0_x4: 2017/10/17 16:23:22 [DEBUG] [aws-sdk-go] DEBUG: Request sts/GetCallerIdentity Details:
2017-10-17T16:23:22.803+0300 [DEBUG] plugin.terraform-provider-aws_v1.0.0_x4: 2017/10/17 16:23:22 [DEBUG] [aws-sdk-go] DEBUG: Response sts/GetCallerIdentity Details:
2017-10-17T16:23:22.805+0300 [DEBUG] plugin.terraform-provider-aws_v1.0.0_x4: 2017/10/17 16:23:22 [DEBUG] [aws-sdk-go] DEBUG: Request iam/GetUser Details:
2017-10-17T16:23:23.554+0300 [DEBUG] plugin.terraform-provider-aws_v1.0.0_x4: 2017/10/17 16:23:23 [DEBUG] [aws-sdk-go] DEBUG: Response iam/GetUser Details:
2017-10-17T16:23:23.555+0300 [DEBUG] plugin.terraform-provider-aws_v1.0.0_x4: 2017/10/17 16:23:23 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DescribeAccountAttributes Details:
2017-10-17T16:23:24.208+0300 [DEBUG] plugin.terraform-provider-aws_v1.0.0_x4: 2017/10/17 16:23:24 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DescribeAccountAttributes Details:
$ egrep 'Response ec2/DescribeAccountAttributes' terraform-debug.log | wc -l
539

These sequences are usually NOT followed by anything that can easily be associated with an AWS API call that actually retrieves resource info.

The questions I have:

  • Is this memory usage expected?
  • What is using this time after the state download, and can it be optimized? (A debug log can be provided.)
  • The sequence described in the second code block repeats in bursts of 10 (which corresponds to -parallelism=10), but why is the auth info not kept between invocations?

zeridon · Oct 17 '17

Hi @zeridon,

Thanks for the use case here. Could you provide some more details about your configuration: how many "provider" blocks you have, whether you are using modules, what the layout is, etc.?

These are known issues, and optimization with large configurations is something we do want to tackle.

2 GB of memory usage is not unexpected, especially with the AWS provider, where the SDK has fairly hefty client instances for each service.

The time after retrieving the state where Terraform is "doing magic" is likely a problem with the graph transitive reduction, which I have fixed in master.

The re-authentication by the provider may be the lowest-hanging fruit here. The provider does often need to be reconfigured, and doing it every time isn't really an issue for most configurations, so we haven't had the need to optimize that yet.

If you haven't tried it, with this many resources you may be able to speed things up considerably with a higher -parallelism setting: with 1600 resources and the default of 10 concurrent operations, the fastest this could possibly complete is roughly 160 * (time of the fastest resource operation). The default of 10 is somewhat arbitrary, and we haven't really pushed it to see how high is reasonable.
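
For example (the value here is illustrative; raise it gradually and watch for AWS rate limiting):

```shell
# Plan and apply with a higher concurrency cap than the default of 10
terraform plan -parallelism=25
terraform apply -parallelism=25
```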

jbardin · Oct 19 '17

Hi @jbardin ,

Sanitized project can be found at: https://gist.github.com/zeridon/9cd3fa1c85cbc2d72c4914b58c7b03cf

Short answers to your main questions:

  • How many provider blocks - 1, just in main.tf to set the AWS region (a minimal sketch is below)
  • Modules - yes
  • Layout - https://gist.github.com/zeridon/9cd3fa1c85cbc2d72c4914b58c7b03cf#file-dir-tree
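
That single provider block looks roughly like this (the region value here is illustrative, not the real one):

```hcl
provider "aws" {
  region = "eu-west-1"
}
```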

Dependency tree is:

main.tf
 |- aws-vpc
 |- sg-infra
   |- eip-shortcuts
     |- eip
 |- stacks
   |- stack
customer-xxxx.tf
 |- customer-layer (sg, 1 or more opsworks layers in different stacks)
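
A hedged sketch of what one customer-xxxx.tf might look like in this layout (the module name, source path, and arguments are illustrative, not taken from the real project):

```hcl
module "customer_xxxx" {
  source   = "./modules/customer-layer"
  stack_id = "${module.stacks.main_stack_id}"
}
```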

The sanitization of the project consisted of:

  • removed custom names
  • removed custom IPs
  • simplified representation of objects (e.g. only one rule per security group, only one sample file per customer, etc.)

On your observations:

  • Indeed, graph generation seems to be quite heavy - it takes about 15-20 seconds.
  • Re-auths really do seem to be the low-hanging fruit here.
  • -parallelism=10 seems to be okay-ish, as with 20 or 30 I run into rate limits. (No, I haven't looked into backoff strategies for it.)

One thing I forgot: this was observed with Terraform 0.9.5 and 0.10.7. Terraform 0.10.7 is a bit lighter on the memory side (around 300 MB or so, if I recall correctly).

zeridon · Oct 19 '17

@jbardin do you have any suggestions on how we can improve the runtimes of our Terraform plans? We also have a large number of resources, and it seems to take several minutes just to refresh state. A TF plan takes ~15 minutes to run on Terraform Enterprise.

ktham · Oct 17 '18

@ktham,

Several minutes to refresh a large number of resources is not unexpected; the API calls to check the state of each resource can only happen so quickly. One option you have is to try carefully increasing the -parallelism value to run more API calls at a time.

jbardin · Oct 26 '18

That's understandable, though I would expect TF to be able to issue and handle quite a lot of API calls at once, even on a single-core machine.

OK, I'll look into the -parallelism value, though I'm not sure I can set that, because we use Terraform Enterprise.

ktham · Oct 26 '18

I ran some benchmarks:

Config

  • 61 resources
  • Using a VPN connection that is decent
  • 2017 MacBook Pro 13"
  • Using --auto-approve

TF Destroy

  • Command: terraform destroy --auto-approve --parallelism 10 terraform/
  • Default: 45s
  • Parallelism 10: 53s

TF Apply

  • Command: terraform apply --auto-approve --parallelism 10 terraform/
  • Default: 1m43s
  • Parallelism 10: 1m46s

I guess my project doesn't benefit from parallelism?

Is there any way to show a nice dependency graph? I found terraform graph, but it's not very readable. I can't get terraform show to work either: the usage is terraform show [options] [path], but terraform show terraform/ doesn't work. All my .tf files are in a subdirectory called terraform, and my .terraform folder is one level up from the terraform folder.

Clete2 · Nov 07 '18

Hi @Clete2,

10 is the default parallelism, so you were only seeing variance within the API calls themselves.

The graph command is really the only way to show the actual dependency graph, but I also have yet to find a good method for viewing large graphs in general.
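
One commonly used pipeline renders the graph output with GraphViz (this assumes the dot tool is installed locally):

```shell
# terraform graph emits DOT format; render it to an SVG for easier browsing
terraform graph | dot -Tsvg > graph.svg
```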

jbardin · Nov 07 '18

Judging by another thread, the parallelism really depends on the number of cores in a system, which is kind of ridiculous for a program that spends most of its time waiting on API calls: https://github.com/hashicorp/terraform/issues/11766

nielsole · Jan 04 '19

@nielsole,

No, the number of concurrent calls is not affected by the number of cores in the system. It's only artificially limited by the -parallelism value, because there is no universally "good" value for all providers and configurations. Note that this is all unrelated to the original issue here, which is the efficiency of large graph transformations and has nothing to do with concurrent provider operations.

jbardin · Jan 04 '19

Terraform seems to spend a lot of its time computing these dependency graphs, and I am curious why a dependency graph of just a few hundred nodes should take several minutes to compute. Is there a whitepaper or discussion I can read somewhere?

ToonSpinISAAC · Apr 18 '19

Hi @ToonSpinISAAC,

The issues in this thread are mainly concerned with the parallelization of the API calls to resources, which is capped at 10 by default. Very large numbers of resources can take a long time to process when there are many slow API calls.

A graph of a few hundred nodes should not take minutes to compute. The primary graph related performance issue is in highly connected graphs, and can be referenced in issue #18981.

jbardin · Apr 22 '19

I suspect a lot more people have run into this issue than just the 10 people who have +1'd it.

When Terraform plans got too slow I broke our terraform codebase up into multiple separate states with the help of Terragrunt. I suspect other people have gone down the same path, either with Terragrunt or separate git repositories.

This has helped, but even with multiple states we still run into performance issues when one state grows large.
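
A hypothetical shape of such a split (the directory names and grouping are illustrative only):

```shell
# Each directory owns its own, smaller state and is planned independently:
#   infra/network/    VPC, subnets, NAT gateways
#   infra/security/   security groups
#   infra/opsworks/   stacks and layers
cd infra/network && terraform plan

# With Terragrunt, the split states can still be driven together:
terragrunt run-all plan
```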

calebAtIspot · Jul 26 '24

We realized our code base was too big. For some projects we broke it up. For others we used more TF variables and multiple state files to deploy different parts. Generally I try to keep it under 200 resources per deployment.

Clete2 · Jul 26 '24

> We realized our code base was too big.

This is shorthand for "too big for all currently available releases of Terraform to handle". Name another language that has superlinear slowdown when there are "too many" objects.

This should be fixed, regardless of whether its workarounds have become best practice in the meantime.

bukzor · Jul 31 '24