Significant performance degradation with Terraform 1.3.x and local state

Open danischm opened this issue 3 years ago • 2 comments

Terraform Version

Terraform v1.3.3

Terraform Configuration Files

https://github.com/danischm/tf-perf-test

Debug Output

2022-10-21T13:06:59.469+0200 [TRACE] statemgr.Filesystem: have already backed up original terraform.tfstate to terraform.tfstate.backup on a previous write
2022-10-21T13:06:59.567+0200 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 29
2022-10-21T13:06:59.567+0200 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate
2022-10-21T13:06:59.639+0200 [TRACE] vertex "null_resource.test3": visit complete
2022-10-21T13:06:59.640+0200 [TRACE] statemgr.Filesystem: have already backed up original terraform.tfstate to terraform.tfstate.backup on a previous write
2022-10-21T13:06:59.737+0200 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 30
2022-10-21T13:06:59.738+0200 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate
2022-10-21T13:06:59.804+0200 [TRACE] vertex "null_resource.test0": visit complete
2022-10-21T13:06:59.805+0200 [TRACE] statemgr.Filesystem: have already backed up original terraform.tfstate to terraform.tfstate.backup on a previous write
2022-10-21T13:06:59.900+0200 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 31
2022-10-21T13:06:59.900+0200 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate
2022-10-21T13:06:59.968+0200 [TRACE] vertex "null_resource.test6": visit complete
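The pattern in this trace (one "writing snapshot" line per completed vertex) can be checked over a full log with a small script. The following is only a sketch: the function name `summarize_trace` is hypothetical, and it assumes every line of interest follows the timestamp and `[TRACE]` layout shown in the excerpt above.

```python
import re
from datetime import datetime

# Matches lines shaped like the excerpt above, e.g.
# "2022-10-21T13:06:59.469+0200 [TRACE] statemgr.Filesystem: ..."
LINE_RE = re.compile(r"^(\S+) \[TRACE\] (.*)$")


def summarize_trace(lines):
    """Count snapshot writes and completed vertex visits in trace output."""
    writes = 0
    visits = 0
    first = None
    last = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a TRACE line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")
        if first is None:
            first = stamp
        last = stamp
        message = match.group(2)
        if "statemgr.Filesystem: writing snapshot at" in message:
            writes += 1
        elif message.endswith("visit complete"):
            visits += 1
    elapsed = (last - first).total_seconds() if first is not None else 0.0
    return {"writes": writes, "visits": visits, "elapsed_s": elapsed}
```

Run over the twelve lines above, this reports three writes for three completed vertices in roughly half a second, i.e. about one state write per resource.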

Expected Behavior

The second terraform apply should complete in less than a second (using the provided configuration) with no changes to the local state file. This is the observed behavior with TF <1.3.0.

Actual Behavior

With TF 1.3.0+, the local state file is rewritten once for every resource in the configuration even though there are no changes. This is also visible in the local state file, where serial is incremented by the number of resources. With TF <1.3.0 the second, no-op terraform apply takes less than a second; with TF 1.3.0+ it takes 3+ minutes to complete. Increasing the number of resources and/or the size of the state file worsens the issue.

Steps to Reproduce

  1. terraform apply -auto-approve -refresh=false
  2. terraform apply -auto-approve -refresh=false
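To quantify the effect between the two steps, the top-level `serial` field of terraform.tfstate can be compared before and after the second apply. This is a minimal sketch: the helper names `parse_serial`, `read_serial` and `serial_delta` are hypothetical, and it assumes a copy of the state file was saved before step 2.

```python
import json


def parse_serial(state_json):
    """Return the top-level "serial" counter from state file contents."""
    return json.loads(state_json)["serial"]


def read_serial(path="terraform.tfstate"):
    """Read a local state file and return its serial."""
    with open(path, encoding="utf-8") as handle:
        return parse_serial(handle.read())


def serial_delta(before_json, after_json):
    """Number of serial increments between two snapshots of the state file."""
    return parse_serial(after_json) - parse_serial(before_json)
```

A delta of 0 matches the expected no-op behavior; a delta close to the number of resources matches the behavior reported here.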

Additional Context

A git bisect has revealed the following 'bad' commit: https://github.com/hashicorp/terraform/commit/72dd14ca5c023d32bed64ac887275fbd9d3569a9

References

No response

danischm, Oct 21 '22 14:10

Thanks for reporting this, @danischm.

I expect that the root cause here is that the local state implementation doesn't implement the state storage API as documented:

The state storage API distinguishes between creating a new state snapshot in memory vs. persisting the latest snapshot to durable storage. Terraform Core repeatedly calls the first of these operations during its work, but calls the "persist" operation much less often on the assumption that it is significantly more expensive.

Unfortunately the local backend treats the first operation as a request to persist to local disk and the second as a no-op. We've been aware of this for some time but have not acted to correct it, because writing to local disk is generally faster than accessing an API over the network, and local state is primarily for initial development until someone is ready to activate a proper state storage mechanism.

However, we can see here that the change in treatment of "no-op" operations has caused Terraform Core to call the "update snapshot in memory" operation significantly more often than before when there are many resource instances, which makes the local state storage design flaw more significant.
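The difference between the two behaviours can be illustrated with a toy model. This is not Terraform's code: the class and method names below are hypothetical stand-ins for the "update snapshot in memory" and "persist" operations, and the model deliberately ignores any check for whether the snapshot actually changed.

```python
class DocumentedStateManager:
    """Toy model of the documented API: writes stay in memory until persisted."""

    def __init__(self):
        self.snapshot = None
        self.disk_writes = 0

    def write_state(self, snapshot):
        # Cheap: only replaces the in-memory snapshot.
        self.snapshot = snapshot

    def persist_state(self):
        # Expensive: the only place durable storage is touched.
        self.disk_writes += 1


class LocalLikeStateManager(DocumentedStateManager):
    """Toy model of the local backend behaviour described above."""

    def write_state(self, snapshot):
        # Every in-memory update is also written to disk.
        self.snapshot = snapshot
        self.disk_writes += 1

    def persist_state(self):
        # Nothing left to do: the data is already on disk.
        pass


def simulate_apply(manager, resource_count):
    """One write_state call per resource instance, then a single persist."""
    for index in range(resource_count):
        manager.write_state({"visited": index + 1})
    manager.persist_state()
    return manager.disk_writes
```

Under these assumptions, `simulate_apply(DocumentedStateManager(), 1000)` touches storage once, while `simulate_apply(LocalLikeStateManager(), 1000)` touches it 1000 times, which is why more frequent in-memory updates hurt only the local case.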

As a short-term fix for the v1.3 series I expect we can lightly modify the behaviour to skip creating a new in-memory snapshot for "no-op" changes, although we will need to watch out for the special situation where the resource instance itself isn't changing but its precondition and postcondition results are changing: in that case we should still record the updated check results table.
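The rule proposed here can be written down as a small predicate. This is a sketch of the idea only, not the change that was eventually made: the function name, the `"no-op"` action label and the shape of the check results are all assumptions for illustration.

```python
def needs_new_snapshot(action, old_check_results, new_check_results):
    """Decide whether a resource instance visit should update the snapshot.

    action: hypothetical label for the planned change, e.g. "no-op" or "update".
    old_check_results / new_check_results: hypothetical precondition and
    postcondition results recorded before and after the visit.
    """
    if action != "no-op":
        # A real change to the resource instance is always recorded.
        return True
    # No-op on the resource itself: record only if the check results moved.
    return old_check_results != new_check_results
```

For example, `needs_new_snapshot("no-op", ["pass"], ["pass"])` is false, while `needs_new_snapshot("no-op", ["pass"], ["fail"])` is true, covering the special case of changing check results on an otherwise unchanged instance.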

It might also be time to fix this long-standing design flaw in the local state implementation so it implements equivalent behavior to all of the other storage implementations. Then we are less likely to be caught out by local-storage-specific regressions in future work.

Thanks again!

apparentlymart, Oct 21 '22 15:10

Terraform does try to prevent writing state with no changes. Taking a quick look at the state output during the apply process, it seems we are alternating empty checks between null and [], which causes each NoOp instance to trigger a full state write. State values try to normalize empty vs. null values to prevent these kinds of changes, so there's probably something that can be changed at that layer too.
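Why null vs. [] defeats a "has the state changed?" comparison, and how normalization avoids it, can be shown with a short sketch. The key name `check_results` and both helper functions are assumptions for illustration; this does not reproduce Terraform's own comparison or serialization code.

```python
import json


def normalize_checks(state):
    """Return a copy in which an empty or missing check list is stored as None."""
    normalized = dict(state)
    if not normalized.get("check_results"):
        normalized["check_results"] = None
    return normalized


def state_changed(old, new, normalize=False):
    """Compare two state snapshots by their serialized form."""
    if normalize:
        old, new = normalize_checks(old), normalize_checks(new)
    return json.dumps(old, sort_keys=True) != json.dumps(new, sort_keys=True)
```

With `old = {"resources": [], "check_results": None}` and `new = {"resources": [], "check_results": []}`, the plain comparison reports a change (and would trigger a write), while the normalized comparison does not.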

jbardin, Oct 21 '22 16:10

I can confirm that this fixes the issue. Thanks for the quick turnaround!

danischm, Nov 01 '22 21:11
