`terraform test`: dump the complete state file when a destroy operation fails.

Open lorengordon opened this issue 1 year ago • 15 comments

Terraform Version

terraform 1.6

Use Cases

I'm interested in using the test command in the upcoming terraform 1.6 release. We currently have a similar workflow using the tftest pytest library. One thing we've found useful is that tftest leaves the state behind when the destroy action fails. This gives us the opportunity to investigate what happened, update the config or permissions if needed, and re-execute the destroy to clean up any lingering resources.

Currently, the test command maintains state only in memory. I am curious what this means for destroy-time errors. Do the resources just remain lingering? Is there any way to inspect the failure and clean up?

Attempted Solutions

n/a

Proposal

Perhaps on error, write the state to a file?

References

No response

lorengordon avatar Sep 01 '23 19:09 lorengordon

Hi @lorengordon, thanks for the feedback.

Currently, terraform test will provide a set of diagnostics that describes any state left behind if the destroy action fails. I do like the idea of potentially writing out a file. I will investigate this, and see how practical it is.

Thanks again!

liamcervante avatar Sep 04 '23 06:09 liamcervante

I've considered this a bit further and I will say that we won't see any changes here for the launch of v1.6. The test command at the moment doesn't have the ability to recover or use state files, so printing out the state file won't really give additional information over the diagnostics that are already produced. It's also not straightforward how Terraform should handle state files from alternate modules and/or registry modules and this needs more thought. We'd want to do this properly instead of trying to rush something in place for v1.6.

I did quickly implement #33809 which means the test command will return a non-zero return code if the destroy operation fails, making it easier to detect as a fail case.

I will leave this issue open though, as we are investigating approaches to custom state management within test files and it may be that with fine-grained state management we can actually use any dumped state files to make the tidy up process easier. Potentially, we'll see this arriving in Terraform v1.7.

liamcervante avatar Sep 07 '23 09:09 liamcervante
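The non-zero return code described above can be checked from a wrapper script, for example in CI. The following is a minimal sketch, not part of Terraform: the helper names `run_and_check` and `needs_manual_cleanup` are hypothetical, and the marker string is an assumption based on the diagnostic text quoted elsewhere in this thread, so it may not match every Terraform version.

```python
import subprocess
import sys


def run_and_check(cmd):
    """Run a command and return (exit_code, combined stdout/stderr text)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.returncode, proc.stdout


def needs_manual_cleanup(exit_code, output):
    """Return True when a failed run also reports resources left in state.

    Assumption: the marker below is copied from the diagnostic quoted in
    this thread; it is not a documented, stable interface.
    """
    return exit_code != 0 and "need to be cleaned up manually" in output


if __name__ == "__main__":
    code, out = run_and_check(["terraform", "test"])
    print(out)
    if needs_manual_cleanup(code, out):
        print("terraform test left resources behind", file=sys.stderr)
    sys.exit(code)
```

A failing assertion also produces a non-zero exit code, so the wrapper checks the output text as well before flagging a cleanup problem.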

I suppose I was figuring I'd be able to use the state file locally, with just terraform destroy. But there are probably some details of how test works that I haven't considered.

lorengordon avatar Sep 07 '23 18:09 lorengordon

Yeah, I think that works for the very simple case. But it gets complicated with edge cases: the test command could be tearing down infrastructure that it has created based on other modules that are held in the registry, terraform test could be executing within a CI pipeline or in TFC, and there could already be state files in the local directory.

I really like this idea as a feature, but I'd want to do it properly and realistically that means giving it a full development cycle, and as a bonus that means we can make it work with the other state operations we've had as suggestions which just means it'll be even better when it does launch. All changes we make have to be backwards/forwards compatible and I wouldn't want to lock this out of future integrations because we just rushed and solved the simple case.

liamcervante avatar Sep 07 '23 19:09 liamcervante

Oh yes, I don't think it's that urgent for us. I honestly wasn't expecting an implementation of this request for the 1.6 release. Sorry if I wasn't clear on that. Really appreciate the work and attention you've put into the question. Thanks again!

lorengordon avatar Sep 07 '23 19:09 lorengordon

Is there any intention to work on this? When you have to manually destroy ~30 AWS resources, terraform test stops being easy to use.

albertorm95 avatar Jan 03 '24 12:01 albertorm95

Hi @albertorm95, terraform test should already be attempting to destroy the resources it creates - is there a particular reason that this isn't working for you?

I'd imagined what we implemented here would be about making a rare edge case easier to recover from rather than something that would be used as part of a regular test cycle. Ideally, Terraform would successfully clean up on every test execution rather than it failing and dumping the state file every time.

Maybe you could post your use case into the forums, and we could try and get the destroy operation to execute successfully?

liamcervante avatar Jan 03 '24 12:01 liamcervante

Hello! @liamcervante

What about a scenario where the test is executed on a remote system and, for any reason, the system fails before destroying or during the destroy process? Those resources will be left behind without a state that represents them.

If there were a state file for the test, we would be able to reference it and execute a manual destroy 🤔

albertorm95 avatar Jan 03 '24 12:01 albertorm95

Hi @albertorm95, I agree that there are valid use cases for this. It's a question of priority for us internally, so I was curious if you had a use case for why you need the state file regularly or if we could work to ensure that the destroy operation simply doesn't fail for you?

I'll try and explain why I view this as a lower priority than you perhaps do. As far as I'm aware, the state file would only be required in rare cases when the destroy operation fails - regardless of the environment that Terraform is executing in. If the destroy operation is failing regularly enough that the terraform test command becomes unusable, I'd rather focus on improving the reliability of the destroy operation. I don't think that expecting users to execute a manual terraform destroy regularly is a particularly valid solution to an unreliable clean up operation. It does make the required manual cleanup easier but ideally we'd avoid the manual cleanup altogether.

liamcervante avatar Jan 03 '24 13:01 liamcervante

I see, and I agree with you that this is a lower priority. This happened to me with an AWS error:

Terraform encountered an error destroying resources created while executing tests/test.tftest.hcl/apply_foo.
╷
│ Error: waiting for ECS Service (arn:aws:ecs:eu-central-1:1234567:service/foo-foo-test-fra/foo-master-test) delete: RequestError: send request failed
│ caused by: Post "https://ecs.eu-central-1.amazonaws.com/": read tcp IP:PORT->ANOTHERIP:ANOTHERPORT: read: can't assign requested address
│ 
│ 
╵

Terraform left the following resources in state after executing tests/test.tftest.hcl/apply_foo, and they need to be cleaned up manually:
  - aws_acm_certificate.foo
  - aws_cloudwatch_log_group.foo["centralized"]
  - aws_cloudwatch_log_group.foo["master"]
  - aws_cloudwatch_log_group.foo["remotestate"]
  - aws_cloudwatch_log_group.foo["simple"]
  - aws_ecs_cluster.foo
  - aws_ecs_service.foo["master"]
  - aws_ecs_task_definition.foo["centralized"]
  - aws_ecs_task_definition.foo["master"]
  - aws_ecs_task_definition.foo["remotestate"]
  - aws_ecs_task_definition.foo["simple"]
  - aws_iam_role.foo["centralized"]
  - aws_iam_role.foo["master"]
  - aws_iam_role.foo["remotestate"]
  - aws_iam_role.foo["simple"]
  - aws_lb_listener_rule.foo["centralized"]
  - aws_lb_listener_rule.foo["master"]
  - aws_lb_listener_rule.foo["remotestate"]
  - aws_lb_listener_rule.foo["simple"]
  - aws_lb_target_group.foo["centralized"]
  - aws_lb_target_group.foo["master"]
  - aws_lb_target_group.foo["remotestate"]
  - aws_lb_target_group.foo["simple"]
  - aws_secretsmanager_secret.foo["centralized"]
  - aws_secretsmanager_secret.foo["master"]
  - aws_secretsmanager_secret.foo["remotestate"]
  - aws_secretsmanager_secret.foo["simple"]
  - random_id.suffix
  - module.foo_kms_key.aws_kms_alias.this
  - module.foo_kms_key.aws_kms_key.this
  - module.foo_lb.aws_lb.lb
  - module.foo_lb.aws_lb_listener.https_forward
  - module.foo_lb.aws_lb_target_group.target_group
  - module.foo_lb_sg.aws_security_group.sg
  - module.foo_sg.aws_security_group.sg
tests/test.tftest.hcl... fail

What about appending the ID of each resource to this list, for example: module.foo_sg.aws_security_group.sg : ID

That would make it easier for the user to find and destroy the resource. WDYT?

albertorm95 avatar Jan 03 '24 13:01 albertorm95
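Until the output carries more detail, the leftover-resource list in the diagnostic above can at least be collected programmatically. The sketch below parses that text into a mapping from run name to resource addresses. The function name `leftover_resources` is hypothetical, and the patterns are assumptions derived only from the sample output in this comment; the human-readable output is not a stable interface and may differ between Terraform versions.

```python
import re

# Assumption: both patterns mirror the sample output quoted in this thread.
HEADER = re.compile(
    r"^Terraform left the following resources in state after executing "
    r"(?P<run>.+?), and they need to be cleaned up manually:$"
)
ITEM = re.compile(r"^\s+- (?P<address>\S.*)$")


def leftover_resources(output):
    """Map each run name to the resource addresses reported as left in state."""
    leftovers = {}
    current = None  # list being filled for the most recent header, if any
    for line in output.splitlines():
        if not line.strip():
            continue  # blank lines neither start nor end a list
        header = HEADER.match(line.strip())
        if header:
            current = leftovers.setdefault(header.group("run"), [])
            continue
        item = ITEM.match(line)
        if current is not None and item:
            current.append(item.group("address").strip())
        else:
            current = None  # first non-list line ends the current block
    return leftovers
```

The result contains only Terraform addresses such as module.foo_sg.aws_security_group.sg. It cannot recover provider-side identifiers, because the diagnostic does not include them.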

Thanks for the context @albertorm95!

A quick note on the ID idea, which we did experiment with. Unfortunately, the idea of an id attribute is not constant throughout the Terraform provider ecosystem. Some cloud providers do assign a unique global identifier to their resources, and then some Terraform providers do choose to expose that ID but it's not consistent. For example, the AWS provider uses the arn attribute as the global unique identifier for many resources and doesn't expose anything called id. Terraform core (where the test command operates) doesn't know which attribute a given provider is using as an id so it's not aware of what attribute it should provide. We have had talks internally about creating some kind of global id attribute that Terraform can use where appropriate, so maybe we can use that when/if it becomes available.

For your use case, would you mind posting into the forums? That way I can dig into it with you there without sending notifications to everyone subscribed to this GitHub issue. I think any discussion about the format of the output, or potential improvements, should stay in this ticket. But for specific use cases it's better to move into the forums.

Thanks!

liamcervante avatar Jan 03 '24 13:01 liamcervante

Hey @lorengordon (and others in this issue), the Terraform team is doing research into this problem, and I'd love to chat to learn more. Please reach out to me [email protected] and we can schedule a time to chat!

omarismail avatar Feb 20 '24 14:02 omarismail

👍 +1 for this issue. This is basically a blocker for us using terraform test. We've been integrating basic apply testing into our module release pipelines, but keep hitting issues where manual cleanup of the test resources is needed. In the meantime we will have to hand-craft the equivalent tests using the normal terraform lifecycle. We would very much like to be able to use remote state with terraform test.

tomhaynes avatar Apr 29 '24 16:04 tomhaynes

We have also had this problem, specifically around permissions in AWS. If the role being used isn't created right, the delete permissions might be missing, which causes the test to fail to clean up the resources. We try to limit the permissions our testing role has, and we've had a lot of problems when it's missing some of those permissions. Having the state saved to a remote backend (we use TFC) would be ideal and would allow us to manually queue the deletions without losing track of what resources were created. Having just the resource name and not the IDs in the output doesn't really help us much either; we still have to go hunt down what it actually created.

richardsonky avatar Apr 30 '24 12:04 richardsonky

Unless breakout automation handles the state file or does something with it, might it be a better idea to output the state resource IDs as well as the terraform resources?

That way we have the option of passing the IDs to an API-based cleanup function as well, or handling it manually should the situation call for it.

deitChi avatar May 16 '24 10:05 deitChi
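To make the ID idea concrete: if a state file were available after a failed destroy (which, per this thread, terraform test does not currently provide), a cleanup script could map each resource address to its id attribute. The sketch below is hypothetical. It assumes the version 4 state JSON layout (`resources`, `instances`, `index_key`, `attributes`), which is an internal format rather than a documented interface, and the function names `load_state` and `resource_ids` are made up for illustration.

```python
import json


def load_state(path):
    """Read a state file from disk (assumed to be version 4 state JSON)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def resource_ids(state):
    """Map managed resource instance addresses to their 'id', if present."""
    ids = {}
    for res in state.get("resources", []):
        if res.get("mode") != "managed":
            continue  # data sources create nothing that needs cleanup
        base = f'{res["type"]}.{res["name"]}'
        if res.get("module"):
            base = f'{res["module"]}.{base}'
        for inst in res.get("instances", []):
            key = inst.get("index_key")
            if key is None:
                address = base
            elif isinstance(key, str):
                address = f'{base}["{key}"]'  # for_each instance
            else:
                address = f"{base}[{key}]"  # count instance
            # Not every resource exposes "id"; None marks that case.
            ids[address] = (inst.get("attributes") or {}).get("id")
    return ids
```

As noted earlier in the thread, an attribute named id is not consistent across providers, so a None value here means the identifier would have to come from a provider-specific attribute instead.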