`terraform test`: dump the complete state file when a destroy operation fails.
Terraform Version
terraform 1.6
Use Cases
I'm interested in using the `test` command in the upcoming Terraform 1.6 release. We currently have a similar workflow using the `tftest` pytest library. One thing we've found useful is that `tftest` leaves the state behind when the destroy action fails. This gives us the opportunity to investigate what happened, update the config or permissions if needed, and re-execute the destroy to clean up any lingering resources.
Currently, the `test` command maintains state only in memory. I am curious what this means for destroy-time errors. Do the resources just remain lingering? Is there any way to inspect the failure and clean up?
Attempted Solutions
n/a
Proposal
Perhaps on error, write the state to a file?
References
No response
Hi @lorengordon, thanks for the feedback.
Currently, `terraform test` will provide a set of diagnostics that describes any state left behind if the destroy action fails. I do like the idea of potentially writing out a file. I will investigate this, and see how practical it is.
Thanks again!
I've considered this a bit further and I will say that we won't see any changes here for the launch of v1.6. The test command at the moment doesn't have the ability to recover or use state files, so printing out the state file won't really give additional information over the diagnostics that are already produced. It's also not straightforward how Terraform should handle state files from alternate modules and/or registry modules, and this needs more thought. We'd want to do this properly instead of trying to rush something in place for v1.6.
I did quickly implement #33809 which means the test command will return a non-zero return code if the destroy operation fails, making it easier to detect as a fail case.
I will leave this issue open though, as we are investigating approaches to custom state management within test files, and it may be that with fine-grained state management we can actually use any dumped state files to make the tidy-up process easier. Potentially, we'll see this arriving in Terraform v1.7.
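The non-zero return code mentioned above can be picked up by a pipeline step. A minimal sketch, assuming a hypothetical wrapper named `run_terraform_test` (not part of Terraform); it only surfaces the exit code, and the exit code alone does not say whether an assertion or the destroy step failed:

```python
import subprocess
import sys


def run_terraform_test(command=("terraform", "test")) -> int:
    """Run the test command, echo its output, and return its exit code."""
    result = subprocess.run(list(command), capture_output=True, text=True)
    print(result.stdout, end="")
    if result.returncode != 0:
        # A non-zero code covers failed assertions as well as a failed destroy;
        # the diagnostics in the output are what list any leftover resources.
        print(
            "terraform test exited non-zero; check the output for resources left in state",
            file=sys.stderr,
        )
    return result.returncode
```

A CI job could call `sys.exit(run_terraform_test())` so that the job itself fails and the leftover resources get looked at.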
I suppose I was figuring I'd be able to use the state file locally, with just `terraform destroy`. But there are probably some details of how `test` works that I haven't considered.
Yeah, I think that works for the very simple case. But it gets complicated with edge cases: the test command could be tearing down infrastructure that it created based on other modules held in the registry, `terraform test` could be executing within a CI pipeline or in TFC, or there could already be state files in the local directory.
I really like this idea as a feature, but I'd want to do it properly, and realistically that means giving it a full development cycle. As a bonus, that means we can make it work with the other state operations we've had as suggestions, so it'll be even better when it does launch. All changes we make have to be backwards/forwards compatible, and I wouldn't want to lock this out of future integrations because we rushed and solved only the simple case.
Oh yes, I don't think it's that urgent for us. I honestly wasn't expecting an implementation of this request for the 1.6 release. Sorry if I wasn't clear on that. Really appreciate the work and attention you've put into the question. Thanks again!
Is there any intention to address this? When you have to manually destroy ~30 AWS resources, `terraform test` stops being easy to use.
Hi @albertorm95, `terraform test` should already be attempting to destroy the resources it creates - is there a particular reason that this isn't working for you?
I'd imagined what we implemented here would be about making a rare edge case easier to recover from rather than something that would be used as part of a regular test cycle. Ideally, Terraform would successfully clean up on every test execution rather than it failing and dumping the state file every time.
Maybe you could post your use case into the forums, and we could try and get the destroy operation to execute successfully?
Hello! @liamcervante
What about a scenario where the test is executed on a remote system and, for whatever reason, the system fails before or during the destroy? Those resources will still exist, without any state that represents them.
If there were a state file for the test, we would be able to reference it and execute a manual destroy 🤔
Hi @albertorm95, I agree that there are valid use cases for this. It's a question of priority for us internally, so I was curious if you had a use case for why you need the state file regularly or if we could work to ensure that the destroy operation simply doesn't fail for you?
I'll try and explain why I view this as a lower priority than you perhaps do. As far as I'm aware, the state file would only be required in rare cases when the destroy operation fails - regardless of the environment that Terraform is executing in. If the destroy operation is failing regularly enough that the `terraform test` command becomes unusable, I'd rather focus on improving the reliability of the destroy operation. I don't think that expecting users to execute a manual `terraform destroy` regularly is a particularly valid solution to an unreliable cleanup operation. It does make the required manual cleanup easier, but ideally we'd avoid the manual cleanup altogether.
I see, and I agree with you that this is a lower priority. This happened to me with an AWS error:
Terraform encountered an error destroying resources created while executing tests/test.tftest.hcl/apply_foo.
╷
│ Error: waiting for ECS Service (arn:aws:ecs:eu-central-1:1234567:service/foo-foo-test-fra/foo-master-test) delete: RequestError: send request failed
│ caused by: Post "https://ecs.eu-central-1.amazonaws.com/": read tcp IP:PORT->ANOTHERIP:ANOTHERPORT: read: can't assign requested address
│
│
╵
Terraform left the following resources in state after executing tests/test.tftest.hcl/apply_foo, and they need to be cleaned up manually:
- aws_acm_certificate.foo
- aws_cloudwatch_log_group.foo["centralized"]
- aws_cloudwatch_log_group.foo["master"]
- aws_cloudwatch_log_group.foo["remotestate"]
- aws_cloudwatch_log_group.foo["simple"]
- aws_ecs_cluster.foo
- aws_ecs_service.foo["master"]
- aws_ecs_task_definition.foo["centralized"]
- aws_ecs_task_definition.foo["master"]
- aws_ecs_task_definition.foo["remotestate"]
- aws_ecs_task_definition.foo["simple"]
- aws_iam_role.foo["centralized"]
- aws_iam_role.foo["master"]
- aws_iam_role.foo["remotestate"]
- aws_iam_role.foo["simple"]
- aws_lb_listener_rule.foo["centralized"]
- aws_lb_listener_rule.foo["master"]
- aws_lb_listener_rule.foo["remotestate"]
- aws_lb_listener_rule.foo["simple"]
- aws_lb_target_group.foo["centralized"]
- aws_lb_target_group.foo["master"]
- aws_lb_target_group.foo["remotestate"]
- aws_lb_target_group.foo["simple"]
- aws_secretsmanager_secret.foo["centralized"]
- aws_secretsmanager_secret.foo["master"]
- aws_secretsmanager_secret.foo["remotestate"]
- aws_secretsmanager_secret.foo["simple"]
- random_id.suffix
- module.foo_kms_key.aws_kms_alias.this
- module.foo_kms_key.aws_kms_key.this
- module.foo_lb.aws_lb.lb
- module.foo_lb.aws_lb_listener.https_forward
- module.foo_lb.aws_lb_target_group.target_group
- module.foo_lb_sg.aws_security_group.sg
- module.foo_sg.aws_security_group.sg
tests/test.tftest.hcl... fail
What about appending the ID of each resource to this list, e.g. `module.foo_sg.aws_security_group.sg : ID`?
That would make it easier for the user to find and destroy the resource, WDYT?
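Until something like that exists, the list can at least be pulled out of the output mechanically. A rough sketch, assuming the message wording stays exactly as in the output quoted above (it may differ between Terraform versions); `leftover_resources` is a hypothetical helper, and it yields resource addresses only, not provider IDs:

```python
import re

# Header line printed before the leftover resources, copied from the output above.
HEADER = re.compile(
    r"^Terraform left the following resources in state after executing (?P<run>.+?), "
    r"and they need to be cleaned up manually:$"
)


def leftover_resources(output: str) -> dict[str, list[str]]:
    """Map each test run to the resource addresses reported as left behind."""
    leftovers: dict[str, list[str]] = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = match.group("run")
            leftovers.setdefault(current, [])
        elif current is not None and line.startswith("- "):
            leftovers[current].append(line[2:])
        else:
            current = None  # any other line ends the current list
    return leftovers
```

The result could serve as a checklist for manual cleanup; the actual IDs still have to be looked up in the provider's console or API.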
Thanks for the context @albertorm95!
A quick note on the ID idea, which we did experiment with. Unfortunately, the idea of an `id` attribute is not constant throughout the Terraform provider ecosystem. Some cloud providers do assign a unique global identifier to their resources, and then some Terraform providers do choose to expose that ID, but it's not consistent. For example, the AWS provider uses the `arn` attribute as the global unique identifier for many resources and doesn't expose anything called `id`. Terraform core (where the test command operates) doesn't know which attribute a given provider is using as an `id`, so it's not aware of what attribute it should provide. We have had talks internally about creating some kind of global `id` attribute that Terraform can use where appropriate, so maybe we can use that when/if it becomes available.
For your use case, would you mind posting into the forums? That way I can dig into it with you there without sending notifications to everyone subscribed to this GitHub issue. I think any discussion about the format of the output, or potential improvements, should stay in this ticket. But for specific use cases it's better to move into the forums.
Thanks!
Hey @lorengordon (and others in this issue), the Terraform team is doing research into this problem, and I'd love to chat to learn more. Please reach out to me [email protected] and we can schedule a time to chat!
👍 +1 for this issue - this is basically a blocker for us using `terraform test`. We've been integrating basic apply testing into our module release pipelines, but keep hitting issues where manual cleanup of the test resources is needed. In the meantime we will have to handcraft the equivalent tests using the normal Terraform lifecycle. We would very much like to be able to use remote state with `terraform test`.
We have also had this problem, specifically around permissions in AWS. If the role being used isn't created right, the delete permissions might be missing, which causes the test to fail to clean up the resources. We try to limit the permissions our testing role has, and we've had a lot of problems when it's missing some of them. Having the state saved to a remote backend (we use TFC) would be ideal and would allow us to manually queue the deletions without losing track of what resources were created. Having just the resource name and not the IDs in the output doesn't really help us much either; we still have to go hunt down what it actually created.
Unless breakout automation handles the state file or does something with it - might it be a better idea to output the state resource IDs as well as the terraform resources?
That way we have the option of passing the ids to an API based cleanup function as well, or handle it manually should the situation call for it
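As a sketch of what that could look like if a state file were ever written out: the snippet below assumes the dump would use the same JSON layout as a regular version 4 `terraform.tfstate` (a `resources` list with `mode`, `type`, `name`, optional `module`, and `instances` carrying `index_key` and `attributes`). `terraform test` does not produce such a file today, and `resource_ids` is a hypothetical helper. Because providers do not agree on one identifying attribute, it tries `id` and then `arn`:

```python
import json


def resource_ids(state_json: str) -> list[tuple[str, str]]:
    """Return (address, identifier) pairs for managed resources in a state document."""
    state = json.loads(state_json)
    pairs = []
    for res in state.get("resources", []):
        if res.get("mode") != "managed":
            continue  # data sources create nothing that needs cleanup
        base = f'{res["type"]}.{res["name"]}'
        if "module" in res:
            base = f'{res["module"]}.{base}'
        for inst in res.get("instances", []):
            address = base
            key = inst.get("index_key")
            if key is not None:
                # for_each keys are strings, count indexes are integers
                address += f'["{key}"]' if isinstance(key, str) else f"[{key}]"
            attrs = inst.get("attributes", {})
            ident = attrs.get("id") or attrs.get("arn") or "<unknown>"
            pairs.append((address, ident))
    return pairs
```

The pairs could then be handed to an API-based cleanup function or worked through by hand.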