
cdktf: Large `terraform.<stack>.tfstate` file is cleaned when `destroy` is cancelled

Open MJohnson459 opened this issue 1 year ago • 4 comments

Expected Behavior

I am running cdktf destroy from a Python script using subprocess.run. The exact code is:

import subprocess
subprocess.run(["cdktf", "destroy", "--auto-approve"], check=True)

While it is destroying resources, pressing Ctrl+C should stop the destroy and leave the terraform.<stack>.tfstate file with an accurate record of the resources that still exist.

Actual Behavior

The terraform.<stack>.tfstate file is empty.

Steps to Reproduce

I did a fair bit of testing to narrow down exactly what is happening and reduce it to the smallest reproducible case. It seems the Python subprocess call is required for the state to be emptied permanently; however, when running cdktf destroy directly, I noticed that the state file is briefly emptied and then recreated. This seems to depend on the size of the state.

I suspect that when the subprocess is cancelled, the cdktf process is not given enough time to recreate the state file. Normally the cdktf process would be given about 10 seconds before the kill is escalated.
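For reference, CPython's subprocess.run() catches KeyboardInterrupt, waits only a fraction of a second for the child to exit, and then calls kill(), which sends SIGKILL on POSIX. That may explain why the Python case leaves the state permanently empty while running cdktf directly does not. Below is a minimal sketch of a possible workaround, assuming the script runs in an interactive terminal (so Ctrl+C already delivers SIGINT to cdktf and Terraform). run_with_grace is a hypothetical helper, not part of cdktf or the standard library, and the 30-second default grace period is an arbitrary assumption.

```python
import subprocess
from typing import Sequence


def run_with_grace(cmd: Sequence[str], grace_seconds: float = 30.0) -> int:
    """Run cmd; on Ctrl+C, wait for the child to exit instead of killing it.

    Hypothetical helper. subprocess.run() kills the child shortly after a
    KeyboardInterrupt; this gives the child up to grace_seconds to shut down
    cleanly (for example, for Terraform to finish writing its state file).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait()
    except KeyboardInterrupt:
        # In a terminal, the child is in the same foreground process group
        # and has already received SIGINT from Ctrl+C. Do not send another
        # one: Terraform treats a second interrupt as "exit immediately".
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()  # last resort, which is what subprocess.run() does
            proc.wait()
        raise  # let the caller know the run was interrupted


# Usage (mirrors check=True by raising on a non-zero exit code):
#   code = run_with_grace(["cdktf", "destroy", "--auto-approve"])
#   if code != 0:
#       raise subprocess.CalledProcessError(code, "cdktf destroy")
```

If the script is not attached to a terminal (for example, when run from a CI job), the child never sees the Ctrl+C, so you would need to send SIGINT to it once yourself before waiting.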

  1. Start an instance with a lot of resources, for example many files.
  2. Run cdktf destroy via Python:
     import subprocess
     subprocess.run(["cdktf", "destroy", "--auto-approve"], check=True)
  3. While the resources are being destroyed, press Ctrl+C.

I can reliably reproduce this with a ~5MB state file.

Versions

$ cdktf debug
language: python
cdktf-cli: 0.18.0
node: v18.12.0
cdktf: 0.18.0
constructs: 10.2.70
jsii: 1.89.0
terraform: 1.6.4
arch: x64
os: linux 5.15.0-88-generic
python: Python 3.9.6
pip: pip 21.1.3 from /home/michael/.pyenv/versions/3.9.6/envs/locus3.9/lib/python3.9/site-packages/pip (python 3.9)
pipenv: null

Providers

┌───────────────┬──────────────────┬─────────┬────────────┬─────────────────────────────┬─────────────────┐
│ Provider Name │ Provider Version │ CDKTF   │ Constraint │ Package Name                │ Package Version │
├───────────────┼──────────────────┼─────────┼────────────┼─────────────────────────────┼─────────────────┤
│ aws           │ 5.19.0           │ ^0.18.0 │            │ cdktf-cdktf-provider-aws    │ 17.0.8          │
├───────────────┼──────────────────┼─────────┼────────────┼─────────────────────────────┼─────────────────┤
│ random        │ 3.5.1            │ ^0.18.0 │            │ cdktf-cdktf-provider-random │ 9.0.0           │
├───────────────┼──────────────────┼─────────┼────────────┼─────────────────────────────┼─────────────────┤
│ tls           │ 4.0.4            │ ^0.18.0 │            │ cdktf-cdktf-provider-tls    │ 8.0.0           │
└───────────────┴──────────────────┴─────────┴────────────┴─────────────────────────────┴─────────────────┘

Gist

No response

Possible Solutions

It might be possible to write the new state to a temporary file and then atomically swap it into place (rename it over the existing file), instead of emptying the existing state file and rewriting it.
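The proposed fix is the usual write-to-temp-then-rename pattern. Terraform itself is written in Go, so the following is only a minimal Python sketch of the idea, assuming the state is a JSON-serializable dict; write_state_atomically is a hypothetical name. With this pattern, a reader (or a process killed mid-write) only ever sees either the complete old file or the complete new one.

```python
import json
import os
import tempfile
from typing import Any, Dict


def write_state_atomically(path: str, state: Dict[str, Any]) -> None:
    """Replace the state file at path without ever leaving it empty.

    Hypothetical helper illustrating write-to-temp-then-rename.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory (same filesystem), because
    # os.replace() cannot rename across filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".tfstate")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the new contents are on disk
        # Swap the new file into place; this rename is atomic on POSIX.
        os.replace(tmp_path, path)
    except BaseException:
        # Any failure (including Ctrl+C) leaves the original file untouched;
        # only the partially written temp file needs to be removed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A SIGKILL during the write can still leave a stray .tmp- file behind, but never an empty state file. Note that tempfile.mkstemp creates the file with mode 0600, so a real implementation may also want to copy the original file's permissions.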

Workarounds

No response

Anything Else?

No response

References

No response

Help Wanted

  • [ ] I'm interested in contributing a fix myself

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

MJohnson459, Nov 16 '23 17:11

Hey, could you try running this outside of Python, directly through the CDKTF CLI, and see if the problem persists? We use the AbortController API under the hood and forward the abort signal to the Terraform CLI. If you abort the CDKTF CLI directly, I would expect it to take a few seconds for Terraform CLI to process the abort. If the Python side kills the process immediately, or after a fixed grace period, I think a corrupted state file is a real possibility.

DanielMSchmidt, Nov 17 '23 12:11

Hey. Running cdktf directly doesn't permanently remove the state; however, the state file is still temporarily left in a corrupted state. Do you know why the state file is cleared at all during the process? I had a look through the code but couldn't find where that happens. Is the state file managed by Terraform directly or by the CDK?

Given that the state is not updated atomically, it seems likely that workflows other than this specific Python case could also corrupt the state. I'm happy to look into a fix if you can point me in the right direction.
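To make the non-atomic window concrete, here is a minimal Python sketch of a truncate-then-write update. It is a hypothetical helper, not Terraform's actual Go code, and whether Terraform's filesystem state manager follows exactly this sequence is an assumption. Opening an existing file for writing truncates it immediately, so if the process is interrupted before the new contents are written, an empty file is left behind.

```python
import json
from typing import Any, Callable, Dict, Optional


def write_state_in_place(
    path: str,
    state: Dict[str, Any],
    interrupt_point: Optional[Callable[[], None]] = None,
) -> None:
    """Non-atomic update: truncate the file, then write (hypothetical helper)."""
    # Opening with mode "w" truncates the existing file to 0 bytes right away.
    with open(path, "w", encoding="utf-8") as f:
        if interrupt_point is not None:
            interrupt_point()  # simulates an interrupt arriving in this window
        json.dump(state, f)
```

With a large state, the window between truncation and the end of the write is longer, which would be consistent with the issue being easier to reproduce with a ~5MB state file.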

MJohnson459, Nov 17 '23 13:11

The state file is managed by Terraform directly, so I think the cause is most likely somewhere in Terraform.

DanielMSchmidt, Nov 17 '23 13:11

It looks like there is already a TODO comment in Terraform for this issue: https://github.com/hashicorp/terraform/blob/1f9734619f953ecc7252d7b98a6d40d751b4ea1e/internal/states/statemgr/filesystem.go#L124C1-L129C18

MJohnson459, Nov 17 '23 13:11