
Provisioner: Destruction of infrastructure should only be possible through a destroy operation

Open alexbrand opened this issue 6 years ago • 5 comments

As we add provisioning capabilities to KET, we need to be careful about any destruction operations that Terraform might decide to perform.

We need a way to prevent Terraform from performing any destructive operations when creating or modifying existing infrastructure. The only acceptable time to destroy infrastructure is when the user has explicitly requested the destruction of their cluster.

This imposes a limitation on scaling, because we won't be able to reduce the number of nodes that make up the cluster. With that said, this is an acceptable limitation for the initial release of this feature.

alexbrand avatar Dec 14 '17 17:12 alexbrand

https://github.com/hashicorp/terraform/issues/3116 prevents us from driving destruction prevention through a Terraform variable, because lifecycle attributes such as prevent_destroy do not support interpolation. This has ugly implications: it means we need to do one of several things in order to dynamically prevent destruction of resources.

Options moving forward:

  • Terraform accepts configuration written as JSON as well as HCL, meaning we can convert all of our existing Terraform HCL to JSON and it should behave identically. We can even do the conversion automatically (thanks https://github.com/kvz/json2hcl )! Once the configuration exists as JSON, KET can unmarshal it, toggle the prevent_destroy field as needed, and marshal it back out (see the sketch after this list).
    • Pros: It seems to be the most maintainable and least hack-y way to work around this limitation.
    • Cons: It's still a hack. We'd have to copy all of the JSON over for each cluster, since the resource definitions would now carry per-cluster state. I don't foresee our users spinning up thousands of concurrent clusters, but if they do, this becomes a disk-space concern.
  • Do basically the same as the previous option, but keep a single copy of the configuration and enforce serialization of state reads/writes via a mutex.
    • Pros: Also very maintainable. We now don't have to copy everything over, but...
    • Cons: ... the lock on read/write will slow things down in the multi-user/server use cases - potentially by a lot.
  • Drop the plan output. This will force Terraform to prompt the user to confirm their desired changes, including creation or destruction.
    • Pros: This fits nicely into the CLI workflow.
    • Cons: This isn't really feasible in server mode.
  • Nothing. We leave it exactly how it is now, and drop a fat warning in the release notes.
    • Pros: It already works like this.
    • Cons: Our users might shoot themselves in the foot, leaving us indirectly responsible for nuking production.

These are ordered roughly by how valuable I think each solution is, but feel free to post any other ideas/thoughts/suggestions.
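
For the first option, here is a minimal sketch of what the JSON round-trip could look like in Go. It assumes the HCL has already been converted with json2hcl into a standard Terraform JSON configuration; the cluster.tf.json file name and the flat resource layout are hypothetical, and nested/array resource forms would need extra handling:

```go
package main

import (
	"encoding/json"
	"io/ioutil"
	"log"
)

// setPreventDestroy walks a Terraform JSON configuration of the shape
// {"resource": {"<type>": {"<name>": {...}}}} and toggles
// lifecycle.prevent_destroy on every resource it finds.
func setPreventDestroy(cfg map[string]interface{}, prevent bool) {
	resources, ok := cfg["resource"].(map[string]interface{})
	if !ok {
		return
	}
	for _, byType := range resources {
		named, ok := byType.(map[string]interface{})
		if !ok {
			continue
		}
		for _, res := range named {
			body, ok := res.(map[string]interface{})
			if !ok {
				continue
			}
			lifecycle, ok := body["lifecycle"].(map[string]interface{})
			if !ok {
				lifecycle = map[string]interface{}{}
				body["lifecycle"] = lifecycle
			}
			lifecycle["prevent_destroy"] = prevent
		}
	}
}

func main() {
	const path = "cluster.tf.json" // hypothetical output of json2hcl

	raw, err := ioutil.ReadFile(path)
	if err != nil {
		log.Fatal(err)
	}
	var cfg map[string]interface{}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		log.Fatal(err)
	}

	// Protect every resource while creating or modifying infrastructure;
	// a destroy operation would call this with false before applying.
	setPreventDestroy(cfg, true)

	out, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	if err := ioutil.WriteFile(path, out, 0644); err != nil {
		log.Fatal(err)
	}
}
```

Once the configuration is plain JSON, toggling prevent_destroy is an ordinary unmarshal/modify/marshal cycle, with no Terraform interpolation involved.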

based64god avatar Dec 28 '17 16:12 based64god

Another potential option is to inspect the output of the terraform plan command and determine whether Terraform is planning to destroy any infrastructure. I dislike parsing arbitrary output, as we can't get JSON out of Terraform, but it might be a better solution than expecting every single resource defined in Terraform to be written as JSON and having to read/modify/write the config out to disk.
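
To make the trade-off concrete, here is a rough sketch of that parsing, assuming we shell out to terraform plan and scrape the human-readable summary line ("Plan: X to add, Y to change, Z to destroy."). That line is exactly the fragile part, since the plan output is not a stable interface, and the directory used below is hypothetical:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"regexp"
	"strconv"
)

// planSummary matches Terraform's human-readable summary line, e.g.
// "Plan: 1 to add, 0 to change, 2 to destroy." Parsing it is inherently
// fragile: the wording is not a stable interface between Terraform versions.
var planSummary = regexp.MustCompile(`Plan: (\d+) to add, (\d+) to change, (\d+) to destroy`)

// plannedDestroys runs `terraform plan` in dir and reports how many
// resources the plan would destroy, based on the summary line.
func plannedDestroys(dir string) (int, error) {
	cmd := exec.Command("terraform", "plan", "-input=false", "-no-color")
	cmd.Dir = dir
	out, err := cmd.CombinedOutput()
	if err != nil {
		return 0, fmt.Errorf("terraform plan failed: %v\n%s", err, out)
	}
	m := planSummary.FindSubmatch(out)
	if m == nil {
		// No summary line normally means Terraform reported no changes.
		return 0, nil
	}
	return strconv.Atoi(string(m[3]))
}

func main() {
	// The directory is hypothetical; KET would point this at the cluster's
	// generated Terraform configuration.
	n, err := plannedDestroys("./terraform/clusters/example")
	if err != nil {
		log.Fatal(err)
	}
	if n > 0 {
		log.Fatalf("refusing to continue: plan would destroy %d resource(s)", n)
	}
	fmt.Println("plan contains no destroy operations")
}
```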

alexbrand avatar Jan 02 '18 16:01 alexbrand

I am now wondering if we could use terraform as a library to ingest the plan, and determine if there are any resources marked for destruction.
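
A sketch of that idea, under the assumption that we vendor Terraform 0.11.x and use its internal terraform package to read a plan file written by terraform plan -out; these are internal, unversioned APIs, so the exact types and method names here are assumptions that could break on any Terraform upgrade:

```go
package main

import (
	"fmt"
	"log"
	"os"

	// Internal Terraform package (vendored); not a stable, supported API.
	"github.com/hashicorp/terraform/terraform"
)

// destroysInPlan reads a plan file written by `terraform plan -out=<path>`
// and returns the addresses of resources the plan would destroy.
func destroysInPlan(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	plan, err := terraform.ReadPlan(f)
	if err != nil {
		return nil, err
	}

	var doomed []string
	if plan.Diff == nil {
		return doomed, nil
	}
	for _, mod := range plan.Diff.Modules {
		for addr, diff := range mod.Resources {
			if diff.GetDestroy() {
				doomed = append(doomed, addr)
			}
		}
	}
	return doomed, nil
}

func main() {
	doomed, err := destroysInPlan("cluster.plan") // hypothetical plan file
	if err != nil {
		log.Fatal(err)
	}
	if len(doomed) > 0 {
		log.Fatalf("plan would destroy resources: %v", doomed)
	}
	fmt.Println("plan contains no destroy operations")
}
```

Note that this leans on Terraform internals rather than a public API, which is exactly the stability concern raised in the next comment.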

alexbrand avatar Jan 02 '18 16:01 alexbrand

The only thing that worries me about that is the potential for API changes under the hood that might just set everything on fire - and if/how we'd want to refactor the existing work to do that.

based64god avatar Jan 02 '18 16:01 based64god

The fix that was merged in for this is a workaround until terraform allows interpolation in lifecycle attributes. Marking this as needing an upstream fix, and leaving this open.

based64god avatar Jan 19 '18 14:01 based64god