SDK features to help with attributes that apply to many resource types
Terraform 0.11.13
Terraform does not store provider config in the state file or anywhere else, and this causes a whole host of problems, such as:
- Deleting part of the code base, along with its associated providers, thinking you are deleting the resources, leaves Terraform unable to operate: it can neither refresh nor delete the resources recorded in the state file that were created with the now-deleted provider, because it no longer has a provider definition in the current working directory. This forces a two-step removal process, and destroy is not clean. "Easy, don't delete the provider," I hear you say, but when a module defines a provider more explicitly than just an alias, and you take that module out of usage (an entirely legitimate thing to do), you get the same problem for the same reason (see the sketch after this list).
- Changing the provider on a resource, for example to move a resource from one region to another, is something Terraform cannot handle. Terraform never stored the original provider, so it assumes the resource should be where the new provider points; when it then can't find the resource via the new provider, it deletes it from the state file but NOT in actuality, resulting in a "successful" apply while the original resource still exists, potentially costing money without you realising it.
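As a minimal sketch of the first case (module, bucket, and path names here are all hypothetical), a child module can declare its own provider configuration:

```hcl
# modules/us-east/main.tf -- a child module with its own provider block
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket" # hypothetical bucket name
}
```

```hcl
# main.tf -- root module
module "us_east" {
  source = "./modules/us-east"
}
```

Remove the `module "us_east"` block and the provider configuration inside it disappears too, but the bucket is still recorded in the state file, so Terraform has no provider configuration left with which to refresh or destroy it.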
These would all be fixed by storing provider config in the state file and ensuring that delete operations use the already-stored provider config (from the state file), while create/modify operations use the current working-directory provider config (possibly with command-line flags that let you choose). Provider config doesn't usually contain any sensitive information, so I don't see a problem in doing it.
These are the two biggest issues I run into when refactoring Terraform.
@mildwonkey I understand why you've added the enhancement label, but I'd say this is a full-on bug: not deleting things but claiming it has (in the second case), and not being able to destroy things (in the first case). It makes Terraform pretty unusable whilst you spend hours reverting git history and reapplying different parts.
That's fair @gtmtech, I can change the labels!
I can also share some more references regarding your first example: there is a similar discussion of this issue in hashicorp/terraform#18994, and the module documentation does cover some of the complexities of provider configuration within modules.
Hi @gtmtech!
I just wanted to add some context here on why Terraform currently behaves the way it does, and ways it can be mitigated within the current featureset. That doesn't mean these constraints can't change, but in thinking through these changes we must keep the existing constraints/context in mind.
The first interesting thing to note here is that currently many providers have broadly two different classes of settings in their provider blocks: settings about what the provider is managing (regions, target hostnames, etc) and settings about who is running Terraform and where Terraform is being run (API credentials, locations of configuration files on local disk, etc).
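As a minimal sketch of that distinction (all values hypothetical), consider a typical AWS provider block:

```hcl
provider "aws" {
  # Settings about *what* is being managed: these describe the target
  # infrastructure and could in principle make sense to persist.
  region = "us-east-1"

  # Settings about *who* is running Terraform and *where* it is run:
  # credentials and local paths that belong to the operator, not to
  # the infrastructure being managed.
  profile                 = "deployer"            # hypothetical named profile
  shared_credentials_file = "/home/ci/.aws/creds" # hypothetical local path
}
```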
In order for storing the provider configuration to make sense, we'd need to differentiate these two categories because only the "what is being managed" settings make sense to be persisted. If the other settings were saved, then along with the concern of them potentially being sensitive we'd also create a situation where when Terraform destroys something it might act using the credentials of the person that created it, not the person who is running the destroy operation.
We recommend keeping all information about who is running Terraform and where Terraform is being run out of the configuration and in "ambient" external configuration instead, but as long as providers continue to have settings for this in their configuration we could not unilaterally start storing everything in the state without some mechanism to distinguish between these different classes of settings.
I made a similar observation a while back from the angle of storing sensitive information in the state. That still remains true, I think.
Another key distinction to think about is what I will call logical vs physical locations.
For example, in the AWS provider we have region, which is a "logical" location because it describes an intent rather than an implementation of that intent. However, the AWS provider also has endpoints which some folks use to select alternative locations for the physical AWS endpoints by giving exact URLs for them.
While the AWS provider endpoints are "well known" and unlikely to change, other providers tend to use more transient physical locations, like auto-generated IP addresses that might change in future.
If we were to save those, then you might find yourself in a situation where you can't actually delete something because the IP address that was saved is no longer current. The new IP address is logically the same service, but physically at a different location.
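To illustrate the two kinds of location side by side (a sketch, with a hypothetical endpoint URL):

```hcl
provider "aws" {
  # Logical location: describes an intent, stable over time.
  region = "us-east-1"

  # Physical location: an exact URL, e.g. pointing at an S3-compatible
  # service whose address can change independently of the intent.
  endpoints {
    s3 = "http://10.1.2.3:9000" # hypothetical transient endpoint
  }
}
```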
With those two distinctions in mind, it seems like the main thing we'd potentially want to cache in the state is a logical location, like the AWS region.
The AWS provider uses its region argument as a global setting that is assumed to apply to all resources managed by that provider, which as you've seen is problematic if you want to change that region later: suddenly Terraform thinks everything has vanished.
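A minimal sketch of that failure mode (hypothetical AMI ID):

```hcl
provider "aws" {
  region = "eu-west-1" # changed from "us-east-1", where the instance was created
}

resource "aws_instance" "example" {
  ami           = "ami-0abcd1234example" # hypothetical AMI ID
  instance_type = "t2.micro"
}
```

On the next refresh the provider looks for the instance in eu-west-1, fails to find it, and removes it from the state, while the real instance keeps running (and billing) in us-east-1.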
The Google Cloud Platform provider, which had the advantage of being developed after we learned the drawbacks of that AWS provider decision, has a different design that avoids the problem without requiring any special support from Terraform Core: it tracks the region on a per-resource basis.
The region argument at the provider level is documented as a default region, used at create time for any new resource that doesn't explicitly have its own region argument set. However, once that object is created, its region is tracked within its own state (at the resource object level, not at the provider level) and so a subsequent refresh operation has enough information to know where to look to find that resource, even if the provider-level default region has changed.
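For example (a sketch using hypothetical project and resource names; any regional Google resource would behave the same way):

```hcl
provider "google" {
  project = "my-project-id" # hypothetical project ID
  region  = "us-central1"   # default region for new resources
}

# No region set: created in us-central1, and that region is then recorded
# in this resource's own state rather than re-read from the provider.
resource "google_compute_address" "primary" {
  name = "primary-address"
}

# Explicit region: overrides the provider-level default.
resource "google_compute_address" "secondary" {
  name   = "secondary-address"
  region = "europe-west1"
}
```

If the provider-level default later changes to europe-west1, `primary` is still refreshed in us-central1, because its region was captured in its own state at create time.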
The provider is empowered to manage the region argument on each resource type as it sees fit. I believe in the case of the Google provider the region for an object is "sticky" after it is created and won't change unless explicitly overridden at the resource configuration level, but the provider could alternatively choose to detect when the provider-level default region changes and trigger a "requires replacement" change on all of the defaulted objects to reflect that the region has changed.
I think the best answer to this problem is not for Terraform Core to unilaterally store all of the provider settings in the state, or to implement new features for providers to selectively identify settings that should be stored in the state, but rather to have providers follow the model of how the Google provider manages regions for any logical location in their configurations.
It's important not to do this for physical locations (which can change independently of the logical location) or for non-location settings, but the provider already "knows" the purpose of each of its settings and so it can make that determination itself.
What I described above is already possible to do from Terraform Core's perspective, and indeed the Google Cloud Platform provider is already doing it. The AWS provider could potentially do so too, but it would require a lot of cross-cutting work today because the provider SDK does not assist with implementing the pattern and thus it must be either hand-written or (as is the case for the Google provider, I believe) handled by code generation.
For that reason I'm going to relabel this as a provider-sdk enhancement, with the intent of using it to represent a mechanism for providers to more easily implement the pattern the Google Cloud Platform provider uses to track logical location on a per-resource-instance basis.
I do see the argument for considering the current behavior of the AWS provider as a bug, since it does silently lose track of remote objects when the region is changed, but this issue is a Terraform SDK issue rather than an AWS provider issue: it's about a new SDK feature to make it easier to fix that AWS provider "bug" (really: a design limitation). If having the underlying problem tracked as a bug feels important, I'm sure the AWS provider team wouldn't mind having an issue over in the AWS provider repository that describes the usability problem from the AWS provider's perspective and links over here as a potential solution.
Having this labelled against the provider-sdk will make it visible for consideration for the SDK improvement project that is about to begin.