terraform-provider-coder `coder_metadata` gets incorrectly assigned to resources

Currently, the Terraform provisioner relies on the output of terraform graph to determine what resource coder_metadata instances should be applied to when building the template or workspace.

Unfortunately, it's very reasonable for a user to write a template where the output of terraform graph does not meet what's required to correctly map the metadata to the resource, such as when using the contents of another resource to populate a field. This results in the UI showing metadata intended for one resource on another.

Here's a minimal example of this using null_resources:

resource "coder_agent" "main" {
  os   = "linux"
  arch = "amd64"
}

// Agent resource
resource "null_resource" "about" {
  depends_on = [
    coder_agent.main,
  ]
  triggers = {
    name = "Null About"
  }
}

// Some secondary resource, such as a volume, pointing to the agent resource
resource "null_resource" "other" {
  triggers = {
    about_name = null_resource.about.triggers.name
  }
}

// Metadata pointing to the agent resource and the volume
resource "coder_metadata" "about_info" {
  resource_id = null_resource.about.id
  item {
    key = "other-reference"
    value = null_resource.other.triggers.about_name
  }
}

The graph output of this template is then:

Since null_resource.other is the closest resource to coder_metadata.about_info, the metadata gets assigned to it, instead of null_resource.about, as was indicated in the template.

The same graph and graph traversal algorithm is used to determine which resource the agent exists within, so it's possible (but not proven) that this same issue effects assigning the agent to a resource.

Regardless, this should be a purely cosmetic bug, and shouldn't have any bearing on anything outside of how workspace metadata is shown in the UI.

Aug 15 '24 14:08 ethanndickson

coder/coder#6517 could be related.

Aug 15 '24 16:08 matifali

https://github.com/coder/terraform-provider-coder/issues/133 also looks related.

Aug 20 '24 07:08 matifali

unassigning myself for now in case @Emyrk can pick it up, but I might circle back if he doesn't

Jan 07 '25 18:01 aslilac

One thing to keep in mind when solving this is that a Terraform resource ID isn't a protected attribute or anything special. There's nothing stopping popular providers from using the same id attribute for multiple resources. The AWS provider shares an ID between ec2_instance_state and instance, as seen in the example for the first. I wouldn't be surprised if we can't handle that edge case :(

Jan 08 '25 03:01 ethanndickson

I was unable to get to this, but pretty sure the code is located here in coderd: https://github.com/coder/coder/blob/389768b48f8d0ab9cbd28bc66f7bae308011062d/provisioner/terraform/resources.go#L563-L578

And should be fixed there.

Jan 21 '25 18:01 Emyrk

I've been looking into this issue and I have a pretty good idea of what's going on and what's going wrong.

As I see it, we the following problems:

The graph generated by terraform is lossy, we lose information about the association expressed in the terraform (resource id -> specific resource)
The graph from terraform plan has less details than that from terraform apply
IDs are not guaranteed to be unique across resources

For 1 and 2, we can utilize tfplan and tfstate JSON outputs, however, for 3 I have yet to figure out a 100% solution.

Proposal: For plan we use the tfplan "configuration" field to figure out which resource is being referenced and for apply we simply use the resource_id -> id association.

Does this seem like a sufficient solution for the problem? See below for more details on what data is available to us.

Thankfully, the tfplan actually contains everything we need (see resource_id, JSON has been trimmed):

{
  "configuration": {
    "root_module": {
      "resources": [
        {
          "address": "coder_metadata.about_info",
          "mode": "managed",
          "type": "coder_metadata",
          "name": "about_info",
          "provider_config_key": "coder",
          "expressions": {
            "item": [
              {
                "key": {
                  "constant_value": "other-reference"
                },
                "value": {
                  "references": [
                    "null_resource.other.triggers.about_name",
                    "null_resource.other.triggers",
                    "null_resource.other"
                  ]
                }
              }
            ],
            "resource_id": {
              "references": [
                "null_resource.about.id",
                "null_resource.about"
              ]
            }
          },
          "schema_version": 0
        }
      ]
    }
  }
}

As can be ween in this JSON output, the exact resource and field is being referenced in the configuration. We've yet to utilize this part of the tfplan, which seems like a shame, it could probably also be used to make a better resource<->agent association.

However, if we assume that the uniqueness of ID is unreliable, the tfstate actually ends up being more lossy for our use-case as it references the potentially non-unique ID instead of the unique resource name. (JSON has been trimmed.)

{
  "values": {
    "root_module": {
      "resources": [
        {
          "address": "coder_metadata.about_info",
          "mode": "managed",
          "type": "coder_metadata",
          "name": "about_info",
          "provider_name": "registry.terraform.io/coder/coder",
          "schema_version": 0,
          "values": {
            "daily_cost": null,
            "hide": null,
            "icon": null,
            "id": "f0c9ff7b-48fa-425a-ac82-8bc4759802db",
            "item": [
              {
                "is_null": false,
                "key": "other-reference",
                "sensitive": false,
                "value": "Null About"
              }
            ],
            "resource_id": "8302293550006744338"
          },
          "sensitive_values": {
            "item": [
              {}
            ]
          },
          "depends_on": [
            "coder_agent.main",
            "null_resource.about",
            "null_resource.other"
          ]
        },
        {
          "address": "null_resource.about",
          "mode": "managed",
          "type": "null_resource",
          "name": "about",
          "provider_name": "registry.terraform.io/hashicorp/null",
          "schema_version": 0,
          "values": {
            "id": "8302293550006744338",
            "triggers": {
              "name": "Null About"
            }
          },
          "sensitive_values": {
            "triggers": {}
          },
          "depends_on": [
            "coder_agent.main"
          ]
        }
      ]
    }
  }
}

AFAICT, the best we can do here is rely on resource ID (without passing on information from plan). My guess is that we could further reduce the number of possible resources by checking that the resource with said ID exists in depends_on, but this would need to be verified, and it's still a half-measure.

Feb 03 '25 17:02 mafredri

I just had a good chat with @mafredri

In summary, we believe using terraform graph might not be the best method of extracting the relationship for coder_metadata resources. @mafredri explains some of the lossy behavior above.

To recap in my own words:

terraform graph does not retain the underlying hcl expressions. This might not be an issue, however it is lossy. Especially if there are cases of nested references (A -> B -> C)
Terraform graph is relating nodes based on the references contained in the item blocks. The only argument of importance is resource_id. In the example above, the item block reference is closer than the resource_id reference.

In the long term (2mo+), I proposed to leverage the terraform evaluation engine that is under works to determine the relationships instead.

We can do this by looking at the static hcl expression for the resource_id argument, meaning we can likely do this entirely before a terraform plan/apply. There is still work to verify this proposal, and come up with the tricky edge cases with the expected output.

@mafredri given my proposal above, do you think you have a cheap fix that can solve some of the more obvious errors?

Feb 04 '25 15:02 Emyrk

@Emyrk thanks for the great chat and writing the summary!

While I'm a fan of consolidating our Terraform logic in one place, and would like to see us use the static analysis your working on @Emyrk, we can fairly simply fix this issue by utilizing the configuration/resource ID of tf plan/apply and probably cover 99.9% of cases.

I'd suggest we fall-back to the current graph-based matching in the case of duplicate resource IDs, and pick the closest match of those IDs.

@bpmct I'm guessing a 2mo+ for a fix is not ideal here, but I'll defer to you whether or not we should fix this now or wait.

Feb 05 '25 13:02 mafredri

we can fairly simply fix this issue by utilizing the configuration/resource ID of tf plan/apply and probably cover 99.9% of cases

Let's just do this.

I'm guessing a 2mo+ for a fix is not ideal here

It wouldn't take 2 months to implement something special, right? It would be 2 months until we have static analysis (for other purposes), then we can repurpose to do this assignment slightly better?

I'm in favor of the quick fix now and then evaluating using better methods once we have them.

Feb 05 '25 16:02 bpmct