terraform icon indicating copy to clipboard operation
terraform copied to clipboard

Resource state not refreshed before destroy

Open jooola opened this issue 1 year ago • 2 comments

We maintain a terraform provider that is failing to delete a "primary IP" resource in certain circumstances.

Before we can delete a primary IP resource, we have to unassign it from a server instance in our API.

During our tests of the delete call, we noticed that the state of the primary IP resource references an invalid server instance. This led to try to unassign the primary IP from an instance that was not assigned. Which made the primary IP resource deletion fail.

That bug is only happening in terraform >=1.9, so we ran a git bisect from the last known terraform version to work (1.8.5). We found the problematic commit to be 460c7f3933115c3edf670caacd2ffa489ef4eeb8 https://github.com/hashicorp/terraform/pull/35467

Reverting that commit on top of the v1.9 branch fixes our issue.

Terraform Version

Terraform v1.9.4
on linux_amd64

Terraform Configuration Files

On https://github.com/hetznercloud/terraform-provider-hcloud/tree/tf-1.9-primary-ip-delete

Debug Output

TF_ACC=1 go test ./internal/server -run TestServerResource_PrimaryIPTests -v -timeout=30m -parallel=8
=== RUN   TestServerResource_PrimaryIPTests
    resource_test.go:908: 
        
        HCL:
            1: resource "hcloud_ssh_key" "server" {
            2:   name        = "server--665592047586110587"
            3:   public_key  = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQCs9urdjWxJqCVELlPhuQBrCsTJ0XdF3j2+VETM59knH2crDtaAK2omx1cX5MFwjHKEea9fZEel/w0Vj4NhfgsGMe7JSr4Cj5bubcVGI7rJ12Ohl3QFLOZ6azwd13gT6K6o2g3OZtGQagxso4u4BOp9KyLy6wcxO4DVDhyt2Le38w== hcloud@ssh-acceptance-test"
            4:     labels = {
            5:       key = "4999487623275307476"
            6:   }
            7: }
            8: 
            9: resource "hcloud_primary_ip" "primary-ip-v4" {
           10:   name        = "primaryip-v4-test--665592047586110587"
           11:   type = "ipv4"
           12:   datacenter       = "nbg1-dc3"
           13:   
           14:   assignee_type       = "server"
           15:   
           16:   auto_delete       = false
           17: }
           18: 
           19: resource "hcloud_server" "server-primaryIP-test" {
           20:   name        = "server-primaryIP-test--665592047586110587"
           21:   server_type = "cpx11"
           22:   image       = "ubuntu-24.04"
           23:   datacenter  = "nbg1-dc3"
           24:   
           25:   ssh_keys    = [hcloud_ssh_key.server.id]
           26:   
           27:   public_net {
           28:     ipv4 = hcloud_primary_ip.primary-ip-v4.id
           29:     ipv4_enabled = true
           30:     ipv6_enabled = false
           31:   }
           32:   
           33: 
           34:   
           35: }
        
    resource_test.go:927: 
        
        HCL:
            1: resource "hcloud_ssh_key" "server" {
            2:   name        = "server--665592047586110587"
            3:   public_key  = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQCs9urdjWxJqCVELlPhuQBrCsTJ0XdF3j2+VETM59knH2crDtaAK2omx1cX5MFwjHKEea9fZEel/w0Vj4NhfgsGMe7JSr4Cj5bubcVGI7rJ12Ohl3QFLOZ6azwd13gT6K6o2g3OZtGQagxso4u4BOp9KyLy6wcxO4DVDhyt2Le38w== hcloud@ssh-acceptance-test"
            4:     labels = {
            5:       key = "4999487623275307476"
            6:   }
            7: }
            8: 
            9: resource "hcloud_primary_ip" "primary-ip-v4" {
           10:   name        = "primaryip-v4-test--665592047586110587"
           11:   type = "ipv4"
           12:   datacenter       = "nbg1-dc3"
           13:   
           14:   assignee_type       = "server"
           15:   
           16:   auto_delete       = false
           17: }
           18: 
           19: resource "hcloud_server" "server-primaryIP-test" {
           20:   name        = "server-primaryIP-test--665592047586110587"
           21:   server_type = "cpx11"
           22:   image       = "ubuntu-24.04"
           23:   datacenter  = "nbg1-dc3"
           24:   
           25:   ssh_keys    = [hcloud_ssh_key.server.id]
           26:   
           27: 
           28:   
           29: }
        
    resource_test.go:946: 
        
        HCL:
            1: resource "hcloud_ssh_key" "server" {
            2:   name        = "server--665592047586110587"
            3:   public_key  = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQCs9urdjWxJqCVELlPhuQBrCsTJ0XdF3j2+VETM59knH2crDtaAK2omx1cX5MFwjHKEea9fZEel/w0Vj4NhfgsGMe7JSr4Cj5bubcVGI7rJ12Ohl3QFLOZ6azwd13gT6K6o2g3OZtGQagxso4u4BOp9KyLy6wcxO4DVDhyt2Le38w== hcloud@ssh-acceptance-test"
            4:     labels = {
            5:       key = "4999487623275307476"
            6:   }
            7: }
            8: 
            9: resource "hcloud_primary_ip" "primary-ip-v6" {
           10:   name        = "primaryip-v6-test--665592047586110587"
           11:   type = "ipv6"
           12:   datacenter       = "nbg1-dc3"
           13:   
           14:   assignee_type       = "server"
           15:   
           16:   auto_delete       = false
           17: }
           18: 
           19: resource "hcloud_server" "server-primaryIP-test" {
           20:   name        = "server-primaryIP-test--665592047586110587"
           21:   server_type = "cpx11"
           22:   image       = "ubuntu-24.04"
           23:   datacenter  = "nbg1-dc3"
           24:   
           25:   ssh_keys    = [hcloud_ssh_key.server.id]
           26:   
           27:   public_net {
           28:     ipv4_enabled = false
           29:     ipv6 = hcloud_primary_ip.primary-ip-v6.id
           30:     ipv6_enabled = true
           31:   }
           32:   
           33: 
           34:   
           35: }
        
=== PAUSE TestServerResource_PrimaryIPTests
=== CONT  TestServerResource_PrimaryIPTests
    resource_test.go:901: Step 3/3 error: Error running apply: exit status 1
        
        Error: unexpected assignee id: state_assignee_id=51802107 primary_ip_id=67207997 server_ipv4_id=67208025 server_ipv6_id=67208029
        
--- FAIL: TestServerResource_PrimaryIPTests (77.92s)
FAIL
FAIL	github.com/hetznercloud/terraform-provider-hcloud/internal/server	77.934s
FAIL
make: *** [GNUmakefile:65: testacc] Error 1

Traces: https://github.com/hetznercloud/terraform-provider-hcloud/blob/tf-1.9-primary-ip-delete/debug/test-TestServerResource_PrimaryIPTests.log

Actual Behavior

The state of the primary IP resource references an invalid server instance. Which led to try to unassign the primary IP from an instance that was not assigned. Which made the primary IP resource deletion fail.

Expected Behavior

We expect the state of the primary IP resource to be refreshed before we can destroy it.

Steps to Reproduce

git clone --branch tf-1.9-primary-ip-delete https://github.com/hetznercloud/terraform-provider-hcloud
cd terraform-provider-hcloud

export TF_LOG="trace"
export TF_LOG_PATH_MASK="test-%s.log"
make TEST="./internal/server" TESTARGS="-run TestServerResource_PrimaryIPTests" testacc

Additional Context

References

  • 460c7f3933115c3edf670caacd2ffa489ef4eeb8
  • https://github.com/hashicorp/terraform/pull/35467

jooola avatar Aug 15 '24 14:08 jooola

Hi @jooola,

Thanks for filing the issue. I have not been able to replicate the issue from your config yet, and unfortunately the logs don't have any of the core trace output either. I'm wondering if this is something specific to the acceptance tests themselves, can you run a standalone example from the command line? (that would also give easier access to the core logs, especially if you use TF_LOG_CORE=trace to avoid all the plugin logging).

jbardin avatar Aug 15 '24 15:08 jbardin

Sadly, we are having some trouble to make this reproducible outside our provider.

We were able to reproduce the bug only running from the CLI.

I pushed the core trace logs and the backed up states for each steps in our debug branch. The failing step is the step 5 https://github.com/hetznercloud/terraform-provider-hcloud/blob/tf-1.9-primary-ip-delete/debug/step5/step5.log

The error is ocurring at this line: https://github.com/hetznercloud/terraform-provider-hcloud/blob/tf-1.9-primary-ip-delete/debug/step5/step5.log#L1300

The previous steps can be found in https://github.com/hetznercloud/terraform-provider-hcloud/blob/tf-1.9-primary-ip-delete/debug/

To re-state our problem, we are running a delete resource using a state that is not up-to-date. Previously, the state was refreshed using the read resource call before performing the delete resource call.

We are not sure if it is now our responsibility to pull fresh data ourselves in the delete call, or if terraform is the one responsible for refreshing the state before deleting the resource.

jooola avatar Aug 20 '24 12:08 jooola

I think I see a similar issue.

On Terraform 1.9.5, I see this inconsistency using one of my providers.

I tried running a destroy plan in 2 ways:

  1. I ran terraform destroy
root@lglap049:~/terraform-ome-qe/test# terraform destroy
ome_firmware_catalog.firmware_catalog_example1: Refreshing state... [name=rounak_1]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  - destroy

Terraform will perform the following actions:

  # ome_firmware_catalog.firmware_catalog_example1 will be destroyed
  - resource "ome_firmware_catalog" "firmware_catalog_example1" {
      - associated_baselines     = [] -> null
      ...
      - share_password           = (sensitive value) -> null
      - share_type               = "HTTPS" -> null
      - share_user               = "example-user" -> null
      - source_path              = "catalogs" -> null
      - status                   = "Failed" -> null
      - task_id                  = 11226 -> null
    }

Plan: 0 to add, 0 to change, 1 to destroy.

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value: no

Destroy cancelled.
root@lglap049:~/terraform-ome-qe/test# 
  1. I commented the resource block and ran terraform apply.
root@lglap049:~/terraform-ome-qe/test# terraform apply
ome_firmware_catalog.firmware_catalog_example1: Refreshing state... [name=rounak_1]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  - destroy

Terraform will perform the following actions:

  # ome_firmware_catalog.firmware_catalog_example1 will be destroyed
  # (because ome_firmware_catalog.firmware_catalog_example1 is not in configuration)
  - resource "ome_firmware_catalog" "firmware_catalog_example1" {
      - associated_baselines     = [] -> null
      ...
      - share_password           = (sensitive value) -> null
      - share_type               = "HTTPS" -> null
      - share_user               = "example-user" -> null
      - source_path              = "catalogs" -> null
      - status                   = "Running" -> null
      - task_id                  = 11226 -> null
    }

Plan: 0 to add, 0 to change, 1 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: no

Apply cancelled.
root@lglap049:~/terraform-ome-qe/test# 

Now, the true state of the resource's status field is Failed, which is what we should see if refresh is happening correctly. But when running terraform apply, u can see that the field is still the old value Running. This is surprising since the logs show that a refresh is happening.

rounak-adhikary avatar Jan 21 '25 10:01 rounak-adhikary

Thank you @rounak-adhikary, that explains the difference! The changes for deferred actions were not done in the correct order for orphaned resource instances, so the refreshed state was not refreshed in the change.

jbardin avatar Jan 21 '25 19:01 jbardin

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Feb 24 '25 02:02 github-actions[bot]