terraform-oci-oke

[v4][regression] Operator VM gets recreated at every non-OKE module update

Open denismakogon opened this issue 4 years ago • 5 comments

On the latest 4.x branch, after an OKE cluster with an operator host is created via terraform apply, running terraform plan reports that the oci_core_instance.operator resource must be replaced:

  # module.oke-v4.module.operator[0].oci_core_instance.operator must be replaced
-/+ resource "oci_core_instance" "operator" {
      ~ availability_domain                 = "VMwN:UK-LONDON-1-AD-1" -> (known after apply) # forces replacement
      ~ boot_volume_id                      = "ocid1.bootvolume.oc1.uk-london-1.abwgiljs2vxcfitwxjkzrqyujiinwth3rq2w5mlkvbyad3ttjll25k27ahzq" -> (known after apply)
      + capacity_reservation_id             = (known after apply)
      + dedicated_vm_host_id                = (known after apply)
      ~ defined_tags                        = {} -> (known after apply)
      - extended_metadata                   = {} -> null
      ~ fault_domain                        = "FAULT-DOMAIN-2" -> (known after apply)
      ~ hostname_label                      = "test-operator" -> (known after apply)
      ~ id                                  = "ocid1.instance.oc1.uk-london-1.anwgiljss7djfsiccnnusqvn3rlazt2pl5vs5ttsqdd36bgyl44cdl5ogo7q" -> (known after apply)
      ~ image                               = "ocid1.image.oc1.uk-london-1.aaaaaaaa646hmq7yvlxk6wqhdzrljfxdy7iyy6wk7xtmdf3x73ko45nwqfsa" -> (known after apply)
      + ipxe_script                         = (known after apply)
      + is_pv_encryption_in_transit_enabled = (known after apply)
      ~ launch_mode                         = "PARAVIRTUALIZED" -> (known after apply)
      ~ metadata                            = {
          - "ssh_authorized_keys" = <<-EOT
                <SANITIZED>
            EOT
          - "user_data"           = "H4sIAAAAAAAA/3RUXW/iPBO9j9T/MKKXXXCA8LnqSuE7QBpoAwt7UxnHJG4T29gOEPT++FfQ3W71PHpylfGcOT4zc+S+4IZyUw4LSbuQ5alhEiuDMnam0XfYiZxHWBWPJd/zh71g9TRwn7cl6xqV11RpJngXqhX7zrqzyuWvoDvrD/eAaSk0MzcsNgaTJKPcfIc9SynHGX0sCUkVNkJVCpylpb+locJc76kqDzkREeNxF1o7Zr4AbroNPRtEUpFHZSL4nsV3ls8y+i+F99AXslAsTgzU7GrrG9TsWvUbBAqTlEJfKCkUvgoFzCMkFOD9nqUMG6orAG6awq1Yg6KaqiONKtY9zBmhXNMIch5RBSahsOLsSJXGKSyoypjW7Ej/4OB4lQNYg07EiQM2kBgjdRchoXVF3LRUiMhQ+lGgUS5TZFn3Xzu0JCbvOKavuYwVjq5DUDm1DMvoRXDaBTfXRuGUYfRSRJwW1kkxQ1+vM9dd6x40NbkETRSTxgIog8Qm6UIJKSEM+rOQz5+KTkoWAID83ZDgugslu2XbH+f0c0Ol+MLkww5r2nQ+cuRjW1343y0EmDjac39/yC6WoYNCvzcYtZCzemvW7Om5qNeX/vrgUntdvDhk5nnhr+rU17PqfNsOG+OHxSo52NP28bLP0RPv2K2osRHq8BL6Yd6Wppi9jRYjskFOu9eaoP3YVc66vZ0X2XsaJDOW9bd558Vd1ga92WRWi8f7ZXsQHQqmN71dHrQWgxEaknRz9DWl/eBt/PBczZyZu247dS9cr1PHaQ1iO6j2+rL+wk/skI3V04MZN+TTIX0oiii8kA1K+OoSvjvbWugOGufTOX5Hi2bViHjSDPw+D/xE7M7PsuFOVRB5VX/kOYjKhvPLLUaXE6rXJjtvNFTJdkCaMmTMrlLldljQj3fCr9XTnbcmD03pH87rDg+ea0nrMMoGuzQbu/MFT4+t6WiQzX2aCKW9E4/CcNPx3JyqhgjHnXkwC875cTh9O5g6aTZHKECmFf0aXejTW+Ny+tlMshparlx3gZC7dF2EOixFP9+d09B13cdHS+WcZFHXgjJQkggo9W/uzBXjMXx6p1IpXRE7rBP4b3/9JaFnKZSBoO+99ufeq7sKJ4+Ma4M5oa9SMU6YxGkJfvwAQInIKBKSoMqVX5ErjRE5Sb6kPm/ZM8508s+3qly+s/4fAAD//7qFZ0wKBQAA"
        } -> (known after apply) # forces replacement
      ~ private_ip                          = "10.0.0.10" -> (known after apply)
      + public_ip                           = (known after apply)
      ~ region                              = "uk-london-1" -> (known after apply)
      ~ subnet_id                           = "ocid1.subnet.oc1.uk-london-1.aaaaaaaadabk34qp3fkfwi2ur7dk5e37q34hpn4i3nbgauvqszcc4zi67a4q" -> (known after apply)
      ~ system_tags                         = {} -> (known after apply)
      ~ time_created                        = "2021-10-14 07:47:03.581 +0000 UTC" -> (known after apply)
      + time_maintenance_reboot_due         = (known after apply)
        # (5 unchanged attributes hidden)

      ~ agent_config {
            # (3 unchanged attributes hidden)

          - plugins_config {
              - desired_state = "ENABLED" -> null
              - name          = "Management Agent" -> null
            }
            # (1 unchanged block hidden)
        }

      ~ availability_config {
          ~ is_live_migration_preferred = false -> (known after apply)
          ~ recovery_action             = "RESTORE_INSTANCE" -> (known after apply)
        }

      ~ create_vnic_details {
          - assign_private_dns_record = false -> null
          ~ defined_tags              = {} -> (known after apply)
          ~ freeform_tags             = {
              - "environment" = "dev"
              - "role"        = "operator"
            } -> (known after apply)
          ~ private_ip                = "10.0.0.10" -> (known after apply)
          ~ skip_source_dest_check    = false -> (known after apply)
          + vlan_id                   = (known after apply)
            # (5 unchanged attributes hidden)
        }

      ~ instance_options {
          ~ are_legacy_imds_endpoints_disabled = false -> (known after apply)
        }

      ~ launch_options {
          ~ firmware                            = "UEFI_64" -> (known after apply)
          ~ is_consistent_volume_naming_enabled = true -> (known after apply)
          ~ is_pv_encryption_in_transit_enabled = false -> (known after apply)
          ~ remote_data_volume_type             = "PARAVIRTUALIZED" -> (known after apply)
            # (2 unchanged attributes hidden)
        }

      + platform_config {
          + is_measured_boot_enabled           = (known after apply)
          + is_secure_boot_enabled             = (known after apply)
          + is_trusted_platform_module_enabled = (known after apply)
          + numa_nodes_per_socket              = (known after apply)
          + type                               = (known after apply)
        }

      + preemptible_instance_config {
          + preemption_action {
              + preserve_boot_volume = (known after apply)
              + type                 = (known after apply)
            }
        }

      ~ shape_config {
          + baseline_ocpu_utilization     = (known after apply)
          + gpu_description               = (known after apply)
          ~ gpus                          = 0 -> (known after apply)
          + local_disk_description        = (known after apply)
          ~ local_disks                   = 0 -> (known after apply)
          ~ local_disks_total_size_in_gbs = 0 -> (known after apply)
          ~ max_vnic_attachments          = 2 -> (known after apply)
          ~ networking_bandwidth_in_gbps  = 1 -> (known after apply)
          ~ processor_description         = "2.25 GHz AMD EPYC™ 7742 (Rome)" -> (known after apply)
            # (2 unchanged attributes hidden)
        }

      ~ source_details {
          ~ boot_volume_size_in_gbs = "47" -> (known after apply)
          + kms_key_id              = (known after apply)
            # (2 unchanged attributes hidden)
        }

        # (1 unchanged block hidden)
    }

I'm not sure what has changed, but this is a regression: previously the operator host remained the same throughout the lifecycle of an OKE cluster.

denismakogon avatar Oct 14 '21 08:10 denismakogon

I'm unable to replicate this behaviour.

hyder avatar Oct 18 '21 01:10 hyder

Closed due to lack of activity. Please reopen if this is still impacting you.

hyder avatar Oct 20 '21 23:10 hyder

@hyder We faced a similar issue. After a lot of debugging, it turns out that the change below (which wasn't caused by any code change; it just showed up randomly) causes the plan to report 4 to add, 30 to change, 3 to destroy.

But after removing every depends_on = [module.vcn] in main.tf, the plan had only one update; the rest of the changes were gone. See https://itnext.io/beware-of-depends-on-for-modules-it-might-bite-you-da4741caac70
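For readers unfamiliar with this pitfall, here is a minimal sketch of the pattern described above and one possible alternative. With a module-level depends_on, any pending change in the upstream module makes Terraform defer reading the data sources inside the dependent module until apply, so attributes derived from them appear as "(known after apply)" and can force replacement. The module sources and the input and output names (vcn_id, operator_subnet_id) are assumptions for illustration only; they are not taken from terraform-oci-oke.

```hcl
# Hypothetical sketch: module paths, inputs and outputs are illustrative
# assumptions, not the actual layout of terraform-oci-oke.

# Pattern that produced the noisy plan: a module-wide dependency.
# If anything in module.vcn has a pending change (for example a security
# list modified out of band), every data source inside module.operator is
# deferred to apply time and its results become "(known after apply)".
#
# module "operator" {
#   source     = "./modules/operator"
#   depends_on = [module.vcn]
# }

module "vcn" {
  source = "./modules/vcn"
}

# Alternative: pass only the values the module needs. Terraform infers the
# ordering from these references, and an unrelated change inside module.vcn
# no longer touches the operator's inputs.
module "operator" {
  source = "./modules/operator"

  vcn_id    = module.vcn.vcn_id             # assumed output name
  subnet_id = module.vcn.operator_subnet_id # assumed output name
}
```

Whether explicit references are enough depends on what the dependent module actually reads; this only shows the general shape of the change.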

I am not sure what's causing the security list to change, but I want to leave this comment to help anyone who stumbles upon this issue like I did.

# module.terraform-oci-oke.module.vcn[0].oci_core_default_security_list.lockdown[0] will be updated in-place
~ resource "oci_core_default_security_list" "lockdown" {
      id                         = "REDACTED"
      # (7 unchanged attributes hidden)

    - egress_security_rules {
        - destination      = "172.16.64.0/18" -> null
        - destination_type = "CIDR_BLOCK" -> null
        - protocol         = "6" -> null
        - stateless        = false -> null

        - tcp_options {
            - max = 10256 -> null
            - min = 10256 -> null
          }
      }
    - egress_security_rules {
        - destination      = "172.16.64.0/18" -> null
        - destination_type = "CIDR_BLOCK" -> null
        - protocol         = "6" -> null
        - stateless        = false -> null

        - tcp_options {
            - max = 31440 -> null
            - min = 31440 -> null
          }
      }

    - ingress_security_rules {
        - protocol    = "6" -> null
        - source      = "0.0.0.0/0" -> null
        - source_type = "CIDR_BLOCK" -> null
        - stateless   = false -> null

        - tcp_options {
            - max = 80 -> null
            - min = 80 -> null
          }
      }
    - ingress_security_rules {
        - protocol    = "6" -> null
        - source      = "172.16.2.32/27" -> null
        - source_type = "CIDR_BLOCK" -> null
        - stateless   = false -> null

        - tcp_options {
            - max = 10256 -> null
            - min = 10256 -> null
          }
      }
    - ingress_security_rules {
        - protocol    = "6" -> null
        - source      = "172.16.2.32/27" -> null
        - source_type = "CIDR_BLOCK" -> null
        - stateless   = false -> null

        - tcp_options {
            - max = 31440 -> null
            - min = 31440 -> null
          }
      }
  }

bader-tayeb avatar Aug 31 '22 13:08 bader-tayeb

I'm reopening the issue

hyder avatar Aug 31 '22 20:08 hyder

hi @bader-tayeb,

Thanks for notifying us about this issue. Can I check whether you created a service of type LoadBalancer by any chance? You may have created it manually, or deployed an ingress controller or a packaged helm chart that caused a LoadBalancer to be created.

hyder avatar Aug 31 '22 20:08 hyder

@hyder After looking into it, yes we've created a service of type LoadBalancer and that caused the "oci_core_default_security_list" "lockdown" to have a plan change in terraform.

But it was difficult to spot this change because the depends_on caused the plan to show 30+ changes instead of just the one.

bader-tayeb avatar Sep 04 '22 10:09 bader-tayeb

@bader-tayeb thanks for confirming. While we investigate a solution, when you create the load balancer, can you please set the load balancer annotations so that the frontend management mode is none and you also specify the NSG? This will ensure your load balancer is healthy while not modifying the default security list.
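As a rough illustration of that suggestion, the sketch below declares such a service through the Terraform kubernetes provider. The two annotation keys are the ones I recall from the OCI cloud controller manager documentation, so please verify them against the version you run; the service name, selector, ports and the lb_nsg_id variable are hypothetical.

```hcl
# Hypothetical sketch: names, ports and the NSG variable are assumptions.
variable "lb_nsg_id" {
  type        = string
  description = "OCID of an existing network security group for the load balancer (assumed to exist)"
}

resource "kubernetes_service" "example" {
  metadata {
    name = "example-lb"

    annotations = {
      # Ask the cloud controller manager not to manage security list rules.
      "service.beta.kubernetes.io/oci-load-balancer-security-list-management-mode" = "None"
      # Attach the load balancer to an NSG that carries the required rules.
      "oci.oraclecloud.com/oci-network-security-groups" = var.lb_nsg_id
    }
  }

  spec {
    type = "LoadBalancer"

    selector = {
      app = "example"
    }

    port {
      port        = 80
      target_port = 8080
    }
  }
}
```

With management mode set to "None", the NSG itself has to allow the listener and health-check traffic, since no rules are added on your behalf.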

hyder avatar Sep 04 '22 11:09 hyder

We too had a similar issue, with the seclists wanting to change and the bastion wanting to be rebuilt at each apply.

It solved itself when we moved to a VCN provisioned outside the module, so this might be another workaround for people reading this.

12345ieee avatar Sep 04 '22 12:09 12345ieee

@hyder we've used the annotations as suggested (link), but the issue still persists. Every time we create the load balancer, the default security list still gets changed.

bader-tayeb avatar Sep 05 '22 12:09 bader-tayeb

There might be more than 1 issue in play. We'll try to identify and fix the problem(s).

Please bear with us.

hyder avatar Sep 05 '22 14:09 hyder

@bader-tayeb @12345ieee:

I've just created a PR (#565) for this. Can you please try this in a new cluster and let us know if this fixes the issue?

hyder avatar Sep 06 '22 05:09 hyder

hi @bader-tayeb @12345ieee,

Can you please test this PR before I merge? Otherwise, I'll assume it's working and go ahead.

hyder avatar Sep 08 '22 02:09 hyder

@hyder This is an acceptable workaround; it fixes the issue. But it invalidates the need for the "oci_core_default_security_list" "lockdown" resource, since I assume this resource's goal is to prevent such changes.

bader-tayeb avatar Sep 08 '22 08:09 bader-tayeb

Thanks for testing. It won't invalidate the need to lock down the default security list. It just won't trigger a recreation of other resources, such as the bastion host, if the default security list was modified out of band, e.g. if you created a service of type LoadBalancer and didn't override the management mode to "None". We'll go ahead and merge then.

hyder avatar Sep 08 '22 08:09 hyder