VM & AMI templates set/updated via terraform are not reflected in the UI
Internal reference: SURE-5065. Reported in 2.6.6.
Issue description: When updating a VM template, the changes do not take effect in the UI. See the GitHub issue for more info: https://github.com/rancher/terraform-provider-rancher2/issues/857
This issue was also noticed when changing the AMIs of a downstream Amazon EC2 cluster.
Business impact: Unclear; at minimum this can be confusing or very concerning to users.
Repro steps:
The issue appears after the creation step: when the user edits the vSphere VM template of a node pool for a downstream cluster that was created in Rancher via the Rancher Terraform provider, the UI allows them to click the Save button, but after a refresh it still shows the original VM template, and the cluster does not appear to change on the back end at all (e.g. no nodes are rolled to the new VM template).
Workaround: None
Actual behavior: Shows incorrect VM template (or AMI)
Expected behavior: Shows correct VM template (or AMI)
Terraform-to-UI settings can be tricky. We would need to reproduce this to dig in further.
Found some Terraform samples here in case this helps us reproduce the issue: https://registry.terraform.io/providers/rancher/rancher2/latest/docs
Update: this is also reproducible on AWS. A Terraform template file is available via the SURE ticket.
Let's try this on AWS with an AMI first since that will be quicker. This may point to a past issue involving labels not being applied correctly (need to find & link).
Bumping out based on current capacity and what's automatable.
Pushing to Q4.
Pushing to Q1 now due to other priority work and other releases.
Some updates from last week regarding this issue:
Conditions:
- Since I lack access to vSphere (it requires a VPN and running a local backend), I tested this with the Amazon EC2 provider
- I interpreted "changing AMIs" as changing the `AMI ID`
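Concretely, "changing the AMI ID" here means editing the `ami` field of the `amazonec2_config` block in the machine config and running `terraform apply` again. A minimal sketch (the AMI IDs below are hypothetical placeholders):

```hcl
# Hypothetical sketch of the update being tested: only the ami value changes
# between two applies; everything else in the machine config stays the same.
resource "rancher2_machine_config_v2" "foo" {
  generate_name = "test-foo"
  amazonec2_config {
    ami    = "ami-0aaaaaaaaaaaaaaaa" # original AMI (placeholder ID)
    # ami  = "ami-0bbbbbbbbbbbbbbbb" # new AMI applied in the second run (placeholder ID)
    region = "us-west-2"
  }
}
```

After the second apply, the UI's machine pool form should show the new AMI ID; the bug report is that it keeps showing the old one.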
Findings:
- Cannot reproduce the issue running the latest TF provider for Amazon, updating the AMI ID on a cluster with similar conditions as the one described in the JIRA issue (check `main.tf` there)
- Sometimes two requests (PUT) are made when updating the machine pool, which can happen because of a 409 conflict when updating the actual resource. Can this lead to inconsistencies? 🤔
- There's an issue updating the AMI ID on a TF-provisioned cluster with TWO machine pools. This differs from the JIRA issue reported, but it should be investigated further, as I haven't been able to pinpoint the root cause; it should probably be logged as a separate issue.
Here's the cluster config on where this separate issue has been found:
```hcl
terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "3.1.1"
      # version = "1.24.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "5.19.0"
    }
  }
}

provider "rancher2" {
  api_url    = "<!-- your Rancher api_url -->"
  access_key = "<!-- enter your access_key from Rancher API token -->"
  secret_key = "<!-- enter your secret_key from Rancher API token -->"
  insecure   = true
}

provider "aws" {
  region     = "us-west-2"
  access_key = "<!-- enter your access_key from AWS credentials -->"
  secret_key = "<!-- enter your secret_key from AWS credentials -->"
}

# Creating Rancher v2 amazonec2 cluster v2
# Create amazonec2 cloud credential
resource "rancher2_cloud_credential" "foo-creds" {
  name = "foo"
  amazonec2_credential_config {
    access_key = "<!-- enter your access_key from AWS credentials -->"
    secret_key = "<!-- enter your secret_key from AWS credentials -->"
  }
}

# Create amazonec2 machine config v2
resource "rancher2_machine_config_v2" "foo" {
  generate_name = "test-foo"
  amazonec2_config {
    ami            = ""
    region         = "us-west-2"
    security_group = ["rancher-nodes"]
    subnet_id      = ""
    vpc_id         = "vpc-007f1f25ac3fb5b34" # check your available VPCs and get an ID
    zone           = "a"
  }
}

# Create a new rancher v2 Cluster with multiple machine pools
resource "rancher2_cluster_v2" "foo-rke2" {
  name                                     = "foo-rke2"
  kubernetes_version                       = "v1.26.8+rke2r1"
  enable_network_policy                    = false
  default_cluster_role_for_project_members = "user"
  rke_config {
    machine_pools {
      name                         = "pool1"
      cloud_credential_secret_name = rancher2_cloud_credential.foo-creds.id
      control_plane_role           = true
      etcd_role                    = true
      worker_role                  = false
      quantity                     = 1
      drain_before_delete          = true
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind
        name = rancher2_machine_config_v2.foo.name
      }
    }
    # Each machine pool must be passed separately
    machine_pools {
      name                         = "pool2"
      cloud_credential_secret_name = rancher2_cloud_credential.foo-creds.id
      control_plane_role           = false
      etcd_role                    = false
      worker_role                  = true
      quantity                     = 2
      drain_before_delete          = true
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind
        name = rancher2_machine_config_v2.foo.name
      }
    }
  }
}
```
FYI @richard-cox
@momesgin take a look at the tf file that comes with the JIRA issue and compare with the above configuration. They should be pretty similar.
I was able to successfully update the VM template through the UI from `mo-ubuntu-20.04-cloudimg` to `jammy-2-cloudimg-amd64` for an RKE2 vSphere cluster that was provisioned in Rancher via Terraform:
```hcl
terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "1.24.0"
    }
    vsphere = {
      source  = "hashicorp/vsphere"
      version = "2.2.0"
    }
  }
}

# Provider bootstrap config
provider "rancher2" {
  api_url    = "..."
  access_key = "..."
  secret_key = "..."
  insecure   = true
}

provider "vsphere" {
  user                 = "..."
  password             = "..."
  vsphere_server       = "..."
  allow_unverified_ssl = true
}

data "vsphere_datacenter" "datacenter" {
  name = "/Datacenter"
}

data "vsphere_folder" "folder" {
  path = "/Datacenter/vm/mo"
}

data "vsphere_virtual_machine" "template" {
  name          = "/Datacenter/vm/mo/mo-ubuntu-20.04-cloudimg"
  datacenter_id = data.vsphere_datacenter.datacenter.id
}

data "vsphere_datastore" "datastore" {
  name          = "datastore1"
  datacenter_id = data.vsphere_datacenter.datacenter.id
}

resource "rancher2_machine_config_v2" "foo" {
  generate_name = "mo-tf"
  vsphere_config {
    datastore     = data.vsphere_datastore.datastore.name
    cpu_count     = "4"
    memory_size   = "4096"
    disk_size     = "20000"
    creation_type = "template"
    clone_from    = data.vsphere_virtual_machine.template.name
    folder        = data.vsphere_folder.folder.path
  }
}

# Create a new rancher2 RKE2 Cluster
resource "rancher2_cluster_v2" "mo-tf" {
  name                                     = "foo-custom"
  kubernetes_version                       = "v1.22.11+rke2r1"
  enable_network_policy                    = false
  default_cluster_role_for_project_members = "user"
  rke_config {
    machine_pools {
      name                         = "pool1"
      cloud_credential_secret_name = "..."
      control_plane_role           = true
      etcd_role                    = true
      worker_role                  = true
      quantity                     = 1
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind
        name = rancher2_machine_config_v2.foo.name
      }
    }
  }
}
```
Updating:
https://github.com/rancher/dashboard/assets/135728925/5984aa02-dede-4d75-93dc-727853ef3651
After the update finished:
https://github.com/rancher/dashboard/assets/135728925/76d6be11-437e-4b36-bb90-08526cf8affa
For the EKS node group part, with a launch template that has a custom AMI, on build v2.8-4350b89f75e08530c9e9c082dca6e4328eabf453-head, I am still seeing issue https://github.com/rancher/dashboard/issues/9406
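For context, the EKS shape being exercised there is a managed node group that points at a launch template carrying the custom AMI. A hedged sketch using the rancher2 provider's `eks_config_v2` block (all names and IDs are hypothetical; check the provider docs for the exact schema):

```hcl
# Hypothetical sketch: EKS cluster whose node group takes its AMI from a
# custom launch template rather than from the node group itself.
resource "rancher2_cluster" "foo-eks" {
  name = "foo-eks"
  eks_config_v2 {
    cloud_credential_id = "<!-- your AWS cloud credential ID -->"
    region              = "us-west-2"
    node_groups {
      name = "ng1"
      launch_template {
        id      = "lt-0123456789abcdef0" # launch template pinning the custom AMI (placeholder ID)
        version = 1
      }
    }
  }
}
```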
Reproduced on Rancher v2.8.1
Passed on Rancher v2.8.3-rc5. I was able to change the template to another newly created one using the vSphere console. I changed the setting to the new template using the UI form; the cluster started updating, creating new nodes from the new template. The old nodes were deleted and the cluster became active, with all new nodes in the cluster using the new template.
@gaktive, since this has been approved by QA would this close SURE-5065?