VM & AMI templates set/updated via terraform are not reflected in the UI
Internal reference: SURE-5065. Reported in 2.6.6.
Issue description: When updating a VM template, the changes do not take effect in the UI. See the GitHub issue for more info: https://github.com/rancher/terraform-provider-rancher2/issues/857
This issue was also noticed when changing the AMIs of a downstream Amazon EC2 cluster.
Business impact: Unclear; at minimum this can be confusing or very concerning to users.
Repro steps:
The issue appears after the creation step: when the user edits the vSphere VM template of a node pool for a downstream cluster that was created in Rancher via the Rancher Terraform provider, the UI allows them to click the Save button, but after a refresh it still shows the original VM template, and the cluster does not appear to change on the back end at all (e.g. no nodes are rolled to the new VM template).
Workaround: None
Actual behavior: Shows incorrect VM template (or AMI)
Expected behavior: Shows correct VM template (or AMI)
Terraform-to-UI settings can be tricky. We would need to reproduce this to dig in further.
Found some Terraform samples here in case this helps us reproduce the issue: https://registry.terraform.io/providers/rancher/rancher2/latest/docs
Update: this is also reproducible on AWS. A Terraform template file is available via the SURE ticket.
Let's try this on AWS with an AMI first since that will be quicker. This may point to a past issue involving labels not being applied correctly (need to find & link).
Bumping out based on current capacity and what's automatable.
Pushing to Q4.
Pushing to Q1 now due to other priority work and other releases.
Some updates from last week regarding this issue:
Conditions:
- Since I lack access to vSphere (it requires a VPN and running a local backend), I tested this with the Amazon EC2 provider
- I interpreted "changing AMIs" as changing the `AMI ID`
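Concretely, "changing the AMI ID" here means editing the `ami` field of the `amazonec2_config` block in the machine config and running `terraform apply` again. A minimal sketch (the AMI IDs below are hypothetical placeholders):

```hcl
# Hypothetical sketch of the update being tested: only the ami value changes
# between two applies; everything else in the machine config stays the same.
resource "rancher2_machine_config_v2" "foo" {
  generate_name = "test-foo"
  amazonec2_config {
    ami    = "ami-0aaaaaaaaaaaaaaaa" # original AMI (placeholder ID)
    # ami  = "ami-0bbbbbbbbbbbbbbbb" # new AMI applied in the second run (placeholder ID)
    region = "us-west-2"
  }
}
```

After the second apply, the UI's machine pool form should show the new AMI ID; the bug report is that it keeps showing the old one.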
Findings:
- Cannot reproduce the issue running the latest TF provider for Amazon, updating the AMI ID on a cluster with similar conditions as the one described in the JIRA issue (check `main.tf` there)
- Sometimes two requests (PUT) are made when updating the machine pool, which can happen because of a 409 conflict when updating the actual resource. Can this lead to inconsistencies? 🤔
- There's an issue updating the AMI ID on a TF-provisioned cluster with TWO machine pools. This differs from the JIRA issue reported, but it should be investigated further, as I haven't been able to pinpoint the root cause; it should probably be logged as a separate issue.
Here's the cluster config on where this separate issue has been found:
```hcl
terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "3.1.1"
      # version = "1.24.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "5.19.0"
    }
  }
}

provider "rancher2" {
  api_url    = "<!-- your Rancher api_url -->"
  access_key = "<!-- enter your access_key from Rancher API token -->"
  secret_key = "<!-- enter your secret_key from Rancher API token -->"
  insecure   = true
}

provider "aws" {
  region     = "us-west-2"
  access_key = "<!-- enter your access_key from AWS credentials -->"
  secret_key = "<!-- enter your secret_key from AWS credentials -->"
}

# Creating Rancher v2 amazonec2 cluster v2
# Create amazonec2 cloud credential
resource "rancher2_cloud_credential" "foo-creds" {
  name = "foo"
  amazonec2_credential_config {
    access_key = "<!-- enter your access_key from AWS credentials -->"
    secret_key = "<!-- enter your secret_key from AWS credentials -->"
  }
}

# Create amazonec2 machine config v2
resource "rancher2_machine_config_v2" "foo" {
  generate_name = "test-foo"
  amazonec2_config {
    ami            = ""
    region         = "us-west-2"
    security_group = ["rancher-nodes"]
    subnet_id      = ""
    vpc_id         = "vpc-007f1f25ac3fb5b34" # check your available VPCs and get an ID
    zone           = "a"
  }
}

# Create a new rancher v2 Cluster with multiple machine pools
resource "rancher2_cluster_v2" "foo-rke2" {
  name                                     = "foo-rke2"
  kubernetes_version                       = "v1.26.8+rke2r1"
  enable_network_policy                    = false
  default_cluster_role_for_project_members = "user"
  rke_config {
    machine_pools {
      name                         = "pool1"
      cloud_credential_secret_name = rancher2_cloud_credential.foo-creds.id
      control_plane_role           = true
      etcd_role                    = true
      worker_role                  = false
      quantity                     = 1
      drain_before_delete          = true
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind
        name = rancher2_machine_config_v2.foo.name
      }
    }
    # Each machine pool must be passed separately
    machine_pools {
      name                         = "pool2"
      cloud_credential_secret_name = rancher2_cloud_credential.foo-creds.id
      control_plane_role           = false
      etcd_role                    = false
      worker_role                  = true
      quantity                     = 2
      drain_before_delete          = true
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind
        name = rancher2_machine_config_v2.foo.name
      }
    }
  }
}
```
FYI @richard-cox
@momesgin take a look at the tf file that comes with the JIRA issue and compare with the above configuration. They should be pretty similar.
I was able to successfully update the VM template through the UI from `mo-ubuntu-20.04-cloudimg` to `jammy-2-cloudimg-amd64` for an RKE2 vSphere cluster that was provisioned in Rancher via Terraform:
```hcl
terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "1.24.0"
    }
    vsphere = {
      source  = "hashicorp/vsphere"
      version = "2.2.0"
    }
  }
}

# Provider bootstrap config
provider "rancher2" {
  api_url    = "..."
  access_key = "..."
  secret_key = "..."
  insecure   = true
}

provider "vsphere" {
  user                 = "..."
  password             = "..."
  vsphere_server       = "..."
  allow_unverified_ssl = true
}

data "vsphere_datacenter" "datacenter" {
  name = "/Datacenter"
}

data "vsphere_folder" "folder" {
  path = "/Datacenter/vm/mo"
}

data "vsphere_virtual_machine" "template" {
  name          = "/Datacenter/vm/mo/mo-ubuntu-20.04-cloudimg"
  datacenter_id = data.vsphere_datacenter.datacenter.id
}

data "vsphere_datastore" "datastore" {
  name          = "datastore1"
  datacenter_id = data.vsphere_datacenter.datacenter.id
}

resource "rancher2_machine_config_v2" "foo" {
  generate_name = "mo-tf"
  vsphere_config {
    datastore     = data.vsphere_datastore.datastore.name
    cpu_count     = "4"
    memory_size   = "4096"
    disk_size     = "20000"
    creation_type = "template"
    clone_from    = data.vsphere_virtual_machine.template.name
    folder        = data.vsphere_folder.folder.path
  }
}

# Create a new rancher2 RKE2 Cluster
resource "rancher2_cluster_v2" "mo-tf" {
  name                                     = "foo-custom"
  kubernetes_version                       = "v1.22.11+rke2r1"
  enable_network_policy                    = false
  default_cluster_role_for_project_members = "user"
  rke_config {
    machine_pools {
      name                         = "pool1"
      cloud_credential_secret_name = "..."
      control_plane_role           = true
      etcd_role                    = true
      worker_role                  = true
      quantity                     = 1
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind
        name = rancher2_machine_config_v2.foo.name
      }
    }
  }
}
```
Updating:
https://github.com/rancher/dashboard/assets/135728925/5984aa02-dede-4d75-93dc-727853ef3651
After the update finished:
https://github.com/rancher/dashboard/assets/135728925/76d6be11-437e-4b36-bb90-08526cf8affa
For the EKS node group part, with a launch template that has a custom AMI, on build v2.8-4350b89f75e08530c9e9c082dca6e4328eabf453-head, I am still seeing issue https://github.com/rancher/dashboard/issues/9406
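For context, the EKS shape being exercised there is a managed node group that points at a launch template carrying the custom AMI. A hedged sketch using the rancher2 provider's `eks_config_v2` block (all names and IDs are hypothetical; check the provider docs for the exact schema):

```hcl
# Hypothetical sketch: EKS cluster whose node group takes its AMI from a
# custom launch template rather than from the node group itself.
resource "rancher2_cluster" "foo-eks" {
  name = "foo-eks"
  eks_config_v2 {
    cloud_credential_id = "<!-- your AWS cloud credential ID -->"
    region              = "us-west-2"
    node_groups {
      name = "ng1"
      launch_template {
        id      = "lt-0123456789abcdef0" # launch template pinning the custom AMI (placeholder ID)
        version = 1
      }
    }
  }
}
```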
Reproduced on Rancher v2.8.1
Passed on Rancher v2.8.3-rc5. I was able to change the template to another newly created one using the vSphere console. I changed the setting to the new template using the UI form; the cluster started updating, creating new nodes from the new template. The old nodes were deleted and the cluster became active, with all new nodes in the cluster using the new template.
@gaktive, since this has been approved by QA would this close SURE-5065?