terraform-provider-rancher2
Data sources create new kubeconfig token when running `terraform plan` and `terraform apply`
Setup:
- Rancher version: 2.6.3
- Rancher cluster type: HA
- Terraform provider rancher2 version: 1.22.2
Steps to reproduce:
- Create a cluster using rancher2_cluster and/or rancher2_cluster_v2 resources
- Add data sources rancher2_cluster and/or rancher2_cluster_v2 with the appropriate cluster names
Issue:
A new kubeconfig token is created for each data source (rancher2_cluster and rancher2_cluster_v2) every time you run terraform plan or terraform apply.
Just off a cursory look, is this not the behavior you'd want/expect from these commands?
I've just noticed that it's doing this for me as well. I have my RANCHER_ACCESS_KEY and RANCHER_SECRET_KEY env vars set, and whenever I run a terraform plan (or apply) it creates a new kubeconfig token that never expires, even though I've only got the following data block:
data "rancher2_cluster" "cluster" {
name = var.rancher_cluster_name
}
is this not the behavior you'd want/expect from these commands?
@eliyamlevy No. Why is it creating them at all, just to get information about a rancher cluster? Shouldn't the credentials in my env vars be sufficient to get this data? Also, why doesn't it set any expiry?
I see why it's creating them now. It returns a generated kube_config from the rancher2_cluster data block.
Perhaps this should be configurable (generate credentials, expiry etc.) and by default not generate credentials? I'm only using this data block in order to get the cluster ID.
I just had to delete 65 generated credentials from my Rancher profile because it had created one each time I ran plan or apply.
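For context, a minimal sketch (the output name is illustrative) of how a data block like the one above is typically consumed just for the cluster ID, without ever referencing the generated kube_config:

# Minimal sketch: only the cluster ID from the data source is used here; the
# kube_config attribute is never referenced, yet a token is still generated.
output "cluster_id" {
  value = data.rancher2_cluster.cluster.id
}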
up
We've got the same problem. We provision almost everything with Terraform, so we use the rancher2_cluster data source often. In the end, Rancher's etcd stored over 32k kubeconfigs and etcd performance went down.
Our workaround right now is to delete those tokens daily with a cronjob. It would be nice if this were configurable, as @flyte already suggested.
I have investigated this issue; the token generation is expected on the Rancher side, but this is a confirmed bug in the Rancher TF provider.
Terraform downloads and caches a kube_config in the state file on every run of terraform plan or terraform apply, and performs validation (https://github.com/rancher/terraform-provider-rancher2/blob/2c3bddda8a470d9a5f7b91009436d9eb4838a33b/rancher2/resource_rancher2_cluster.go#L668) that likely has outdated logic.
We replace the token whenever we generate a new kubeconfig (https://github.com/rancher/terraform-provider-rancher2/blob/2c3bddda8a470d9a5f7b91009436d9eb4838a33b/rancher2/resource_rancher2_cluster.go#L707-L716) instead of using the token from the cached kube_config. This logic needs to be updated so that when a user runs terraform plan or terraform apply, the API token from the cached kube_config (if it exists) is used instead of generating a new one every time and clogging Rancher.
There is also some discussion, currently under review, about adding a no_kubeconfig bool field to avoid downloading a kubeconfig on every run of terraform plan and terraform apply. This is not part of the fix for this issue but would be a separate enhancement.
FYI no_kubeconfig enhancement is logged as https://github.com/rancher/terraform-provider-rancher2/issues/1095
Would using a secure place like a vault to store and retrieve the kubeconfig work as a workaround?
@ebuildy That hasn't been approved as a workaround, no. Are you trying this on your end? The currently advised workaround is to set kubeconfig-default-token-ttl-minutes in the Rancher global settings to a shorter duration so that any expired tokens are cleaned up. However, TF creates new kubeconfig tokens with no expiry date, so this may not work for many cases.
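For reference, a minimal sketch of setting that TTL through the provider itself, assuming the provider's rancher2_setting resource; the 30-minute value is purely an example:

# Hedged sketch: shorten the default kubeconfig token TTL so stray tokens
# expire and can be cleaned up; "30" (minutes) is only an example value.
resource "rancher2_setting" "kubeconfig_default_token_ttl_minutes" {
  name  = "kubeconfig-default-token-ttl-minutes"
  value = "30"
}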
QA Test Template
Issue: https://github.com/rancher/terraform-provider-rancher2/issues/841
Problem
When using TF outputs or data resources that change after the apply, every subsequent run of terraform plan creates a new kubeconfig API token until terraform apply is run again to pick up the output and data changes. This was seen on Rancher 2.6.9, 2.6.10, and 2.7.3 in the oc space.
Solution
After investigation, the root cause of this issue is likely that Terraform downloads/generates a kubeconfig on every run of terraform plan or apply. TF replaces the kubeconfig token every time instead of using the token from the cached kubeconfig, which causes the over-generation of API tokens.
My solution is to update the getClusterKubeconfig logic explained here to use the API token from the cached kubeconfig (if it exists) instead of always replacing it.
Testing
Engineering Testing
Manual Testing
Test plan
- Run a Rancher instance on latest v2.6 with rc v3.1.0-rc3
- Provision a rancher2_cluster via Terraform (I did a 3-node RKE cluster on Amazon EC2 nodes) by running terraform apply with your configuration
- Check how many kubeconfig API tokens exist in the cluster
- Run terraform plan 3 times
- Check how many kubeconfig API tokens exist in the cluster. Verify no new tokens were generated. This means TF is correctly using the token from the cached kubeconfig.
main.tf
terraform {
  required_providers {
    rancher2 = {
      source  = "terraform.local/local/rancher2"
      version = "1.0.0"
    }
  }
}

provider "rancher2" {
  api_url   = var.rancher_api_url
  token_key = var.rancher_admin_bearer_token
  insecure  = true
}

data "rancher2_cloud_credential" "rancher2_cloud_credential" {
  name = var.cloud_credential_name
}

resource "rancher2_cluster" "rancher2_cluster" {
  name = var.rke_cluster_name
  rke_config {
    kubernetes_version = "v1.26.4-rancher2-1"
    network {
      plugin = var.rke_network_plugin
    }
  }
}

resource "rancher2_node_template" "rancher2_node_template" {
  name = var.rke_node_template_name
  amazonec2_config {
    access_key     = var.aws_access_key
    secret_key     = var.aws_secret_key
    region         = var.aws_region
    ami            = var.aws_ami
    security_group = [var.aws_security_group_name]
    subnet_id      = var.aws_subnet_id
    vpc_id         = var.aws_vpc_id
    zone           = var.aws_zone_letter
    root_size      = var.aws_root_size
    instance_type  = var.aws_instance_type
  }
}

resource "rancher2_node_pool" "pool1" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool1"
  hostname_prefix  = "tf-pool1-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = false
  etcd             = true
  worker           = false
}

resource "rancher2_node_pool" "pool2" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool2"
  hostname_prefix  = "tf-pool2-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = true
  etcd             = false
  worker           = false
}

resource "rancher2_node_pool" "pool3" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool3"
  hostname_prefix  = "tf-pool3-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = false
  etcd             = false
  worker           = true
}
Check number of tokens
kubectl get token.management.cattle.io | wc -l
Run the test plan on latest Rancher 2.7 and verify the same behavior.
Automated Testing
QA Testing Considerations
Regressions Considerations
Terraform rancher2_cluster Update / kubeconfig generation.
We use data "rancher2_cluster" "tools" { to get k8s cluster details (host + auth) in order to configure kubectl provider:
provider "kubectl" {
host = try(yamldecode(data.rancher2_cluster.tools[0].kube_config).clusters[0].cluster.server, null)
token = try(yamldecode(data.rancher2_cluster.tools[0].kube_config).users[0].user.token, null)
}
As I am quite new to Rancher + Terraform, I'm not sure if this is a good approach? Many thanks.
Some offline discussion continued on this issue, and additional testing uncovered a need to further refactor the kubeconfig token replacement and util code. Essentially, this is an update to handle the expired/non-existent token cases for a cached kubeconfig gracefully, without forcing the user to re-provision their cluster. The PR is linked to this issue because it's related to the original API token fix in Terraform.
@daviswill2 I will cut another Terraform RC and add a test template with steps for the comprehensive testing I did on my end with the tokens once merged.
QA Test Template
Please test this issue to verify the original bug was fixed using the Test Template, and run the manual checks specified in the refactor PR description to verify that other invalid token cases are handled. Please test using v3.1.0-rc5 once it's available. Let me know if you have any questions!
thaneunsoo said:
Test Environment:
- Rancher version: v2.7-head
- Rancher cluster type: HA
- Docker version: 20.10
Testing:
Tested the following:
- Provision a single node, all-role rke cluster via TF.
- Verify the kubeconfig was downloaded correctly
- Run terraform plan 3 times. Verify the same cached kubeconfig is being used and additional tokens were not generated
- Set kubeconfig-default-token-ttl-minutes to 2m and run a TF update to add 1 node to the cluster. Verify a token is created with a 2m expiry date
- Run tf apply to add 1 node to the cluster. Verify a new token is created before the old one expires
- Run tf apply to add 1 node to the cluster. Verify a new token is created once the old one is removed / no longer exists
- Edit the terraform.tfstate file and set kube_config to a corrupt value, e.g. TESTTOKEN. Run tf apply to add 1 node to the cluster. Verify a new token is created since the old one is corrupt, with no errors
Result: A new token was not created when running terraform plan 3 times. A new token was created when the config was manually corrupted.
@a-blender
Are you sure this is fixed? I'm using v3.1.1 against the Rancher 2.7.5 local cluster and I'm getting a new token on each run.
users:
- name: "local"
user:
- token: "kubeconfig-user-bks46hkm7v:4zw6vdxm5fcx7v2l2rp55997nf7mqvg2kcktjjs8b7xvss2z2mfn8w"
+ token: "kubeconfig-user-bks467vgl2:x5tfgbjxb2trcjzsvcbn2bx476xqmvlbn5tvcjlfvrgnvm7kw8z7dg"
But the old one is still valid, I would think...
apiVersion: management.cattle.io/v3
authProvider: local
current: false
description: Kubeconfig token
expired: false
expiresAt: ""
isDerived: true
kind: Token
lastUpdateTime: ""
metadata:
  creationTimestamp: "2023-08-28T07:54:22Z"
  generateName: kubeconfig-user-bks46
  generation: 1
  labels:
    authn.management.cattle.io/kind: kubeconfig
    authn.management.cattle.io/token-userId: user-bks46
    cattle.io/creator: norman
  name: kubeconfig-user-bks46hkm7v
  resourceVersion: "49050440"
  uid: 80048daa-221c-4ad6-964d-aa7887a28bd2
token: 4zw6vdxm5fcx7v2l2rp55997nf7mqvg2kcktjjs8b7xvss2z2mfn8w
ttl: 0
userId: user-bks46
userPrincipal:
  displayName: Default Admin
  loginName: admin
  me: true
  metadata:
    creationTimestamp: null
  name: local://user-bks46
  principalType: user
  provider: local
terraform config is as simple as it gets:
provider "rancher2" {
api_url = var.rancher_url
access_key = var.rancher_access_key
secret_key = var.rancher_secret_key
}
data "rancher2_cluster" "rancher-local-cluster" {
name = "local"
}
@erSitzt Hello, what are your repro steps? Each run of what command? If that's happening, I would think your kubeconfig is not usable or your token(s) are expired.
@a-blender You're right, the cert in the kubeconfig is not valid; I did not see that before.
@a-blender Hmm... I fixed my kubeconfig issue, but the data source is still creating new tokens every time.
So I tested what happens when I click "Copy KubeConfig..." in Rancher.
These are all from the UI, and this generates new tokens every time as well... so this is not a provider bug, I guess? :)
None of the generated tokens expire... this happens in Rancher 2.7.4 and 2.7.5.
I will open an issue over there.
OK... so I think I figured it out now.
Creating a rancher2_cluster resource like so
resource "rancher2_cluster" "mycluster" {
name = "clustername"
description = "imported cluster"
...and so on
}
and then using rancher2_cluster.mycluster.kube_config will reuse the existing kube_config from the state and will not create a new token
Using the data source instead of referencing the resource directly, even if the cluster is created in the same Terraform project, like this
provider "rancher2" {
api_url = var.rancher_url
access_key = var.rancher_access_key
secret_key = var.rancher_secret_key
}
data "rancher2_cluster" "rancher-local-cluster" {
name = "local"
}
will recreate the token each time and will not use anything from the state file.
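For comparison, a minimal sketch (reusing the rancher2_cluster.mycluster resource above and the kubectl provider shape from the earlier comment) of wiring the provider to the resource's kube_config, so the token already cached in state is reused:

# Hedged sketch: reading kube_config from the managed resource rather than the
# data source reuses the kubeconfig already cached in the Terraform state.
provider "kubectl" {
  host  = try(yamldecode(rancher2_cluster.mycluster.kube_config).clusters[0].cluster.server, null)
  token = try(yamldecode(rancher2_cluster.mycluster.kube_config).users[0].user.token, null)
}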
@a-blender Your change was made to the rancher2_cluster resource; is there anything different when coming from the data source using only the cluster name?
I've added some more log output, and somehow origconfig is empty here.
Therefore it skips that part and always creates a new token.
The three lines in the log with "Data :" should show kube_config, name, and driver, but only name seems to work.
2023-09-01T12:48:09.913+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Refreshing Cluster ID local
2023-09-01T12:48:09.913+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Data :
2023-09-01T12:48:09.913+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Data : local
2023-09-01T12:48:09.914+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Data :
2023-09-01T12:48:09.914+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [DEBUG] Waiting for state to become: [success]
2023-09-01T12:48:09.921+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [TRACE] Finding cluster registration token for local
2023-09-01T12:48:09.928+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Found existing cluster registration token for local
2023-09-01T12:48:09.934+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Now in : getClusterKubeconfig
2023-09-01T12:48:09.934+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] len(origconfig) is 0
2023-09-01T12:48:09.934+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] origconfig :
2023-09-01T12:48:09.934+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Somehow we ended up here, wanting a new kubeconfig