Data sources create new kubeconfig token when running `terraform plan` and `terraform apply`

sgapanovich opened this issue 3 years ago

Setup:

Rancher version: 2.6.3
Rancher cluster type: HA
Terraform provider rancher2 version: 1.22.2

Steps to reproduce:

  1. Create a cluster using rancher2_cluster and/or rancher2_cluster_v2 resources
  2. Add data sources rancher2_cluster and/or rancher2_cluster_v2 with appropriate cluster names

Issue:

A new kubeconfig token is created every time you run terraform plan or terraform apply, for each data source (rancher2_cluster and rancher2_cluster_v2)

sgapanovich avatar Jan 07 '22 17:01 sgapanovich

Just off a cursory look, is this not the behavior you'd want/expect from these commands?

eliyamlevy avatar Mar 31 '22 19:03 eliyamlevy

I've just noticed that it's doing this for me as well. I have my RANCHER_ACCESS_KEY and RANCHER_SECRET_KEY env vars set, then whenever I run a terraform plan (or apply) it creates a new kubeconfig token that never expires, even though I've only got the following data block:

data "rancher2_cluster" "cluster" {
  name = var.rancher_cluster_name
}

is this not the behavior you'd want/expect from these commands?

@eliyamlevy No. Why is it creating them at all, just to get information about a rancher cluster? Shouldn't the credentials in my env vars be sufficient to get this data? Also, why doesn't it set any expiry?

flyte avatar May 27 '22 11:05 flyte

I see why it's creating them now. It returns a generated kube_config from the rancher2_cluster data block.

Perhaps this should be configurable (generate credentials, expiry etc.) and by default not generate credentials? I'm only using this data block in order to get the cluster ID.

I just had to delete 65 generated credentials from my Rancher profile because it had created one each time I ran plan or apply.

flyte avatar May 27 '22 11:05 flyte

up

raelix avatar Nov 09 '22 16:11 raelix

We've got the same problem. We provision almost everything with Terraform, so we use the rancher2_cluster module often. In the end, Rancher's etcd stored over 32k kubeconfigs and etcd performance went down.

Our workaround right now is to delete those tokens daily with a cronjob. It would be nice if this were configurable, as @flyte already suggested.

jmederer avatar Dec 21 '22 12:12 jmederer

I have investigated this issue: this is expected behavior in Rancher, but a confirmed bug in the Rancher TF provider.

Terraform downloads and caches a kube_config in the state file on every run of terraform plan or terraform apply, and performs validation (https://github.com/rancher/terraform-provider-rancher2/blob/2c3bddda8a470d9a5f7b91009436d9eb4838a33b/rancher2/resource_rancher2_cluster.go#L668) that likely has outdated logic.

We are replacing the token whenever we generate a new kubeconfig (https://github.com/rancher/terraform-provider-rancher2/blob/2c3bddda8a470d9a5f7b91009436d9eb4838a33b/rancher2/resource_rancher2_cluster.go#L707-L716) instead of using the token from the cached kube_config. This logic needs to be updated so that when a user runs terraform plan or terraform apply, the API token from the cached kube_config (if it exists) is used instead of generating a new one every time and clogging Rancher.
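
To illustrate the intended behavior, here is a minimal sketch of that reuse logic. The kubeconfig parsing and the isValid / newKubeconfigToken helpers are stand-ins for the provider's Rancher API calls, not the actual implementation:

package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

// Minimal view of a kubeconfig: only the user token matters here.
type kubeConfig struct {
	Users []struct {
		User struct {
			Token string `yaml:"token"`
		} `yaml:"user"`
	} `yaml:"users"`
}

// tokenFromCachedKubeconfig returns the API token stored in the kube_config
// cached in the Terraform state, or "" if none is present.
func tokenFromCachedKubeconfig(cached string) string {
	var cfg kubeConfig
	if err := yaml.Unmarshal([]byte(cached), &cfg); err != nil || len(cfg.Users) == 0 {
		return ""
	}
	return cfg.Users[0].User.Token
}

// getClusterToken reuses the cached token when it exists and is still valid,
// and only generates a new kubeconfig token otherwise.
func getClusterToken(cached string, isValid func(string) bool, newKubeconfigToken func() (string, error)) (string, error) {
	if t := tokenFromCachedKubeconfig(cached); t != "" && isValid(t) {
		return t, nil // reuse: no new token is created in Rancher
	}
	return newKubeconfigToken() // generate only when there is nothing to reuse
}

func main() {
	cached := "users:\n- name: local\n  user:\n    token: kubeconfig-user-abc:secret\n"
	token, _ := getClusterToken(cached,
		func(string) bool { return true },
		func() (string, error) { return "new-token", nil },
	)
	fmt.Println(token) // prints the cached token, no new token generated
}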

a-blender avatar Mar 16 '23 21:03 a-blender

There is also some discussion, currently under review, about adding a no_kubeconfig bool field so that a kubeconfig is not downloaded on every run of terraform plan and terraform apply. This is not part of the fix for this issue but would be a separate enhancement.

a-blender avatar Mar 16 '23 21:03 a-blender

FYI no_kubeconfig enhancement is logged as https://github.com/rancher/terraform-provider-rancher2/issues/1095

snasovich avatar Apr 22 '23 20:04 snasovich

Would a workaround be to use a secure place like a vault to store and retrieve the kubeconfig?

ebuildy avatar Jun 26 '23 17:06 ebuildy

@ebuildy That hasn't been approved as a workaround, no. Are you trying this on your end? The current advised workaround is to set kubeconfig-default-token-ttl-minutes in the Rancher global settings to a shorter duration so any expired tokens are cleaned up. However, TF sets new kubeconfig tokens with no expiry date, so this may not work in many cases.

a-blender avatar Jun 30 '23 21:06 a-blender

QA Test Template

Issue: https://github.com/rancher/terraform-provider-rancher2/issues/841

Problem

When using TF outputs or data resources that change after the apply, all subsequent runs of terraform plan create a new kubeconfig API token until terraform apply is run again to pick up the output and data changes. This was seen on Rancher 2.6.9, 2.6.10, and 2.7.3 in the oc space.

Solution

After investigation, the root cause of this issue is likely that Terraform downloads / generates a kubeconfig on every run of terraform plan or apply. TF replaces the kubeconfig token every time instead of using the token from the cached kubeconfig, which causes the over-generation of API tokens.

My solution is to update the getClusterKubeconfig logic described above to use the API token from the cached kubeconfig (if it exists) instead of always replacing it.

Testing

Engineering Testing

Manual Testing

Test plan

  • Run a Rancher instance on the latest v2.6 with rc v3.1.0-rc3
  • Provision a rancher2_cluster via Terraform (I did a 3 node rke cluster on Amazon EC2 nodes) by running terraform apply with your configuration
  • Check how many kubeconfig API tokens exist in the cluster
  • Run terraform plan 3 times
  • Check how many kubeconfig API tokens exist in the cluster. Verify no new tokens were generated. This means TF is correctly using the token from the cached kubeconfig.
main.tf
terraform {
  required_providers {
    rancher2 = {
      source  = "terraform.local/local/rancher2"
      version = "1.0.0"
    }
  }
}
provider "rancher2" {
  api_url   = var.rancher_api_url
  token_key = var.rancher_admin_bearer_token
  insecure  = true
}
data "rancher2_cloud_credential" "rancher2_cloud_credential" {
  name = var.cloud_credential_name
}
resource "rancher2_cluster" "rancher2_cluster" {
  name = var.rke_cluster_name
  rke_config {
    kubernetes_version = "v1.26.4-rancher2-1"
    network {
      plugin = var.rke_network_plugin
    }
  }
}
resource "rancher2_node_template" "rancher2_node_template" {
  name = var.rke_node_template_name
  amazonec2_config {
    access_key           = var.aws_access_key
	  secret_key     = var.aws_secret_key
	  region         = var.aws_region
          ami            = var.aws_ami
	  security_group = [var.aws_security_group_name]
	  subnet_id      = var.aws_subnet_id
	  vpc_id         = var.aws_vpc_id
	  zone           = var.aws_zone_letter
	  root_size      = var.aws_root_size
	  instance_type  = var.aws_instance_type
  }
}
resource "rancher2_node_pool" "pool1" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool1"
  hostname_prefix  = "tf-pool1-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = false
  etcd             = true 
  worker           = false 
}
resource "rancher2_node_pool" "pool2" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool2"
  hostname_prefix  = "tf-pool2-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = true
  etcd             = false 
  worker           = false 
}
resource "rancher2_node_pool" "pool3" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool3"
  hostname_prefix  = "tf-pool3-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = false
  etcd             = false 
  worker           = true 
}

Check number of tokens

kubectl get token.management.cattle.io | wc -l

Run test plan on latest rancher 2.7 and verify the same behavior.

Automated Testing

QA Testing Considerations

Regressions Considerations

Terraform rancher2_cluster Update / kubeconfig generation.

a-blender avatar Jun 30 '23 21:06 a-blender

We use data "rancher2_cluster" "tools" { to get k8s cluster details (host + auth) in order to configure kubectl provider:

provider "kubectl" {
  host              = try(yamldecode(data.rancher2_cluster.tools[0].kube_config).clusters[0].cluster.server, null)
  token             = try(yamldecode(data.rancher2_cluster.tools[0].kube_config).users[0].user.token, null)
}

As I am quite new to Rancher + Terraform, I'm not sure if this is a good approach? Many thanks

ebuildy avatar Jul 01 '23 13:07 ebuildy

Some offline discussion continued on this issue, and from additional testing we discovered a need to further refactor the kubeconfig token replace and util code. Essentially, this is an update to handle the expired/non-existent token cases for a cached kubeconfig gracefully, without forcing the user to re-provision their cluster. The PR is linked to this issue because it's related to the original API token fix in Terraform.

@daviswill2 I will cut another Terraform RC and add a test template with steps for the comprehensive testing I did on my end with the tokens once merged.

a-blender avatar Jul 11 '23 16:07 a-blender

QA Test Template

Please test this issue to verify the original bug was fixed using the Test Template above, and run the manual checks specified in the refactor PR description to verify the other invalid token cases are handled. Please test using v3.1.0-rc5 once it's available. Lmk if you have any questions!

a-blender avatar Jul 14 '23 14:07 a-blender

thaneunsoo said:

Test Environment:

Rancher version: v2.7-head
Rancher cluster type: HA
Docker version: 20.10


Testing:

Tested the following:

  1. Provision a single node, all-role rke cluster via TF.
  2. Verify the kubeconfig was downloaded correctly
  3. Run terraform plan 3 times. Verify the same cached kubeconfig is being used and additional tokens were not generated
  4. Set kubeconfig-default-token-ttl-minutes to 2m and run a TF update to add 1 node to the cluster. Verify a token is created with a 2m expiry date
  5. Run tf apply to add 1 node to the cluster. Verify a new token is created before the old one expires
  6. Run tf apply to add 1 node to the cluster. Verify a new token is created once the old one is removed / no longer exists
  7. Edit the terraform.tfstate file and set kube_config to a corrupt id, e.g. TESTTOKEN. Run tf apply to add 1 node to the cluster. Verify a new token is created since the old one is corrupt, with no errors

Result: No new token was created when running terraform plan 3 times. A new token was created when the config was manually corrupted.

zube[bot] avatar Jul 19 '23 18:07 zube[bot]

@a-blender

Are you sure this is fixed? I'm using v3.1.1 against the Rancher 2.7.5 local cluster and I'm getting a new token on each run.

        users:
        - name: "local"
          user:
      -     token: "kubeconfig-user-bks46hkm7v:4zw6vdxm5fcx7v2l2rp55997nf7mqvg2kcktjjs8b7xvss2z2mfn8w"
      +     token: "kubeconfig-user-bks467vgl2:x5tfgbjxb2trcjzsvcbn2bx476xqmvlbn5tvcjlfvrgnvm7kw8z7dg"
        

But the old one is still valid, I would think...

apiVersion: management.cattle.io/v3
authProvider: local
current: false
description: Kubeconfig token
expired: false
expiresAt: ""
isDerived: true
kind: Token
lastUpdateTime: ""
metadata:
  creationTimestamp: "2023-08-28T07:54:22Z"
  generateName: kubeconfig-user-bks46
  generation: 1
  labels:
    authn.management.cattle.io/kind: kubeconfig
    authn.management.cattle.io/token-userId: user-bks46
    cattle.io/creator: norman
  name: kubeconfig-user-bks46hkm7v
  resourceVersion: "49050440"
  uid: 80048daa-221c-4ad6-964d-aa7887a28bd2
token: 4zw6vdxm5fcx7v2l2rp55997nf7mqvg2kcktjjs8b7xvss2z2mfn8w
ttl: 0
userId: user-bks46
userPrincipal:
  displayName: Default Admin
  loginName: admin
  me: true
  metadata:
    creationTimestamp: null
    name: local://user-bks46
  principalType: user
  provider: local

terraform config is as simple as it gets:

provider "rancher2" {
  api_url    = var.rancher_url
  access_key = var.rancher_access_key
  secret_key = var.rancher_secret_key
}

data "rancher2_cluster" "rancher-local-cluster" {
  name = "local"
  
}

erSitzt avatar Aug 28 '23 08:08 erSitzt

@erSitzt Hello, what are your repro steps? Each run of what command? If that's happening, I would think your kubeconfig is not usable or your token(s) are expired.

a-blender avatar Aug 28 '23 15:08 a-blender

@a-blender you're right, the cert in the kubeconfig is not valid; I did not see that before.

erSitzt avatar Aug 29 '23 07:08 erSitzt

@a-blender Hmm.. I fixed my kubeconfig issue, but the data source is still creating new tokens every time.

So I tested what happens when I click "Copy KubeConfig..." in Rancher

(screenshot)

These are all from the UI

(screenshot)

And this generates new tokens every time as well... so this is not a provider bug, I guess? :)

None of the generated tokens expire... this happens in Rancher 2.7.4 and 2.7.5

I will open an issue over there..

erSitzt avatar Aug 31 '23 13:08 erSitzt

Ok... so I think I figured it out now.

Creating a rancher2_cluster resource like so

resource "rancher2_cluster" "mycluster" {
  name = "clustername"
  description = "imported cluster"
  # ...and so on
}

and then using rancher2_cluster.mycluster.kube_config will reuse the existing kube_config from the state and will not create a new token

Using the data source instead of referencing the resource directly, even if the cluster is created in the same Terraform project, like this

provider "rancher2" {
  api_url    = var.rancher_url
  access_key = var.rancher_access_key
  secret_key = var.rancher_secret_key
}

data "rancher2_cluster" "rancher-local-cluster" {
  name = "local"
  
}

will recreate the token each time and will not use anything from the state file.

@a-blender your change was made to the rancher2_cluster resource; is there anything different when coming from the data source, using only the cluster name?

erSitzt avatar Sep 01 '23 07:09 erSitzt

I've added some more log output, and somehow origconfig is empty here:

if len(origconfig) > 0 {

Therefore that part is skipped and a new token is always created.

The three lines in the log with "Data :" should show kube_config, name, and driver, but only name seems to work:

2023-09-01T12:48:09.913+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Refreshing Cluster ID local
2023-09-01T12:48:09.913+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Data :
2023-09-01T12:48:09.913+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Data : local
2023-09-01T12:48:09.914+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Data :
2023-09-01T12:48:09.914+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [DEBUG] Waiting for state to become: [success]
2023-09-01T12:48:09.921+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [TRACE] Finding cluster registration token for local
2023-09-01T12:48:09.928+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Found existing cluster registration token for local
2023-09-01T12:48:09.934+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Now in : getClusterKubeconfig
2023-09-01T12:48:09.934+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] len(origconfig) is 0
2023-09-01T12:48:09.934+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] origconfig :
2023-09-01T12:48:09.934+0200 [DEBUG] provider.terraform-provider-rancher2_v3.1.1: 2023/09/01 12:48:09 [INFO] Somehow we ended up here, wanting a new kubeconfig
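
A condensed sketch of the control flow those log lines suggest (the helper functions are hypothetical, not the provider's code): origconfig is the kube_config handed in by the caller, and since the data source read passes nothing, the reuse branch is always skipped.

package main

import "fmt"

// getKubeconfig mirrors the branch above: reuse only happens when the caller
// supplies a previously stored kube_config (origconfig); otherwise a new
// token is generated on every refresh.
func getKubeconfig(origconfig string, reuse func(string) (string, bool), generate func() string) string {
	if len(origconfig) > 0 {
		if cfg, ok := reuse(origconfig); ok {
			return cfg // resource path: cached kube_config is reused
		}
	}
	return generate() // data source path: origconfig is empty, new token every time
}

func main() {
	generate := func() string { return "new kubeconfig token" }
	reuse := func(c string) (string, bool) { return c, true }
	fmt.Println(getKubeconfig("", reuse, generate))       // data source: generates
	fmt.Println(getKubeconfig("cached", reuse, generate)) // resource: reuses
}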

erSitzt avatar Sep 01 '23 10:09 erSitzt