
Kubernetes cluster reachable in one provider, unreachable in other

Open Spritekin opened this issue 1 year ago • 9 comments

Description

I have used Terraform and EKS for years with no problems dynamically creating and destroying clusters, no big deal. Lately I had to set up a new cluster from scratch, so I upgraded all module versions to the latest. Since then I have had many problems with the kubernetes, helm and kubectl providers, specifically the popular: "Error: Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable"

What bugs me is that I can install and destroy the whole EKS cluster, but if I leave it untouched for a couple of days and then try to change anything, I get the error above.

So I decided to get to the bottom of this instability and enabled logging at TF_LOG = TRACE level (https://support.hashicorp.com/hc/en-us/articles/360001113727-Enabling-trace-level-logs-in-Terraform-CLI-Cloud-or-Enterprise). I'm attaching the whole log (account codes and cluster ids redacted) used for this post. Let me summarise the setup. The module (attached to this post) goes like this:

  • Install AWS boilerplate resources
  • Create Kubernetes provider with alias init (kubernetes.init). This is classic EKS module initialisation but using an alias and a role.
provider "kubernetes" {
  host                   = module.kubernetes_cluster.cluster_endpoint
  cluster_ca_certificate = base64decode(module.kubernetes_cluster.cluster_certificate_authority_data)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    args        = ["eks", "get-token", "--cluster-name", module.kubernetes_cluster.cluster_name, "--role-arn", var.kubernetes_access_role ]
    command     = "aws"
  }
  alias                  = "init"
}
module "kubernetes_cluster" {
  source                 = "terraform-aws-modules/eks/aws"
  version                = "19.15.3"
  providers = {
    kubernetes = kubernetes.init
  }
  cluster_name           = local.cluster_name
  ...
}
  • Create Kubernetes cluster USING THE INIT PROVIDER.
  • Create two data objects to read the existing cluster by name... they depend on the EKS cluster module finishing, so they expect the cluster to exist.
data "aws_eks_cluster" "cluster" {
  name = local.cluster_name
  depends_on = [module.kubernetes_cluster]
}

data "aws_eks_cluster_auth" "cluster" {
  name = local.cluster_name
  depends_on = [module.kubernetes_cluster]
}
  • Create kubernetes, helm and kubectl providers using the data objects. These providers do not use the EKS module outputs.
## Connect to the cluster
provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    args        = ["eks", "get-token", "--cluster-name", local.cluster_name, "--role-arn", var.kubernetes_access_role ]
    command     = "aws"
  }
}

# Define kubernetes based providers
provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.cluster.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      args        = ["eks", "get-token", "--cluster-name", local.cluster_name, "--role-arn", var.kubernetes_access_role ]
      command     = "aws"
    }
  }
}


provider "kubectl" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    args        = ["eks", "get-token", "--cluster-name", local.cluster_name, "--role-arn", var.kubernetes_access_role ]
    command     = "aws"
  }
}
  • Install cluster resources (secrets, helm charts, etc.) using those three providers; a minimal sketch follows below.
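
For context, the kind of thing being installed through those three providers looks roughly like this (an illustrative sketch only; the real resources live in the attached module, and the secret, variable and chart names here are hypothetical):

# Sketch only: real resources are in the attached module; names below are hypothetical.
resource "kubernetes_secret" "example" {
  metadata {
    name      = "example-credentials"
    namespace = "kube-system"
  }
  data = {
    token = var.example_token
  }
}

resource "helm_release" "istio_base" {
  name             = "istio-base"
  repository       = "https://istio-release.storage.googleapis.com/charts"
  chart            = "base"
  namespace        = "istio-system"
  create_namespace = true
}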

Here is how the log maps to the steps above:

  • Install AWS boilerplate resources: check. You can read the logs and verify those are not a problem at all.

  • Create Kubernetes provider with alias init (kubernetes.init): check, see the logs below. You can see the provider being correctly initialised; the "aws eks get-token" command is set correctly.

2023-07-04T05:26:02.441Z [WARN]  ValidateProviderConfig from "module.au_support.provider[\"registry.terraform.io/hashicorp/kubernetes\"].init" changed the config value, but that value is unused
2023-07-04T05:26:02.441Z [TRACE] GRPCProvider: ConfigureProvider
2023-07-04T05:26:02.443Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: 2023-07-04T05:26:02.442Z 
[TRACE] [Configure]: [ClientConfig]="{https://CE1EBXXXXXXXXXXXXXX7DAD916FBE7.gr7.ap-southeast-2.eks.amazonaws.com  {  <nil> 0xc0014c41c0}     {  [] map[]} <nil> <nil> api.ExecConfig{Command: \"aws\", Args: []string{\"--- REDACTED ---\"}, 
Env: []ExecEnvVar(nil), APIVersion: \"client.authentication.k8s.io/v1beta1\", ProvideClusterInfo: false, Config: runtime.Object(nil), StdinUnavailable: false} rest.sanitizedTLSClientConfig{Insecure:false, ServerName:\"\", CertFile:\"\", KeyFile:\"\", 
CAFile:\"\", CertData:[]uint8(nil), KeyData:[]uint8(nil), CAData:[]uint8{0x2d, 0x2d, 0x2d, 0x2d, 0x2d, 0x42, 0x45, 0x47, 0x49, 0x4e, 0x20, 0x43, 0x45 ....
  • Create Kubernetes cluster USING THE INIT PROVIDER: check. In this case, as the cluster is already created, it SUCCESSFULLY reads the aws-auth configmap using the EKS endpoint via the kubernetes.init provider:
2023-07-04T05:26:23.359Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: 2023/07/04 05:26:23 [DEBUG] Kubernetes API Request Details:
2023-07-04T05:26:23.359Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: ---[ REQUEST ]---------------------------------------
2023-07-04T05:26:23.359Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: GET /api/v1/namespaces/kube-system/configmaps/aws-auth HTTP/1.1
2023-07-04T05:26:23.359Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: Host: CE1EXXXXXXXXXXXXXXXXXXXXX6FBE7.gr7.ap-southeast-2.eks.amazonaws.com

2023-07-04T05:26:24.766Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: 2023/07/04 05:26:24 [DEBUG] Kubernetes API Response Details:
2023-07-04T05:26:24.766Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: ---[ RESPONSE ]--------------------------------------
2023-07-04T05:26:24.766Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: HTTP/2.0 200 OK
2023-07-04T05:26:24.766Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5:  "kind": "ConfigMap",
2023-07-04T05:26:24.766Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5:  "apiVersion": "v1",
2023-07-04T05:26:24.766Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5:  "metadata": {
2023-07-04T05:26:24.766Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5:   "name": "aws-auth",
  • Create two data objects to read the existing cluster by name... they depend on the EKS cluster module finishing, so they expect the cluster to exist: check, because there are read confirmations.
2023-07-04T05:26:24.776Z [TRACE] writeChange: recorded Read change for module.au_support.data.aws_eks_cluster.cluster
2023-07-04T05:26:24.776Z [TRACE] vertex "module.au_support.output.kubernetes_cluster": visit complete
2023-07-04T05:26:24.777Z [TRACE] writeChange: recorded Read change for module.au_support.data.aws_eks_cluster_auth.cluster
2023-07-04T05:26:24.777Z [TRACE] vertex "module.au_support.data.aws_eks_cluster_auth.cluster": visit complete
  • Create kubernetes, helm and kubectl providers using the data objects: apparently they start. However, we don't see the initialisation pattern of kubernetes.init (see the steps above). I don't think that's correct; we should be seeing the aws eks get-token command as before.
2023-07-04T05:26:25.009Z [DEBUG] provider: starting plugin: path=.terraform/providers/registry.terraform.io/hashicorp/kubernetes/2.21.1/linux_amd64/terraform-provider-kubernetes_v2.21.1_x5 args=[.terraform/providers/registry.terraform.io/hashicorp/kubernetes/2.21.1/linux_amd64/terraform-provider-kubernetes_v2.21.1_x5]
2023-07-04T05:26:25.011Z [DEBUG] provider: plugin started: path=.terraform/providers/registry.terraform.io/hashicorp/kubernetes/2.21.1/linux_amd64/terraform-provider-kubernetes_v2.21.1_x5 pid=503
2023-07-04T05:26:25.011Z [DEBUG] provider: waiting for RPC address: path=.terraform/providers/registry.terraform.io/hashicorp/kubernetes/2.21.1/linux_amd64/terraform-provider-kubernetes_v2.21.1_x5
2023-07-04T05:26:25.068Z [INFO]  provider.terraform-provider-kubernetes_v2.21.1_x5: configuring server automatic mTLS: timestamp=2023-07-04T05:26:25.068Z
2023-07-04T05:26:25.086Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: plugin address: address=/tmp/plugin2899961592 network=unix timestamp=2023-07-04T05:26:25.086Z
2023-07-04T05:26:25.086Z [DEBUG] provider: using plugin: version=5
2023-07-04T05:26:25.103Z [TRACE] BuiltinEvalContext: Initialized "module.au_support.provider[\"registry.terraform.io/hashicorp/kubernetes\"]" provider for module.au_support.provider["registry.terraform.io/hashicorp/kubernetes"]
2023-07-04T05:26:25.103Z [TRACE] NodeApplyableProvider: configuring module.au_support.provider["registry.terraform.io/hashicorp/kubernetes"]
2023-07-04T05:26:25.103Z [TRACE] buildProviderConfig for module.au_support.provider["registry.terraform.io/hashicorp/kubernetes"]: using explicit config only

All three providers (kubernetes, helm and kubectl) use the same pattern and all of them look initialised. However, from then on any attempt to read a kubernetes, helm or kubectl resource does this:

2023-07-04T05:26:25.171Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: ---[ REQUEST ]---------------------------------------
2023-07-04T05:26:25.171Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: GET /api/v1/namespaces/istio-system HTTP/1.1
2023-07-04T05:26:25.171Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: Host: localhost
2023-07-04T05:26:25.171Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: User-Agent: HashiCorp/1.0 Terraform/1.5.0
2023-07-04T05:26:25.171Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: Accept: application/json, */*
2023-07-04T05:26:25.171Z [DEBUG] provider.terraform-provider-kubernetes_v2.21.1_x5: Accept-Encoding: gzip

Which obviously fails because it is trying to reach localhost. It then returns "no configuration has been provided", which is obviously false, as you can see in the three provider definitions above.

2023-07-04T05:26:25.117Z [INFO]  provider.terraform-provider-helm_v2.10.1_x5: 2023/07/04 05:26:25 [DEBUG] Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable: timestamp=2023-07-04T05:26:25.117Z

However, there is no output that shows why this is reporting an error.

Notes:

  • The role is not the problem: the same "aws eks" command is used successfully in the kubernetes.init provider to fetch the aws-auth data. I can also use the role to connect to the cluster.
  • EKS aws-auth is not the problem either because the same role can fetch the configmap.

This looks like some problem in the latest provider versions. I never had this problem with previous versions.

So any ideas on how to debug this please?

Final note: I added the KUBERNETES_MASTER environment variable just to see if I could make it stop looking for "localhost"... same error.

  • [ X ] ✋ I have searched the open/closed issues and my issue is not listed (it appears but was dismissed)

Versions

  • Module version [Required]: 5.1.0
  • Terraform version: TFC Terraform v1.5.0 on linux_amd64
  • Provider version(s):
provider registry.terraform.io/hashicorp/aws has 5.1.0 to satisfy ">= 3.72.0, >= 4.47.0, >= 4.57.0, 5.1.0"
registry.terraform.io/hashicorp/time has 0.9.1 to satisfy ">= 0.9.0"
registry.terraform.io/hashicorp/tls has 4.0.4 to satisfy ">= 3.0.0"
registry.terraform.io/carlpett/sops has 0.6.2 to satisfy "0.6.2"
registry.terraform.io/hashicorp/null has 3.2.1 to satisfy "~> 3.2.1"
registry.terraform.io/hashicorp/random has 3.5.1 to satisfy "~> 3.5.1"
registry.terraform.io/hashicorp/kubernetes has 2.21.1 to satisfy ">= 2.10.0, 2.21.1"
registry.terraform.io/hashicorp/helm has 2.10.1 to satisfy "2.10.1"
registry.terraform.io/gavinbunney/kubectl has 1.14.0 to satisfy "1.14.0"
registry.terraform.io/hashicorp/cloudinit has 2.3.2 to satisfy ">= 2.0.0"
registry.terraform.io/hashicorp/local has 2.4.0 to satisfy "~> 2.4.0"

Reproduction Code [Required]

terraform-aws-module-environment.zip

Steps to reproduce the behavior:

Yes, TFC workspaces.

Environment executed in TFC.

Installed the attached module. Destroyed everything, reinstalled, did this a couple of times: ALL GOOD. Left it alone for a couple of days. Tried to add a role in aws-auth, then every resource installed on top of the cluster started failing with "Error: Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable". Note this is not the first time it has happened. I thought it was a one-time problem, so I destroyed everything manually and reinstalled: all good. Then I tested destroying the whole stack and recreating it: all good. Then I left it for a couple of days and the problem appeared again.

Expected behavior

To continue building and destroying with no changes.

Actual behavior

Two days later it started throwing the errors above.

Additional context

The error appeared when trying to add a new role to the aws-auth configmap. It is not added; the run does not pass the planning phase. I know the roles used are ok because I can assume the same roles and access the cluster via kubectl. I also know the logs mention a difference between the aws-auth and the Terraform resource dates; that is because I added and removed a role to test access AFTER the problem appeared.
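
For reference, the failing change was roughly of this shape (a hedged sketch of the aws-auth inputs of the EKS module; the role ARN below is hypothetical and the rest of the module configuration is as shown earlier):

# Sketch of the aws-auth change that triggered the error; the role ARN is hypothetical.
module "kubernetes_cluster" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.15.3"
  # ... existing configuration as shown above ...

  manage_aws_auth_configmap = true
  aws_auth_roles = [
    {
      rolearn  = "arn:aws:iam::111111111111:role/extra-admin" # hypothetical
      username = "extra-admin"
      groups   = ["system:masters"]
    },
  ]
}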

run-bm3dFtcu3iv7hkZ6-plan-log.txt.zip

Spritekin avatar Jul 04 '23 09:07 Spritekin

Hi @Spritekin,

In your code, the data.aws_eks_cluster.cluster data source has a depends_on on the module. Because of that, the Kubernetes and Helm provider blocks don't get host and cluster_ca_certificate and fall back to default values. We always recommend separating cluster management from resource management to avoid issues like this.
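
A minimal sketch of what I mean, reusing the exec pattern from your kubernetes.init provider so the values come straight from the EKS module outputs instead of through data sources with depends_on (ideally the Kubernetes/Helm resources would live in a separate configuration from the cluster itself):

# Sketch: feed the providers from the module outputs directly, with no
# depends_on data sources in between (the same idea applies to helm and kubectl).
provider "kubernetes" {
  host                   = module.kubernetes_cluster.cluster_endpoint
  cluster_ca_certificate = base64decode(module.kubernetes_cluster.cluster_certificate_authority_data)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    args        = ["eks", "get-token", "--cluster-name", module.kubernetes_cluster.cluster_name, "--role-arn", var.kubernetes_access_role]
    command     = "aws"
  }
}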

I hope it helps.

Thank you.

arybolovlev avatar Jul 12 '23 15:07 arybolovlev

@arybolovlev

No, sorry, your explanation does not explain the behaviour. This code works correctly one day: I can create, destroy and install charts, even add more charts to my TF code, and it works fine. But if I leave it alone for a couple of days it stops working.

As far as I know, depends_on is a coordination mechanism that allows a resource, even a data source, to wait for the creation or validation of another resource. In the example above, the cluster named by local.cluster_name already exists, so there is no reason for the data object not to read the cluster information, regardless of how long it needs to wait.

Spritekin avatar Jul 17 '23 03:07 Spritekin

@arybolovlev Thanks a lot. Your suggestion fixed the issue for me.

alfredocambera avatar Dec 04 '23 16:12 alfredocambera

I have the same issue now. I even tried to hardcode the host and cluster_ca_certificate values. The cluster is still reachable via helm and kubectl with the same user, but the Terraform provider gives me: Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

daemon4d-us avatar Dec 18 '23 19:12 daemon4d-us

Hi there @daemon4d-us, did you manage to figure out what the issue was? I am having the same trouble right now.

acaceres-tw avatar Feb 07 '24 17:02 acaceres-tw

Can confirm this happened to me as well. Subscribing.

nickveldrin avatar Feb 15 '24 16:02 nickveldrin

Same error. Subscribing.

malodie avatar Feb 16 '24 15:02 malodie

Just FYI, I did many tests on this problem. I can guarantee the problem does not appear when running Terraform 1.3.5. The problem does appear when I upgrade to Terraform 1.5.0 or greater; I haven't tested versions in between. When using TF 1.3.5 I can freely create and destroy the stacks with Kubernetes, i.e.:

provider "kubernetes" {}

module "eks" {
  # ... init EKS using the kubernetes provider
}

provider "helm" {}
provider "kubectl" {}

# ... then work with helm or kubectl resources normally.

This setup works fine and I can create, modify or destroy stacks in TF 1.3.5 WITH NO PROBLEM.

When I switch to terraform 1.5.0 or higher the dependency problem appears and it can break anytime with the problems I described above.

So the only solutions I have are: a. Stay on 1.3.5, which is obviously a temporary solution because of security. b. Use 1.5.0+ but split the EKS creation into one module and the helm and kubectl resources into a separate module (sketched below). This is a messy solution in the end; my architecture is much more complex and harder to explain because of this.
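
For option b, the second root module/workspace ends up reading the cluster details from the first one, roughly like this (a sketch only; the backend, organization, workspace and output names are hypothetical and depend on what the first workspace exposes):

# Second workspace: consume outputs from the workspace that owns the EKS cluster.
data "terraform_remote_state" "eks" {
  backend = "remote"
  config = {
    organization = "my-org" # hypothetical
    workspaces = {
      name = "eks-cluster" # hypothetical workspace holding the EKS module
    }
  }
}

provider "helm" {
  kubernetes {
    host                   = data.terraform_remote_state.eks.outputs.cluster_endpoint
    cluster_ca_certificate = base64decode(data.terraform_remote_state.eks.outputs.cluster_certificate_authority_data)
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      args        = ["eks", "get-token", "--cluster-name", data.terraform_remote_state.eks.outputs.cluster_name, "--role-arn", var.kubernetes_access_role]
      command     = "aws"
    }
  }
}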

This is an important problem and needs to be fixed.

Spritekin avatar Feb 19 '24 02:02 Spritekin

@Spritekin Upon reading your comment, I did some testing with various versions, and for me the issue exists in 1.5.7 and only in 1.5.7. Higher versions, such as 1.6.x and 1.7.x, do not exhibit this behavior, nor do lower versions.

I've upgraded to 1.7.3 and all is fine now. So thanks for pointing me in the right direction, I guess.

fwierda avatar Feb 20 '24 20:02 fwierda