terraform-provider-kubernetes

terraform refresh attempts to dial localhost (reopening with workaround)

Open konryd opened this issue 3 years ago • 50 comments

This is a re-opening of #546

Occasionally, the kubernetes provider will start dialing localhost instead of the configured kubeconfig context.

Error: Get http://localhost/api/v1/namespaces/prometheus: dial tcp 127.0.0.1:80: connect: connection refused
Error: Get http://localhost/api/v1/namespaces/debug: dial tcp 127.0.0.1:80: connect: connection refused

In the instance of this problem that I ran into, the cause was multiple Terraform threads opening and writing the kubeconfig file without synchronization, which resulted in a corrupted kubeconfig file. This might be related to the fact that my Terraform config includes multiple clusters (using this approach).
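For illustration, here is a hypothetical sketch of that kind of multi-cluster layout, with two aliased providers sharing a single kubeconfig file; the names and path are made up, not taken from my actual config. If anything regenerates that shared file while Terraform operates on both clusters in parallel, the writes can race:

# Illustrative only: both aliases read the same kubeconfig file.
provider "kubernetes" {
  alias          = "cluster_a"
  config_path    = "${path.module}/kubeconfig"
  config_context = "cluster-a"
}

provider "kubernetes" {
  alias          = "cluster_b"
  config_path    = "${path.module}/kubeconfig"
  config_context = "cluster-b"
}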

Workaround

I was able to make this go away by setting: -parallelism=1
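For anyone looking for the exact invocation, the flag goes on the command line of apply (and works for plan and destroy as well):

terraform apply -parallelism=1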

konryd avatar Oct 06 '20 08:10 konryd

Thanks for the provided workaround. We are also hitting this bug from time to time. I tried the parallelism approach and did not see the 'localhost issue' again. However, we ran into a different issue with this.

I would love to know why this bug happens at all (and why it can be mitigated by reducing the number of Terraform threads). We create a kubeconfig file before we run terraform apply and pass its path as a variable into our Terraform modules. The kubernetes provider then simply uses this path via var.kubeconfig. Still, from time to time the k8s provider tries to connect to localhost, even though the file exists and its content is valid.

Here is our providers.tf:

provider "kubernetes" {
  config_path = var.kubeconfig
}

provider "helm" {
  kubernetes {
    config_path = var.kubeconfig
  }
  version = ">= 1.2.1"
}
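For completeness, a minimal sketch of how that path might be wired in; the variable name matches the snippet above, while the way the value is supplied is illustrative:

variable "kubeconfig" {
  type        = string
  description = "Path to the kubeconfig file generated before terraform apply"
}

The path is then passed in at run time, e.g. terraform apply -var="kubeconfig=./kubeconfig".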

thirdeyenick avatar Oct 27 '20 09:10 thirdeyenick

it happens to us too :(

igoooor avatar Oct 27 '20 13:10 igoooor

To be specific, in my case it happens during "Refreshing state...", and my provider looks like this:

provider "kubernetes" {
  load_config_file = false

  host                   = "https://${data.google_container_cluster.this.0.endpoint}"
  client_certificate     = data.google_container_cluster.this.0.master_auth.0.client_certificate
  client_key             = data.google_container_cluster.this.0.master_auth.0.client_key
  cluster_ca_certificate = data.google_container_cluster.this.0.master_auth.0.cluster_ca_certificate
}

If I run the same command (apply or destroy) with -refresh=false then it works fine

-parallelism=1 is not helping for me, the error is happening constantly.

igoooor avatar Oct 27 '20 15:10 igoooor

Interesting @igoooor, does it also try to connect to localhost in your case? We have a similar issue, like the one you describe, but in those cases we just get a 'permission denied' message (no indication that it tries to connect to localhost). If we use -refresh=false then everything works. I have the feeling that Terraform uses an old client certificate which is no longer valid (maybe cached in the state?).

thirdeyenick avatar Oct 27 '20 16:10 thirdeyenick

In my case I get the localhost error, yes, but only when refreshing.

Error: Get http://localhost/api/v1/namespaces/xxx: dial tcp 127.0.0.1:80: connect: connection refused

If I replace my provider config and use variables (for host, client_certificate, etc...) instead of data.google_container_cluster... then it also works at refresh time. It seems like when refreshing the state, it does not load values from data.google_container_cluster....
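For reference, a sketch of that variable-based wiring; the variable names are illustrative, the point being that the values are supplied directly instead of being resolved from data.google_container_cluster during refresh:

provider "kubernetes" {
  load_config_file = false

  host                   = "https://${var.cluster_endpoint}"
  client_certificate     = var.client_certificate
  client_key             = var.client_key
  cluster_ca_certificate = var.cluster_ca_certificate
}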

igoooor avatar Oct 27 '20 16:10 igoooor

Yeah, this might be the case. In most of our cases we are not using terraform data sources to fill in the access credentials, but we are still experiencing this bug. I am currently checking if I get the same issue when not using a created kubeconfig file, but passing the client_certificate, client_key, etc instead directly via variables to the provider.

thirdeyenick avatar Oct 27 '20 16:10 thirdeyenick

This only happens to me since I updated to Terraform 0.13 today. I stayed on Terraform 0.12 until now because of some other things and never had this problem; it only appears now with the latest version.

igoooor avatar Oct 27 '20 17:10 igoooor

I'm unable to reproduce this scenario. To me import seems to work as expected.

@igoooor Is the cluster referred to by data.google_container_cluster.this in your case already present, or are you also creating the cluster in that same apply operation?

Also, everyone else, please post the versions of Terraform and provider you used.

alexsomesan avatar Oct 28 '20 15:10 alexsomesan

It is already present, before starting the terraform command.

igoooor avatar Oct 28 '20 15:10 igoooor

Alright, thanks for clarifying that. Have you tried an A/B test, providing the credentials for that same cluster via a kubeconfig file?

alexsomesan avatar Oct 28 '20 15:10 alexsomesan

It works via kubeconfig and via explicitly set parameters for host, client_certificate, etc., but it does not work when host, client_certificate, etc. are set from data.google_container_cluster.this.

And again, it only fails during refresh; if I apply with -refresh=false then it works.

igoooor avatar Oct 28 '20 15:10 igoooor

It also happens when I'm using a resource instead of a data source. Of course not on the first apply, when it creates the cluster, but if I apply again afterwards it will try to refresh, and there it fails with the same error.

igoooor avatar Nov 05 '20 17:11 igoooor

For information, neither workaround works when using the remote backend:

Error: Custom parallelism values are currently not supported

The "remote" backend does not support setting a custom parallelism value at
this time.
Error: Planning without refresh is currently not supported

Currently the "remote" backend will always do an in-memory refresh of the
Terraform state prior to generating the plan.

sereinity avatar Dec 01 '20 12:12 sereinity

I am experiencing the same issue @igoooor. The only difference is that I am using digitalocean instead of google.

sodre avatar Dec 19 '20 19:12 sodre

I ran into this issue as well. What appears to have happened in my case is that I had originally created a kubernetes_secret resource in a module's main.tf. Things changed and that resource was removed since it was no longer needed. When the refresh happened, I guess because the original resource definition no longer existed, the provider ignored its configuration and always tried to use localhost. Without looking at the code, I'd say that if a secret (or maybe another k8s resource) is in the state file (we use remote state) but its definition has been removed from the configuration, this will happen (but that's just a guess).

Our fix was to remove that resource from the state manually and then clean up the resource itself by hand.
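As a concrete example of that cleanup, the state removal might look like this; the resource address is hypothetical:

terraform state rm kubernetes_secret.example

The secret itself can then be deleted out-of-band, e.g. with kubectl.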

holleyism avatar Jan 15 '21 03:01 holleyism

Same problem here, are any workarounds available? Unfortunately, none of the above is applicable to Terraform Cloud.

pduchnovsky avatar Mar 02 '21 15:03 pduchnovsky

@pduchnovsky on Terraform Cloud, you should still be able to use the above workarounds.

  1. To set -parallelism=1, you would add an environment variable named TFE_PARALLELISM and set it to 1. (see https://www.terraform.io/docs/cloud/workspaces/variables.html#special-environment-variables)
  2. All of the terraform state subcommands still work with Terraform Cloud, so if you wanted to manually delete a resource from the state for manual cleanup, you can run terraform state rm path.to.kubernetes_resource.name locally and it will update Cloud. Similarly terraform state pull and terraform state push also work, if you needed to pull the entire state file down from Terraform Cloud.

jacobwgillespie avatar Mar 02 '21 16:03 jacobwgillespie

To be honest, this workaround is not really acceptable. For example, I am creating a single GKE cluster with two non-default node pools, one of which is GPU-enabled. I then deploy around 10 kubernetes_deployment resources, one of which takes about 8 minutes to create on average (big images), and it would take ages to deploy/update those with parallelism set to 1. I 'could' use the older version of this provider, but it doesn't work with the taint "nvidia.com/gpu" since it has a dot and a slash in the name.

So for the time being my workaround is this: after the cluster is created, I extract its IP and certificate into variables and then reference those in the provider. Of course I now cannot change the cluster itself, but that's not something we do often.
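A sketch of that kind of pinning, as a terraform.tfvars file; both values here are obviously placeholders, filled in once after the cluster exists:

cluster_endpoint       = "203.0.113.10"      # hypothetical cluster IP
cluster_ca_certificate = "LS0tLS1CRUdJTi..."  # truncated, illustrative

The provider then references var.cluster_endpoint and var.cluster_ca_certificate instead of the cluster resource's attributes.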

Looking forward to PR #1078 being merged.

pduchnovsky avatar Mar 03 '21 08:03 pduchnovsky

Also ran into this. I was able to work around it with KUBECONFIG, but that has the exact issue stated above: I can no longer re-create my cluster, as the provider now depends on hard-coded/pre-existing values instead of runtime values.

RichiCoder1 avatar Mar 08 '21 21:03 RichiCoder1

I am not sure if this issue is related, but I have figured out that if a cluster needs recreation, then the outputs for host/CA certificate/etc. are empty. The empty values are then passed to the provider, which produces the connection-to-localhost error and hides the original fact that the cluster needs recreation.

So you see the "localhost" error in the output, while the hidden underlying problem is that the cluster will be re-created (and obviously there is no host to connect to before it is created).

See this issue for more details.

ilya-git avatar Mar 17 '21 09:03 ilya-git

I am also experiencing the same issue as @ilya-git where the refresh/plan will fail if a cluster that is referenced in a dynamic provider configuration needs to be recreated.

resource "aws_eks_cluster" "main" {
  # contains some update that requires a destroy/create rather than a modify
}

provider "kubernetes" {
  host = aws_eks_cluster.main.endpoint
  ...
}

This reliably produces an error message similar to the one in the initial comment on this issue, on the first attempt to refresh a resource using the kubernetes provider.

Targeting the cluster in a "first pass" and then proceeding with the rest appears to be a viable workaround; i.e. https://github.com/hashicorp/terraform/issues/4149 would appear to address this properly.
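In practice the two-pass approach looks something like this, using the resource address from the snippet above:

terraform apply -target=aws_eks_cluster.main
terraform apply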

alex-shafer-ceshop avatar Apr 05 '21 17:04 alex-shafer-ceshop

I hit the problem by destroying an AKS cluster. I'm passing the Kubernetes configuration from the state:

provider "kubernetes" {
  host                   = azurerm_kubernetes_cluster.aks.kube_config.0.host
  client_key             = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_key)
  client_certificate     = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_certificate)
  cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.cluster_ca_certificate)
}

With that configuration, Terraform is able to create a namespace. If I initiate terraform destroy, I get

Error: Get http://localhost/api/v1/namespaces/xxx: dial tcp 127.0.0.1:80: connect: connection refused

when Terraform is refreshing the namespace resource.

Terraform v1.0.0 on windows_amd64

  • provider registry.terraform.io/hashicorp/azurerm v2.67.0
  • provider registry.terraform.io/hashicorp/kubernetes v2.3.2

Krebsmar avatar Jul 13 '21 11:07 Krebsmar

@Krebsmar I've hit the same issue now with AKS cluster. Did you manage to resolve this?

ahilmathew avatar Jul 26 '21 04:07 ahilmathew

The workaround does not work for me sadly :/

schealex avatar Aug 10 '21 13:08 schealex

Seeing this same issue with the EKS module: https://github.com/terraform-aws-modules/terraform-aws-eks

Initial apply works fine; subsequent changes to the cluster fail with Terraform attempting to connect to the k8s API on localhost. The parallelism workaround has no effect.

joshuaganger avatar Aug 18 '21 20:08 joshuaganger

I've been using the Kubernetes provider version 2.4.1 and none of the above solutions worked for me. My configuration uses the gke_auth module to get the cluster configuration. Setting parallelism to 1, avoiding the use of a kubeconfig, and moving to a lower version of the provider fixed the issue; I'm now using version 2.3.2.

My provider config:

required_providers {
  google = {
    source  = "hashicorp/google"
    version = ">=3.78.0"
  }

  kubernetes = {
    source  = "hashicorp/kubernetes"
    version = ">= 2.3.2"
  }
}

provider "kubernetes" {
  cluster_ca_certificate = module.gke_auth.cluster_ca_certificate
  host                   = module.gke_auth.host
  token                  = module.gke_auth.token
}

Rodrigonavarro23 avatar Aug 23 '21 12:08 Rodrigonavarro23

@Krebsmar I've hit the same issue now with AKS cluster. Did you manage to resolve this?

Hi @ahilmathew, sorry for the late response. Currently I'm adding the -refresh=false argument to my commands; no better workaround for now:

terraform plan -out "planfile" -refresh=false
terraform apply -input=false "planfile"

Greats!

Krebsmar avatar Sep 01 '21 10:09 Krebsmar

Adding -refresh=false can let "terraform plan" pass, but then the apply step will destroy the cluster. This is something you might not want to happen.

iamruss avatar Sep 14 '21 08:09 iamruss

Having exactly the same issue when trying to modify an Azure k8s cluster using Terraform and the Kubernetes provider. Everything worked fine until I manually enabled Azure monitoring on the cluster. Now, for all resources, I'm always getting:

Get "http://localhost/apis/rbac.authorization.k8s.io/v1/clusterroles/system:azure-cloud-provider": dial tcp [::1]:80: connectex: Es konnte keine Verbindung hergestellt werden, da der Zielcomputer die Verbindung verweigerte.

rpanitzek avatar Sep 21 '21 14:09 rpanitzek

This issue seems pretty dead to me, which is extremely unfortunate because I have exactly this same issue with AWS EKS. I have two clusters in my configuration and I get exactly this problem every single time. When I remove one cluster from the configuration, it works totally fine. I assume the parallelism workaround would fix it, but with the huge amount of infrastructure we have, it makes no sense to do that.

Is there anyone even still looking at this issue? I cannot identify the problem.

DutchEllie avatar Sep 27 '21 14:09 DutchEllie