
Timeouts combined with wait_for can cause resource/state conflict.


When using the new kubernetes_manifest resource, if I intentionally configure a resource such that the wait_for clause will never be met, the run eventually hits the timeout threshold. By the time this happens, the resource has already been created on the Kubernetes cluster. However, when I run Terraform a second time, the resource appears not to exist in the state, and an error is then thrown for a conflict with the existing resource.

My example is a cert-manager Issuer CRD. You must provide a valid ACME URL and an acceptable email address; otherwise the Issuer resource will remain in an invalid status condition. For example, with a bogus ACME server you get stuck with the following status:

status:
  acme: {}
  conditions:
  - lastTransitionTime: "2021-10-13T19:35:17Z"
    message: 'Failed to verify ACME account: Get "https://acme.example.org": dial
      tcp: lookup acme.example.org on 10.108.0.10:53: no such host'
    reason: ErrRegisterACMEAccount
    status: "False"
    type: Ready

When the timeout is hit, the run fails, which is expected. However, the Issuer still exists on the Kubernetes cluster; Terraform seems to forget about it and tries to create it again the next time you run an apply.

Terraform Version, Provider Version and Kubernetes Version

Terraform version: v1.0.8
Kubernetes provider version: kubernetes v2.5.0
Kubernetes version: 1.19.13-gke.1200

Affected Resource(s)

  • kubernetes_manifest

Terraform Configuration Files

variable "api_version" {
  type        = string
  description = "The cert-manager API version."
  default     = "cert-manager.io/v1"
}

variable "dns_zones" {
  type        = list(string)
  default     = []
  description = "The DNS zones that can be solved by this solver."
}

variable "email" {
  type        = string
  description = "The email address for the issuer."
}

variable "kind" {
  type        = string
  description = "The Kind of the issuer (e.g. Issuer or ClusterIssuer)."

  validation {
    condition     = var.kind == "Issuer" || var.kind == "ClusterIssuer"
    error_message = "The issuer kind must be one of: ClusterIssuer, Issuer."
  }
}

variable "name" {
  type        = string
  description = "The name of the issuer."
}

variable "namespace" {
  type        = string
  description = "The kubernetes namespace to use when provisioning the issuer."
  default     = null
}

variable "project_id" {
  type        = string
  description = "The GCP Project ID."
}

variable "server" {
  type        = string
  description = "The server for the issuer."
}


resource "kubernetes_manifest" "issuer" {
  for_each = var.kind == "ClusterIssuer" ? toset([]) : toset([var.name])

  manifest = {
    apiVersion = var.api_version
    kind       = var.kind

    metadata = {
      name      = var.name
      namespace = var.namespace
    }

    spec = {
      acme = {
        email  = var.email
        server = var.server
        privateKeySecretRef = {
          name = "${var.name}-secret"
        }

        solvers = [
          {
            dns01 = {
              cloudDNS = {
                project = var.project_id
              }
            }
            selector = {
              dnsZones = var.dns_zones
            }
          },
        ]
      }
    }
  }

  wait_for = {
    fields = {
      "status.conditions[0].type"   = "Ready"
      "status.conditions[0].status" = "True"
    }
  }

  timeouts {
    create = "2m"
    update = "2m"
    delete = "2m"
  }
}

resource "kubernetes_manifest" "cluster_issuer" {
  for_each = var.kind == "ClusterIssuer" ? toset([var.name]) : toset([])

  manifest = {
    apiVersion = var.api_version
    kind       = var.kind

    metadata = {
      name = var.name
    }

    spec = {
      acme = {
        email  = var.email
        server = var.server
        privateKeySecretRef = {
          name = "${var.name}-secret"
        }

        solvers = [
          {
            dns01 = {
              cloudDNS = {
                project = var.project_id
              }
            }
            selector = {
              dnsZones = var.dns_zones
            }
          },
        ]
      }
    }
  }

  wait_for = {
    fields = {
      "status.conditions[0].type"   = "Ready"
      "status.conditions[0].status" = "True"
    }
  }

  timeouts {
    create = "2m"
    update = "2m"
    delete = "2m"
  }
}

Debug Output

Initial execution

       module.cluster_issuer.kubernetes_manifest.cluster_issuer["clouddns-cluster-issuer"]: Still creating... [2m0s elapsed]
       module.issuer.kubernetes_manifest.cluster_issuer["clouddns-issuer"]: Still creating... [2m0s elapsed]
       ╷
       │ Error: Error waiting for operation to complete
       │ 
       │   with module.cluster_issuer.kubernetes_manifest.cluster_issuer["clouddns-cluster-issuer"],
       │   on ../../../main.tf line 64, in resource "kubernetes_manifest" "cluster_issuer":
       │   64: resource "kubernetes_manifest" "cluster_issuer" {
       │ 
       │ Get
       │ "https://10.10.10.10/apis/cert-manager.io/v1/clusterissuers/clouddns-cluster-issuer":
       │ context deadline exceeded
       ╵
       ╷
       │ Error: Operation timed out
       │ 
       │   with module.issuer.kubernetes_manifest.cluster_issuer["clouddns-issuer"],
       │   on ../../../main.tf line 64, in resource "kubernetes_manifest" "cluster_issuer":
       │   64: resource "kubernetes_manifest" "cluster_issuer" {
       │ 
       │ Terraform timed out waiting on the operation to complete
       ╵

Second execution

       Plan: 2 to add, 0 to change, 0 to destroy.
       module.cluster_issuer.kubernetes_manifest.cluster_issuer["clouddns-cluster-issuer"]: Creating...
       module.issuer.kubernetes_manifest.cluster_issuer["clouddns-issuer"]: Creating...
       ╷
       │ Error: Cannot create resource that already exists
       │ 
       │   with module.issuer.kubernetes_manifest.cluster_issuer["clouddns-issuer"],
       │   on ../../../main.tf line 64, in resource "kubernetes_manifest" "cluster_issuer":
       │   64: resource "kubernetes_manifest" "cluster_issuer" {
       │ 
       │ resource "/clouddns-issuer" already exists
       ╵
       ╷
       │ Error: Cannot create resource that already exists
       │ 
       │   with module.cluster_issuer.kubernetes_manifest.cluster_issuer["clouddns-cluster-issuer"],
       │   on ../../../main.tf line 64, in resource "kubernetes_manifest" "cluster_issuer":
       │   64: resource "kubernetes_manifest" "cluster_issuer" {
       │ 
       │ resource "/clouddns-cluster-issuer" already exists
       ╵

Steps to Reproduce

  1. Create a Kubernetes cluster
  2. Install cert-manager
  3. Attempt to apply the above configuration as a module, with the Kubernetes provider pointed at the cluster that has cert-manager installed, for example:

module "issuer" {
  source = "modules/cert-manager-issuer"

  project_id = "my-gcp-project"
  kind       = "Issuer"
  name       = "clouddns-issuer"
  namespace  = "cert-manager"
  email      = "[email protected]"
  server     = "https://acme.example.org"
}

Expected Behavior

The run should time out, but the resource should still be recorded in the Terraform state, or be tracked in some other way.

Actual Behavior

When the timeout is hit, the run fails, which is expected. However, the Issuer still exists on the Kubernetes cluster; Terraform seems to forget about it and tries to create it again the next time you run an apply.
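
As a stopgap (not part of the original report, and assuming the orphaned object is safe to touch), the conflict on the second apply can be resolved either by deleting the object by hand so Terraform can recreate it, or by importing it into state on provider versions whose kubernetes_manifest resource supports import. The resource address and the apiVersion/kind/namespace/name import ID below are inferred from the example configuration and should be treated as a sketch:

# Option 1: delete the orphaned Issuer so the next apply can recreate it
kubectl delete issuer clouddns-issuer --namespace cert-manager

# Option 2: import the existing Issuer into state (assumes import support for
# kubernetes_manifest and the apiVersion/kind/namespace/name ID format)
terraform import 'module.issuer.kubernetes_manifest.issuer["clouddns-issuer"]' \
  "apiVersion=cert-manager.io/v1,kind=Issuer,namespace=cert-manager,name=clouddns-issuer"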

References

  • None

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

brandocorp avatar Oct 13 '21 20:10 brandocorp

We are hitting this bug regularly as well.

jwjs36987 avatar Apr 04 '22 09:04 jwjs36987

I have been having this bug with Deployments.

I don't want to modify our Service resource until the deployment is ready, so I have a wait_for block that checks the status of the deployment. If the create timeout is exceeded, then the deployment still exists, but is not tracked by Terraform, so subsequent terraform apply calls fail with Error: Cannot create resource that already exists.

Our current solution is to avoid using the wait_for argument in favor of the following:

resource "kubernetes_manifest" "deployment" {
  manifest = ...
}

resource "null_resource" "deployment_ready" {
  depends_on = [kubernetes_manifest.deployment]

  provisioner "local-exec" {
    command = "kubectl rollout status deployment ${kubernetes_manifest.deployment.object.metadata.name} --timeout=10m"
  }
}

resource "kubernetes_manifest" "service" {
  depends_on = [null_resource.deployment_ready]
  
  manifest = ...
}

This works for our use case, but would not work for everyone. The Terraform documentation also has an opinion:

Provisioners are a Last Resort

So I would hesitate to recommend this solution long term.

chrismilson avatar May 24 '22 23:05 chrismilson

Marking this issue as stale due to inactivity. If this issue receives no comments in the next 30 days it will automatically be closed. If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. This helps our maintainers find and focus on the active issues. Maintainers may also remove the stale label at their discretion. Thank you!

github-actions[bot] avatar May 26 '23 00:05 github-actions[bot]

This is still an issue for us. For a resource like this that has a wait block but times out, would it be possible to add the resource to the state but mark it as tainted, instead of not adding anything to the Terraform state at all? This would be similar to how Terraform already handles failed provisioners on resources: if the provisioner fails, the resource is still in state but tainted.

I imagine this might not be possible to implement: provisioners are a Terraform-level concept, and as far as the provider is concerned the resource was created fine, so it has to get into the state. Is there a way to inform Terraform that the provider created the resource but that it should be marked as tainted?

This solution isn't perfect either: if you wait for 5 minutes, time out, and commit to creating a tainted resource, but the resource eventually reaches the state you were expecting, it will still be tainted. In that situation it would be nice for the provider to realise on the next run that the target state has been reached and automatically untaint the resource, instead of trying to recreate it and potentially failing again.

Perhaps this issue will go away once the import block lands in Terraform 1.5, and the solution will be to accompany Kubernetes resources that have wait configurations with import blocks.
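
A minimal sketch of what that could look like, assuming the import block syntax planned for Terraform 1.5 and the provider's apiVersion/kind/namespace/name import ID format (both would need checking against the released documentation):

import {
  # hypothetical recovery path: adopt the Issuer that a previous, timed-out apply
  # already created on the cluster, instead of trying to create it again
  to = kubernetes_manifest.issuer["clouddns-issuer"]
  id = "apiVersion=cert-manager.io/v1,kind=Issuer,namespace=cert-manager,name=clouddns-issuer"
}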

chrismilson avatar May 26 '23 15:05 chrismilson