terraform-provider-kubernetes
Timeouts combined with wait_for can cause resource/state conflict.
When using the new kubernetes_manifest resource, if I intentionally configure a resource such that the wait_for clause can never be met, the run will eventually hit the timeout threshold. When this happens, the resource has been created on the Kubernetes cluster. However, when I run Terraform a second time, the resource appears not to exist, and then an error is thrown for an existing resource conflict.
My example is a cert-manager Issuer CRD. You must provide a valid ACME URL and an acceptable email address, otherwise the Issuer resource will remain in an invalid status condition. For example, with a bogus ACME server you get stuck with the following status:
status:
  acme: {}
  conditions:
  - lastTransitionTime: "2021-10-13T19:35:17Z"
    message: 'Failed to verify ACME account: Get "https://acme.example.org": dial
      tcp: lookup acme.example.org on 10.108.0.10:53: no such host'
    reason: ErrRegisterACMEAccount
    status: "False"
    type: Ready
When the timeout is hit, the run fails, which is expected. However, the Issuer still exists on the Kubernetes cluster, but Terraform seems to forget about it and tries to create it again the next time you run an apply.
Terraform Version, Provider Version and Kubernetes Version
Terraform version: v1.0.8
Kubernetes provider version: kubernetes v2.5.0
Kubernetes version: 1.19.13-gke.1200
Affected Resource(s)
- kubernetes_manifest
Terraform Configuration Files
variable "api_version" {
type = string
description = "The cert-manager API version."
default = "cert-manager.io/v1"
}
variable "dns_zones" {
type = list(string)
default = []
description = "The DNS zones that can be solved by this solver."
}
variable "email" {
type = string
description = "The email address for the issuer."
}
variable "kind" {
type = string
description = "The Kind of the issuer (e.g. Issuer or ClusterIssuer)."
validation {
condition = var.kind == "Issuer" || var.kind == "ClusterIssuer"
error_message = "The issuer kind must be one of: ClusterIssuer, Issuer."
}
}
variable "name" {
type = string
description = "The name of the issuer."
}
variable "namespace" {
type = string
description = "The kubernetes namespace to use when provisioning the issuer."
default = null
}
variable "project_id" {
type = string
description = "The GCP Project ID."
}
variable "server" {
type = string
description = "The server for the issuer."
}
resource "kubernetes_manifest" "issuer" {
for_each = var.kind == "ClusterIssuer" ? toset([]) : toset([var.name])
manifest = {
apiVersion = var.api_version
kind = var.kind
metadata = {
name = var.name
namespace = var.namespace
}
spec = {
acme = {
email = var.email
server = var.server
privateKeySecretRef = {
name = "${var.name}-secret"
}
solvers = [
{
dns01 = {
cloudDNS = {
project = var.project_id
}
}
selector = {
dnsZones = var.dns_zones
}
},
]
}
}
}
wait_for = {
fields = {
"status.conditions[0].type" = "Ready"
"status.conditions[0].status" = "True"
}
}
timeouts {
create = "2m"
update = "2m"
delete = "2m"
}
}
resource "kubernetes_manifest" "cluster_issuer" {
for_each = var.kind == "ClusterIssuer" ? toset([var.name]) : toset([])
manifest = {
apiVersion = var.api_version
kind = var.kind
metadata = {
name = var.name
}
spec = {
acme = {
email = var.email
server = var.server
privateKeySecretRef = {
name = "${var.name}-secret"
}
solvers = [
{
dns01 = {
cloudDNS = {
project = var.project_id
}
}
selector = {
dnsZones = var.dns_zones
}
},
]
}
}
}
wait_for = {
fields = {
"status.conditions[0].type" = "Ready"
"status.conditions[0].status" = "True"
}
}
timeouts {
create = "2m"
update = "2m"
delete = "2m"
}
}
Debug Output
Initial execution
module.cluster_issuer.kubernetes_manifest.cluster_issuer["clouddns-cluster-issuer"]: Still creating... [2m0s elapsed]
module.issuer.kubernetes_manifest.cluster_issuer["clouddns-issuer"]: Still creating... [2m0s elapsed]
╷
│ Error: Error waiting for operation to complete
│
│ with module.cluster_issuer.kubernetes_manifest.cluster_issuer["clouddns-cluster-issuer"],
│ on ../../../main.tf line 64, in resource "kubernetes_manifest" "cluster_issuer":
│ 64: resource "kubernetes_manifest" "cluster_issuer" {
│
│ Get
│ "https://10.10.10.10/apis/cert-manager.io/v1/clusterissuers/clouddns-cluster-issuer":
│ context deadline exceeded
╵
╷
│ Error: Operation timed out
│
│ with module.issuer.kubernetes_manifest.cluster_issuer["clouddns-issuer"],
│ on ../../../main.tf line 64, in resource "kubernetes_manifest" "cluster_issuer":
│ 64: resource "kubernetes_manifest" "cluster_issuer" {
│
│ Terraform timed out waiting on the operation to complete
╵
Second execution
Plan: 2 to add, 0 to change, 0 to destroy.
module.cluster_issuer.kubernetes_manifest.cluster_issuer["clouddns-cluster-issuer"]: Creating...
module.issuer.kubernetes_manifest.cluster_issuer["clouddns-issuer"]: Creating...
╷
│ Error: Cannot create resource that already exists
│
│ with module.issuer.kubernetes_manifest.cluster_issuer["clouddns-issuer"],
│ on ../../../main.tf line 64, in resource "kubernetes_manifest" "cluster_issuer":
│ 64: resource "kubernetes_manifest" "cluster_issuer" {
│
│ resource "/clouddns-issuer" already exists
╵
╷
│ Error: Cannot create resource that already exists
│
│ with module.cluster_issuer.kubernetes_manifest.cluster_issuer["clouddns-cluster-issuer"],
│ on ../../../main.tf line 64, in resource "kubernetes_manifest" "cluster_issuer":
│ 64: resource "kubernetes_manifest" "cluster_issuer" {
│
│ resource "/clouddns-cluster-issuer" already exists
╵
Panic Output
Steps to Reproduce
- Create a Kubernetes cluster
- Install cert-manager
- Apply the above configuration as a module, with the Kubernetes provider pointed at the cluster that has cert-manager installed, for example:
module "issuer" {
source = "modules/cert-manager-issuer"
project_id = "my-gcp-project"
kind = "Issuer"
name = "clouddns-issuer"
namespace = "cert-manager"
email = "[email protected]"
server = "https://acme.example.org"
}
Expected Behavior
What should have happened?
The run should time out, but the resource should still be in the state, or be tracked in some way?
Actual Behavior
What actually happened?
When the timeout is hit, the run fails, which is expected. However, the Issuer still exists on the Kubernetes cluster, but Terraform seems to forget about it and tries to create it again the next time you run an apply.
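One manual way to recover, sketched here as a suggestion rather than something verified in this thread, is to import the orphaned object back into state before the next apply, assuming your provider version supports importing kubernetes_manifest resources. The ID follows the apiVersion/kind/namespace/name style from the provider docs, and the resource address below assumes the issuer module shown above; use whatever address your plan output actually shows:
terraform import 'module.issuer.kubernetes_manifest.issuer["clouddns-issuer"]' \
  "apiVersion=cert-manager.io/v1,kind=Issuer,namespace=cert-manager,name=clouddns-issuer"
After the import, the next apply can reconcile the existing object instead of failing on the create.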
Important Factoids
References
- None
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
We are hitting this bug regularly as well.
I have been hitting this bug with Deployments. I don't want to modify our Service resource until the Deployment is ready, so I have a wait_for block that checks the status of the Deployment. If the create timeout is exceeded, the Deployment still exists but is not tracked by Terraform, so subsequent terraform apply calls fail with Error: Cannot create resource that already exists.
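For reference, such a wait_for setup looks roughly like the sketch below; the names, image, and the readyReplicas field path and value are illustrative assumptions rather than the exact configuration:
resource "kubernetes_manifest" "deployment" {
  manifest = {
    apiVersion = "apps/v1"
    kind       = "Deployment"
    metadata = {
      name      = "example" # hypothetical name
      namespace = "default"
      labels    = { app = "example" }
    }
    spec = {
      replicas = 1
      selector = { matchLabels = { app = "example" } }
      template = {
        metadata = { labels = { app = "example" } }
        spec = {
          containers = [
            {
              name  = "example"
              image = "nginx:1.25" # hypothetical image
            },
          ]
        }
      }
    }
  }

  # Sketch: block until the Deployment reports a ready replica.
  # The field path and expected value here are assumptions for illustration.
  wait_for = {
    fields = {
      "status.readyReplicas" = "1"
    }
  }

  timeouts {
    create = "10m"
  }
}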
Our current solution is to avoid using the wait_for argument in favor of the following:
resource "kubernetes_manifest" "deployment" {
manifest = ...
}
resource "null_resource" "deployment_ready" {
depends_on = [kubernetes_manifest.deployment]
provisioner "local-exec" {
command = "kubectl rollout status deployment ${kubernetes_manifest.deployment.object.metadata.name} --timeout=10m"
}
}
resource "kubernetes_manifest" "service" {
depends_on = [null_resource.deployment_ready]
manifest = ...
}
This works for our use case, but would not work for everyone. The Terraform documentation also has an opinion: "Provisioners are a Last Resort." So I would hesitate to recommend this solution long term.
Marking this issue as stale due to inactivity. If this issue receives no comments in the next 30 days it will automatically be closed. If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. This helps our maintainers find and focus on the active issues. Maintainers may also remove the stale label at their discretion. Thank you!
This is still an issue for us. For a resource like this that has a wait block but times out, would it be possible to add the resource to the state but mark it as tainted, instead of not adding anything to the Terraform state at all? This would be similar to how Terraform already handles failed provisioners on resources: if the provisioner fails, the resource is still in state but tainted.
I imagine this might not be possible to implement: provisioners are a Terraform-level feature, and from the provider's perspective the resource is created fine, so it has to get into the state. Is there a way to inform Terraform that the provider created the resource but that it should be marked as tainted?
This solution isn't perfect either: if you wait for 5 minutes, time out, and commit to creating a tainted resource, but the resource eventually reaches the state you were expecting, the resource will still be tainted. In this situation it would be nice for the provider to realise on the next run that the target state has been reached and automatically untaint it, instead of trying to recreate and potentially fail again...
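If that tainting behaviour existed, clearing the taint by hand once the object does reach the desired condition would just be the existing CLI workflow, for example (the address below is hypothetical, mirroring the issuer module earlier in this issue):
terraform untaint 'module.issuer.kubernetes_manifest.issuer["clouddns-issuer"]'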
Perhaps this issue will go away once the import block is implemented in Terraform 1.5, and the solution will be to accompany Kubernetes resources that have wait configurations with import blocks.
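For example, a minimal sketch of such an import block, assuming Terraform 1.5+ and the apiVersion/kind/namespace/name import ID format used by kubernetes_manifest; the address and names are illustrative, mirroring the issuer module earlier in this issue:
import {
  # Illustrative: adopt the object that an earlier, timed-out apply already created.
  to = module.issuer.kubernetes_manifest.issuer["clouddns-issuer"]
  id = "apiVersion=cert-manager.io/v1,kind=Issuer,namespace=cert-manager,name=clouddns-issuer"
}
With a block like this alongside the resource, a re-run after a wait_for timeout could pick up the already-created object instead of failing with "Cannot create resource that already exists".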