Race condition when two identical certificate requests are made from different clusters
Describe the bug:
It seems we have a race condition when two Kubernetes clusters with near-identical configuration each request the same certificate using the DNS01 solver.
We have two Kubernetes clusters, one primary and the other secondary for disaster recovery. Both are in different Azure regions and configured almost identically.
Identical Certificate resources are created on both clusters within a minute of each other. The first cluster to get a response from LetsEncrypt deletes the _acme-challenge DNS record, and the second cluster's Certificate resource is left in a READY state of False indefinitely.
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-live
spec:
  acme:
    preferredChain: ""
    privateKeySecretRef:
      name: letsencrypt-key-secret-live
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
    - dns01:
        azureDNS:
          environment: AzurePublicCloud
          hostedZoneName: env.REDACTED.com
          resourceGroupName: REDACTED_RESOURCE_GROUP_NAME
          subscriptionID: REDACTED_SUBSCRIPTION_ID
        cnameStrategy: Follow
      selector:
        dnsZones:
        - REDACTED.com
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: FOO-prod-certificate
  namespace: FOO-prod
spec:
  dnsNames:
  - FOO.REDACTED.com
  duration: 2160h0m0s
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-live
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  renewBefore: 720h0m0s
  secretName: FOO-prod-certificate
  usages:
  - server auth
  - client auth
  - key encipherment
  - digital signature
...
Expected behaviour:
The second cluster should detect that the _acme-challenge DNS record was deleted, and then re-attempt the request with LetsEncrypt.
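Roughly, the self-healing behaviour we'd expect looks like the sketch below (illustrative Go only, not cert-manager's actual controller code; ensureChallengeRecord and presentFn are hypothetical names): while the order is pending, poll the challenge FQDN and re-present the TXT record if it disappears.

// Illustrative sketch only: poll the challenge FQDN while the order is
// pending and re-present the TXT record if another actor deleted it.
// ensureChallengeRecord and presentFn are hypothetical names, not part of
// cert-manager's API.
package challenge

import (
	"context"
	"net"
	"time"
)

// presentFn re-creates the _acme-challenge TXT record via the DNS01 solver.
type presentFn func(ctx context.Context) error

func ensureChallengeRecord(ctx context.Context, fqdn, expected string, present presentFn) error {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err() // order finished or was cancelled
		case <-ticker.C:
			values, err := net.LookupTXT(fqdn)
			if err != nil || !containsValue(values, expected) {
				// The record disappeared (e.g. the other cluster deleted it
				// after completing its own order): put it back so the next
				// LetsEncrypt validation attempt can succeed.
				if perr := present(ctx); perr != nil {
					return perr
				}
			}
		}
	}
}

func containsValue(values []string, want string) bool {
	for _, v := range values {
		if v == want {
			return true
		}
	}
	return false
}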
Steps to reproduce the bug:
See above.
Anything else we need to know?:
Environment details:
- Kubernetes version: 1.25.6
- Cloud-provider/provisioner: Azure
- cert-manager version: quay.io/jetstack/cert-manager-acmesolver:v1.11.1
- Install method: Helm chart, version 1.11.1
/kind bug
We've had this issue with a couple of different DNS providers; it has been solved for Route53 and CloudDNS so far, see
- https://github.com/cert-manager/cert-manager/pull/6088
- https://github.com/cert-manager/cert-manager/pull/4793
There is probably a way to achieve the same for Azure DNS. None of the maintainers are very familiar with Azure, and we don't have Azure infrastructure to test this, so it would be good if an external contributor who uses Azure and can test there could pick this up.
The same problem exists for Cloudflare DNS.
For Azure DNS this can be solved using the optimistic concurrency that their APIs support.
I have a working PoC that I never attempted to upstream, as I'm not sure how to properly and consistently test something like this: https://github.com/eplightning/cert-manager/commit/2f1fbb1f824a7549d1fa913c13e4120f3aeb7fd3
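For reference, the shape of that idea (a minimal sketch assuming the github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/dns/armdns client; addChallengeValue is an illustrative name, and this is not the linked PoC):

// Sketch of optimistic concurrency with Azure DNS: read the record set with
// its ETag, append our TXT value, and write back conditionally so a
// concurrent write from the other cluster fails with 412 Precondition
// Failed instead of being silently clobbered.
package azuredns

import (
	"context"
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/dns/armdns"
)

func addChallengeValue(ctx context.Context, client *armdns.RecordSetsClient,
	rg, zone, relativeName, value string) error {
	existing, err := client.Get(ctx, rg, zone, relativeName, armdns.RecordTypeTXT, nil)
	if err != nil {
		// A 404 means no record set exists yet; a full implementation would
		// create one with IfNoneMatch: "*" so a concurrent create by the
		// other cluster is rejected rather than overwritten.
		return err
	}

	records := append(existing.Properties.TxtRecords,
		&armdns.TxtRecord{Value: []*string{to.Ptr(value)}})

	// If-Match ties the write to the ETag we read: Azure rejects it if the
	// record set changed in between, and the caller can simply retry the
	// read-modify-write loop.
	_, err = client.CreateOrUpdate(ctx, rg, zone, relativeName, armdns.RecordTypeTXT,
		armdns.RecordSet{
			Properties: &armdns.RecordSetProperties{
				TTL:        to.Ptr[int64](60),
				TxtRecords: records,
			},
		},
		&armdns.RecordSetsClientCreateOrUpdateOptions{IfMatch: existing.Etag},
	)
	if err != nil {
		return fmt.Errorf("conditional TXT update failed (retry on 412): %w", err)
	}
	return nil
}

Cleanup would be the symmetric read-filter-write under the same If-Match condition, removing only this cluster's value, so a lost race surfaces as a retryable error rather than a silently deleted record.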
We are hitting the same issue with the Cloudflare provider.
Would a simpler approach be to allow the TXT record name to be changed to something other than _acme-challenge.*?
That way cluster A could use a _acme-challenge-clusterA TXT record and cluster B could use _acme-challenge-clusterB.
This would then fix it for all providers. I believe this name is hardcoded, but it could potentially be exposed as a config item (env var, start flag, or ClusterIssuer spec).
This would also be easier to add test cases for.
@chr15murray there is a change merged into the main branch (#5884 & #6191); I read that it will be released with the 1.13-alpha1 image. We had the same issue with our clusters, and we managed to build the main branch: it is working.
Thanks @michalg91, I'll look to get this deployed.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale
/remove-lifecycle rotten
Continues to be an issue for Azure DNS. There's a PR open that should fix the issue: https://github.com/cert-manager/cert-manager/pull/6351
Also seeing this issue with the CloudDNS (GCP) issuer.