Race condition when two identical certificate requests are made from different clusters
Describe the bug:
It seems we have a race condition when two Kubernetes clusters with near-identical configuration each request the same certificate using the DNS01 solver.
We have two Kubernetes clusters, one primary and the other secondary for disaster recovery. Both are in different Azure regions and configured almost identically.
Identical Certificate resources are created on both clusters within a minute of each other. The first cluster to get a response from LetsEncrypt deletes the _acme-challenge DNS record, and the second cluster's Certificate resource is left in a READY state of False indefinitely.
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-live
spec:
  acme:
    preferredChain: ""
    privateKeySecretRef:
      name: letsencrypt-key-secret-live
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
    - dns01:
        azureDNS:
          environment: AzurePublicCloud
          hostedZoneName: env.REDACTED.com
          resourceGroupName: REDACTED_RESOURCE_GROUP_NAME
          subscriptionID: REDACTED_SUBSCRIPTION_ID
        cnameStrategy: Follow
      selector:
        dnsZones:
        - REDACTED.com
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: FOO-prod-certificate
  namespace: FOO-prod
spec:
  dnsNames:
  - FOO.REDACTED.com
  duration: 2160h0m0s
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-live
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  renewBefore: 720h0m0s
  secretName: FOO-prod-certificate
  usages:
  - server auth
  - client auth
  - key encipherment
  - digital signature
...
Expected behaviour:
The second cluster should detect that the _acme-challenge DNS record was deleted, and then re-attempt the request with LetsEncrypt.
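Roughly, the self-healing behaviour we'd expect looks like the sketch below (illustrative Go only, not cert-manager's actual controller code; ensureChallengeRecord and presentFn are hypothetical names): while the order is pending, poll the challenge FQDN and re-present the TXT record if it disappears.

// Illustrative sketch only: poll the challenge FQDN while the order is
// pending and re-present the TXT record if another actor deleted it.
// ensureChallengeRecord and presentFn are hypothetical names, not part of
// cert-manager's API.
package challenge

import (
	"context"
	"net"
	"time"
)

// presentFn re-creates the _acme-challenge TXT record via the DNS01 solver.
type presentFn func(ctx context.Context) error

func ensureChallengeRecord(ctx context.Context, fqdn, expected string, present presentFn) error {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err() // order finished or was cancelled
		case <-ticker.C:
			values, err := net.LookupTXT(fqdn)
			if err != nil || !containsValue(values, expected) {
				// The record disappeared (e.g. the other cluster deleted it
				// after completing its own order): put it back so the next
				// LetsEncrypt validation attempt can succeed.
				if perr := present(ctx); perr != nil {
					return perr
				}
			}
		}
	}
}

func containsValue(values []string, want string) bool {
	for _, v := range values {
		if v == want {
			return true
		}
	}
	return false
}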
Steps to reproduce the bug:
See above.
Anything else we need to know?:
Environment details:
- Kubernetes version: 1.25.6
- Cloud-provider/provisioner: Azure
- cert-manager version: quay.io/jetstack/cert-manager-acmesolver:v1.11.1
- Install method: Helm chart, version 1.11.1
/kind bug
We've had this issue with a couple of different DNS providers; it has been solved for Route53 and CloudDNS so far, see
- https://github.com/cert-manager/cert-manager/pull/6088
- https://github.com/cert-manager/cert-manager/pull/4793
There is probably a way to achieve the same for Azure DNS. None of the maintainers are very familiar with Azure, and we don't have Azure infrastructure to test this, so it would be good if an external contributor who uses Azure and can test there could pick this up.
The same problem exists for Cloudflare DNS.
For Azure DNS this can be solved using the optimistic concurrency that their APIs support.
I have a working PoC that I never attempted to upstream, as I'm not sure how to properly and consistently test something like this: https://github.com/eplightning/cert-manager/commit/2f1fbb1f824a7549d1fa913c13e4120f3aeb7fd3
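For reference, the shape of that idea (a minimal sketch assuming the github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/dns/armdns client; addChallengeValue is an illustrative name, and this is not the linked PoC):

// Sketch of optimistic concurrency with Azure DNS: read the record set with
// its ETag, append our TXT value, and write back conditionally so a
// concurrent write from the other cluster fails with 412 Precondition
// Failed instead of being silently clobbered.
package azuredns

import (
	"context"
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/dns/armdns"
)

func addChallengeValue(ctx context.Context, client *armdns.RecordSetsClient,
	rg, zone, relativeName, value string) error {
	existing, err := client.Get(ctx, rg, zone, relativeName, armdns.RecordTypeTXT, nil)
	if err != nil {
		// A 404 means no record set exists yet; a full implementation would
		// create one with IfNoneMatch: "*" so a concurrent create by the
		// other cluster is rejected rather than overwritten.
		return err
	}

	records := append(existing.Properties.TxtRecords,
		&armdns.TxtRecord{Value: []*string{to.Ptr(value)}})

	// If-Match ties the write to the ETag we read: Azure rejects it if the
	// record set changed in between, and the caller can simply retry the
	// read-modify-write loop.
	_, err = client.CreateOrUpdate(ctx, rg, zone, relativeName, armdns.RecordTypeTXT,
		armdns.RecordSet{
			Properties: &armdns.RecordSetProperties{
				TTL:        to.Ptr[int64](60),
				TxtRecords: records,
			},
		},
		&armdns.RecordSetsClientCreateOrUpdateOptions{IfMatch: existing.Etag},
	)
	if err != nil {
		return fmt.Errorf("conditional TXT update failed (retry on 412): %w", err)
	}
	return nil
}

Cleanup would be the symmetric read-filter-write under the same If-Match condition, removing only this cluster's value, so a lost race surfaces as a retryable error rather than a silently deleted record.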
We are hitting the same issue with the Cloudflare provider.
Would a simpler approach be to allow the TXT record name to be changed to something other than _acme-challenge.*?
That way cluster A could use a _acme-challenge-clusterA TXT record and cluster B could use _acme-challenge-clusterB.
This would then fix it for all providers. I believe this name is hardcoded, but it could potentially be exposed as a config item (env var, start flag, or ClusterIssuer spec).
This would also be easier to add test cases for.
@chr15murray there is a change merged into the main branch (#5884 & #6191); I read that it will be released with the 1.13-alpha1 image. We had the same issue with our clusters, and we managed to build the main branch: it is working.
Thanks @michalg91, I'll look to get this deployed.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale
/remove-lifecycle rotten
Continues to be an issue for Azure DNS. There's a PR open that should fix the issue: https://github.com/cert-manager/cert-manager/pull/6351
Also seeing this issue with the CloudDNS (GCP) issuer.