cert-manager Intermittent DNS problem: networking error looking up CAA for xxx

Describe the bug: We request a lot of certificates from Let's Encrypt via cert-manager (a few hundred to a thousand per day). Most of the time this works fine.

However, we sometimes see the following error in the cert-manager logs (usually one certificate out of a batch of 20-30):

cert-manager-5f5fc6f7b6-npplb cert-manager-controller E0115 12:32:37.405955       1 sync.go:379] "cert-manager/challenges/acceptChallenge: error waiting for authorization" err="acme: authorization error for xxx.yyy.ourdomain.zzz: 400 urn:ietf:params:acme:error:dns: DNS problem: networking error looking up CAA for ourdomain.zzz" resource_name="xxx-yyy-crt-1-142783028-3717206225" resource_namespace="xxx" resource_kind="Challenge" resource_version="v1" dnsName="xxx.yyy.ourdomain.zzz" type="DNS-01"
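To measure how often this happens, the relevant fields can be pulled out of log lines like the one above. The following is a minimal Python sketch, assuming the controller logs keep this `key="value"` layout. `parse_challenge_error` is a hypothetical helper for illustration and is not part of cert-manager.

```python
import re

# Matches key="value" pairs as they appear in the controller log line.
FIELD_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')
# Matches the ACME problem type, e.g. urn:ietf:params:acme:error:dns
PROBLEM_RE = re.compile(r'urn:ietf:params:acme:error:(\w+)')

# Sample taken from the log line in this report (pod/container prefix omitted).
SAMPLE = (
    'E0115 12:32:37.405955       1 sync.go:379] '
    '"cert-manager/challenges/acceptChallenge: error waiting for authorization" '
    'err="acme: authorization error for xxx.yyy.ourdomain.zzz: 400 '
    'urn:ietf:params:acme:error:dns: DNS problem: networking error '
    'looking up CAA for ourdomain.zzz" '
    'resource_name="xxx-yyy-crt-1-142783028-3717206225" '
    'resource_namespace="xxx" resource_kind="Challenge" '
    'resource_version="v1" dnsName="xxx.yyy.ourdomain.zzz" type="DNS-01"'
)


def parse_challenge_error(line):
    """Hypothetical helper: extract the fields needed to count CAA lookup failures."""
    fields = dict(FIELD_RE.findall(line))
    err = fields.get("err", "")
    match = PROBLEM_RE.search(err)
    return {
        "dns_name": fields.get("dnsName"),
        "challenge": fields.get("resource_name"),
        "namespace": fields.get("resource_namespace"),
        "problem": match.group(1) if match else None,
        # True only for the specific failure described in this issue.
        "caa_lookup_failed": "looking up CAA" in err,
    }


if __name__ == "__main__":
    print(parse_challenge_error(SAMPLE))
```

Counting lines where `caa_lookup_failed` is true against the number of certificates requested would give a firmer failure rate than an estimate.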

The corresponding Order is marked as failed and the Certificate is not ready. This looks similar to, or the same as, #6388 or #3594.

Expected behaviour: Issuance should not fail spuriously on a transient CAA lookup error. If it does fail, cert-manager should retry sooner.

Steps to reproduce the bug: Request a lot of Let's Encrypt/ACME certificates. Approximately one in 500 will fail.

Anything else we need to know?:

Environment details:

Kubernetes version: 1.24
Cloud-provider/provisioner: AWS via Kops
cert-manager version: 1.13.3
Install method: e.g. helm

/kind bug

Jan 15 '24 13:01 jan-kantert

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. /lifecycle stale

Apr 14 '24 14:04 jetstack-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. /lifecycle rotten /remove-lifecycle stale

May 14 '24 14:05 cert-manager-bot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. /close

Jun 13 '24 14:06 cert-manager-bot

@cert-manager-bot: Closing this issue.

Jun 13 '24 14:06 cert-manager-prow[bot]