Intermittent DNS problem: networking error looking up CAA for xxx
Describe the bug: We request a lot of certificates from Let's Encrypt via cert-manager (a few hundred to a thousand per day). Most of the time this works just fine.
However, we sometimes see the following error in cert-manager (usually one certificate out of a batch of 20-30):
cert-manager-5f5fc6f7b6-npplb cert-manager-controller E0115 12:32:37.405955 1 sync.go:379] "cert-manager/challenges/acceptChallenge: error waiting for authorization" err="acme: authorization error for xxx.yyy.ourdomain.zzz: 400 urn:ietf:params:acme:error:dns: DNS problem: networking error looking up CAA for ourdomain.zzz" resource_name="xxx-yyy-crt-1-142783028-3717206225" resource_namespace="xxx" resource_kind="Challenge" resource_version="v1" dnsName="xxx.yyy.ourdomain.zzz" type="DNS-01"
The corresponding Order is marked as failed and the Certificate stays unready. This looks similar to, or the same as, #6388 or #3594.
Expected behaviour: This should not fail spuriously. If it does fail, it should retry faster.
Steps to reproduce the bug: Request a lot of Let's Encrypt/ACME certificates. Approximately one in 500 will fail.
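For a rough sense of how often this shows up, here is a back-of-the-envelope sketch in Python. It assumes the failures are independent and uses the approximate 1-in-500 rate and the batch sizes mentioned above; the function names are made up for illustration and the numbers are estimates, not measurements.

```python
# Rough estimate only. Assumes each certificate request fails independently
# with the same probability, which may not hold for real DNS/CAA lookups.

def p_at_least_one_failure(batch_size: int, per_cert_failure_rate: float) -> float:
    """Probability that a batch contains at least one failed certificate."""
    return 1.0 - (1.0 - per_cert_failure_rate) ** batch_size


def expected_failures(requests: int, per_cert_failure_rate: float) -> float:
    """Expected number of failed certificates among `requests` requests."""
    return requests * per_cert_failure_rate


if __name__ == "__main__":
    rate = 1 / 500  # approximate failure rate reported in this issue
    for batch in (20, 30):
        p = p_at_least_one_failure(batch, rate)
        print(f"batch of {batch}: P(at least one failure) = {p:.1%}")
    for per_day in (300, 1000):  # "a few hundred to a thousand per day"
        print(f"{per_day} requests/day: about {expected_failures(per_day, rate):.1f} failures/day")
```

Under these assumptions, roughly 4 to 6 percent of batches would contain a failed certificate, and at a thousand requests per day that is about two failed Orders daily.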
Anything else we need to know?:
Environment details:
- Kubernetes version: 1.24
- Cloud-provider/provisioner: AWS via Kops
- cert-manager version: 1.13.3
- Install method: e.g. helm
/kind bug
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
/close
@cert-manager-bot: Closing this issue.