fix: ❗dns-01 route53 query change status retry timeout
Pull Request Motivation
Today, DNS-01 challenges over Route53 poll the DNS record change status with a hardcoded 2-minute timeout. On large clusters where many ACME certificates are issued through Let's Encrypt, this causes challenges to fail repeatedly and certificates to never become Ready (we have been seeing multiple reports of such issues for quite some time now).
CloudDNS, by contrast, retries forever, with a 1s wait between every status check.
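For context, the current status check is roughly a bounded poll with a hardcoded 2-minute ceiling. The sketch below is illustrative only, not the actual cert-manager code (the poll interval, package name and helper name are assumptions), written against aws-sdk-go-v2 and the k8s.io/apimachinery wait helpers:

```go
package route53poll

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/route53"
	"github.com/aws/aws-sdk-go-v2/service/route53/types"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForChange approximates today's behaviour: poll GetChange on a fixed
// interval, but give up after a hardcoded 2 minutes even if the change is
// still PENDING on the Route53 side.
func waitForChange(ctx context.Context, client *route53.Client, changeID string) error {
	return wait.PollUntilContextTimeout(ctx, 3*time.Second /* interval assumed */, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			out, err := client.GetChange(ctx, &route53.GetChangeInput{Id: aws.String(changeID)})
			if err != nil {
				return false, err
			}
			// Only done once Route53 reports the change as INSYNC.
			return out.ChangeInfo.Status == types.ChangeStatusInsync, nil
		})
}
```

The CloudDNS solver effectively runs the same kind of check with a 1s interval and no overall deadline, which is the behaviour this PR moves the Route53 solver towards.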
Kind
/kind bug
Release Note
ACME DNS-01 challenges for Route53 would poll the DNS record change while it was still in the PENDING state and time out after a hardcoded 2m0s, causing repeated failed challenges and unready certificates on clusters requesting many certificates. The change status query now keeps retrying until the change has been processed by Route53.
Hi @swghosh. Thanks for your PR.
I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Not sure how I feel about introducing an infinite loop; looking at it, even a context close will not stop it (which it looks like would not have stopped the old code either, something I would be in favour of fixing).
Probably worth a chat about whether we want to poll forever, especially when AWS document that any changes should be available within 60 seconds.
If we are ok with it blocking forever, I think wait.PollUntilContextCancel would be better, as it handles the context close for us.
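Roughly something like this (a quick sketch only, not tested, and assuming the solver has an aws-sdk-go-v2 Route53 client to hand; imports as in the sketch in the PR description): it keeps retrying with no overall deadline, but stops as soon as the challenge's context is cancelled.

```go
// Sketch: retry indefinitely, but respect context cancellation.
func waitForChange(ctx context.Context, client *route53.Client, changeID string) error {
	return wait.PollUntilContextCancel(ctx, time.Second, true,
		func(ctx context.Context) (bool, error) {
			out, err := client.GetChange(ctx, &route53.GetChangeInput{Id: aws.String(changeID)})
			if err != nil {
				// Returning the error aborts the poll; returning (false, nil)
				// here instead would keep retrying through transient API errors.
				return false, err
			}
			return out.ChangeInfo.Status == types.ChangeStatusInsync, nil
		})
}
```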
When propagation is complete, GetChange returns a status of INSYNC. Changes generally propagate to all Route 53 name servers managing the hosted zone within 60 seconds. [source]
This is, however, no guarantee, just a general expectation. We currently poll for 2 minutes (hardcoded) and then create a new ChangeRecordSet event. If AWS has an issue in their backend, we skip ahead to the next ChangeRecordSet and poll that for another 2 minutes, repeating indefinitely until AWS resolves the issue and the ChangeRecordSet gets into the UPSERT/applied state within 2 minutes again.
@typeid despite the UPSERT, challenges get stuck forever, and there are traces of the error in the reported logs (as seen in OCPBUGS-42199, where cert-manager has to create 300+ certs):
E0518 16:10:22.720047 1 controller.go:167] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="Time limit exceeded. Last error: " "key"="ocm-production-id/cluster-api-cert-zxkfj-2403688073-3486302731"
I'd prefer switching to PollUntilContextCancel rather than creating new ChangeRecordSet(s) that get stuck again and again. Thoughts, @ThatsMrTalbot?
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: bashlion. Once this PR has been reviewed and has the lgtm label, please assign euank for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@swghosh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cert-manager-master-e2e-v1-32 | 6c9a81b420840c44314c963c22b6ffff64d22bf7 | link | true | /test pull-cert-manager-master-e2e-v1-32 |
| pull-cert-manager-master-e2e-v1-32-upgrade | 6c9a81b420840c44314c963c22b6ffff64d22bf7 | link | true | /test pull-cert-manager-master-e2e-v1-32-upgrade |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
CloudDNS, by contrast, retries forever, with a 1s wait between every status check
@swghosh Please link to a source for this. And link to any discussion about the pros and cons of this behaviour.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 3 months of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 3 months of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
PR needs rebase.