
fix: ❗dns-01 route53 query change status retry timeout

Open · swghosh opened this pull request 1 year ago · 12 comments

Pull Request Motivation

Today, DNS-01 challenges over Route53 query the DNS record change status with a hardcoded timeout of 2 minutes. On large clusters where many ACME certificates are issued through LetsEncrypt, this causes repeated failed challenges and certificates that never become Ready (we have been seeing multiple reports of such issues for quite some time now).

CloudDNS, by contrast, retries forever with a wait period of 1s between every status check.
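For illustration, the two behaviours described above differ roughly as sketched below. This is not the actual solver code from either implementation; `checkPropagated`, the 4-second interval, and the function names are hypothetical stand-ins.

```go
package dnssketch

import (
	"context"
	"errors"
	"time"
)

// checkPropagated is a hypothetical stand-in for the solver's status check
// (e.g. Route53 GetChange reporting INSYNC).
func checkPropagated(ctx context.Context) (bool, error) { return false, nil }

// route53StyleWait illustrates the current behaviour: give up after a
// hardcoded 2-minute window, regardless of whether the change ever lands.
func route53StyleWait(ctx context.Context) error {
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		if done, _ := checkPropagated(ctx); done {
			return nil
		}
		time.Sleep(4 * time.Second) // interval is illustrative
	}
	return errors.New("time limit exceeded")
}

// cloudDNSStyleWait illustrates the CloudDNS-style behaviour: keep checking
// every second until the change is reported as propagated.
func cloudDNSStyleWait(ctx context.Context) error {
	for {
		if done, _ := checkPropagated(ctx); done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // stop if the caller gives up
		case <-time.After(time.Second):
		}
	}
}
```

Note that this sketch stops when the context is cancelled; per the discussion below, neither the existing Route53 code nor the proposed loop necessarily does.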

Kind

/kind bug

Release Note

ACME DNS-01 challenges for Route53 queried the DNS record change status while it was in the PENDING state and timed out after a hardcoded period of 2m0s, which caused repeated failed challenges and unready certificates on clusters requesting many certificates. The change record status query now retries until the change has been processed by Route53.

swghosh avatar Dec 12 '24 13:12 swghosh

Hi @swghosh. Thanks for your PR.

I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

cert-manager-prow[bot] avatar Dec 12 '24 13:12 cert-manager-prow[bot]

/ok-to-test

inteon avatar Dec 12 '24 14:12 inteon

Not sure how I feel about introducing an infinite loop. Looking at it, even a context cancellation will not stop it (and it looks like it would not have stopped the old code either, which I would be in favour of fixing).

ThatsMrTalbot avatar Dec 12 '24 16:12 ThatsMrTalbot

Probably worth a chat about whether we want to poll forever, especially when AWS documents that any changes should be available within 60 seconds.

If we are OK with it blocking forever, I think wait.PollUntilContextCancel would be better, as it handles the context cancellation for us (see the sketch below).

ThatsMrTalbot avatar Dec 12 '24 20:12 ThatsMrTalbot
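A minimal sketch of the wait.PollUntilContextCancel approach suggested above, assuming a hypothetical checkPropagated helper that wraps the Route53 GetChange call (this is not the PR's actual diff):

```go
package dnssketch

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkPropagated is a hypothetical helper wrapping the Route53 GetChange call;
// it reports true once the change status is INSYNC.
func checkPropagated(ctx context.Context) (bool, error) { return false, nil }

// waitForChange polls every second until the change has propagated or the
// caller's context is cancelled, whichever comes first.
func waitForChange(ctx context.Context) error {
	// immediate=true runs the condition once before the first sleep.
	return wait.PollUntilContextCancel(ctx, time.Second, true, func(ctx context.Context) (bool, error) {
		done, err := checkPropagated(ctx)
		if err != nil {
			// Treat API errors as transient and keep polling; returning a
			// non-nil error here would end the poll immediately.
			return false, nil
		}
		return done, nil
	})
}
```

Because the condition receives a context derived from ctx, cancelling the challenge's context ends the poll instead of letting it loop forever.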

> When propagation is complete, GetChange returns a status of INSYNC. Changes generally propagate to all Route 53 name servers managing the hosted zone within 60 seconds. [source]

This is, however, no guarantee, just a general time that can be expected. We currently poll for 2 minutes (hardcoded) and then create a new ChangeRecordSet event. If AWS has an issue in their backend, we will skip ahead to the next ChangeRecordSet and poll that again for 2 minutes, indefinitely, until AWS resolves their issue and the ChangeRecordSet gets into the UPSERT/applied state within 2 minutes again.

typeid avatar Dec 16 '24 13:12 typeid
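For context, the status check that the quoted documentation describes looks roughly like this with the AWS SDK for Go v2 (a sketch; the actual cert-manager solver may use a different SDK version or structure):

```go
package dnssketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/route53"
	"github.com/aws/aws-sdk-go-v2/service/route53/types"
)

// changeIsInsync reports whether the Route53 change identified by changeID
// has propagated, i.e. GetChange returns a status of INSYNC rather than PENDING.
func changeIsInsync(ctx context.Context, client *route53.Client, changeID string) (bool, error) {
	out, err := client.GetChange(ctx, &route53.GetChangeInput{Id: aws.String(changeID)})
	if err != nil {
		return false, err
	}
	return out.ChangeInfo.Status == types.ChangeStatusInsync, nil
}
```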

> When propagation is complete, GetChange returns a status of INSYNC. Changes generally propagate to all Route 53 name servers managing the hosted zone within 60 seconds. [source]
>
> This is, however, no guarantee, just a general time that can be expected. We currently poll for 2 minutes and then create a new ChangeRecordSet event. If AWS has an issue in their backend, we will skip ahead to the next ChangeRecordSet and poll that again for 2 minutes, indefinitely, until AWS resolves their issue and the ChangeRecordSet gets into the UPSERT/applied state within 2 minutes again.

@typeid despite the UPSERT, challenges get stuck forever and there are traces of the error in the reported logs (as seen in OCPBUGS-42199, where cert-manager needs to create 300+ certs):

E0518 16:10:22.720047       1 controller.go:167] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="Time limit exceeded. Last error: " "key"="ocm-production-id/cluster-api-cert-zxkfj-2403688073-3486302731 

I'd prefer using PollUntilContextCancel rather than creating new ChangeRecordSet(s) that get stuck again and again. Thoughts, @ThatsMrTalbot?

swghosh avatar Dec 16 '24 13:12 swghosh

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bashlion

Once this PR has been reviewed and has the lgtm label, please assign euank for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cert-manager-prow[bot] avatar Jan 03 '25 07:01 cert-manager-prow[bot]

@swghosh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cert-manager-master-e2e-v1-32 | 6c9a81b420840c44314c963c22b6ffff64d22bf7 | link | true | /test pull-cert-manager-master-e2e-v1-32 |
| pull-cert-manager-master-e2e-v1-32-upgrade | 6c9a81b420840c44314c963c22b6ffff64d22bf7 | link | true | /test pull-cert-manager-master-e2e-v1-32-upgrade |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

cert-manager-prow[bot] avatar Jan 27 '25 13:01 cert-manager-prow[bot]

> CloudDNS, by contrast, retries forever with a wait period of 1s between every status check.

@swghosh Please link to a source for this. And link to any discussion about the pros and cons of this behaviour.

wallrj avatar Apr 16 '25 14:04 wallrj

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. /lifecycle stale

cert-manager-bot avatar Jul 15 '25 14:07 cert-manager-bot

Stale issues rot after 3 months of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 3 months of inactivity. If this issue is safe to close now please do so with /close. /lifecycle rotten /remove-lifecycle stale

cert-manager-bot avatar Oct 14 '25 15:10 cert-manager-bot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

cert-manager-prow[bot] avatar Nov 04 '25 19:11 cert-manager-prow[bot]