cert-manager
Challenge Records Not Always Cleaned Up
Describe the bug: After successfully completing dns-01 challenges, cert-manager does not always clean up the TXT records it created.
Expected behaviour: All DNS records created for challenges should be deleted once the challenges complete.
Steps to reproduce the bug: TBC. I currently cannot reproduce the issue consistently.
Anything else we need to know?:
Environment details:
- Kubernetes version: v1.17.14-gke.1600
- Cloud-provider/provisioner: GKE
- cert-manager version: 1.1.0
- Install method: Custom helm chart
The issue only seems to affect challenge records provisioned in Google Cloud DNS; we don't see the same behaviour with Cloudflare DNS (though about 95% of our challenges go through Cloud DNS).
For one example, I can see the API requests to create the record in the GCP logging, but no requests to later delete it.
/kind bug
Hi! It looks like a bug during the "finalizer" stage; when the issue happens, would you be able to share the cert-manager-controller logs?
/triage needs-information
Absolutely. I've increased the logging around cert-manager and will grab a copy of the logs the next time it happens.
Hey @maelvls -- Logs are here. The best I could do was a CSV, as they were exported from Kibana. Cheers!
Almost 30 minutes into investigating the logs, I realized that I was looking at the entries in reverse chronological order 😅
I was then surprised by the absence of any line mentioning "finalizer" (something like controller/challenges/finalizer). The removal of the TXT records happens in acmechallenges/sync.go, so maybe the Challenge object never gets deleted?
The challenge itself does seem to be properly deleted (I mean, metadata.deletionTimestamp becomes non-null):
sync.go:101] controller/orders msg="Order has already been completed, cleaning up any owned Challenge resources" resource_kind="Order" resource_name="sauron-adverts-evo-app-tls-78s5d-3403441770" "resource_namespace"="sauron-adverts-evo-app" "resource_version"="v1"
round_trippers.go:443] DELETE https://10.192.0.1:443/apis/acme.cert-manager.io/v1/namespaces/sauron-adverts-evo-app/challenges/sauron-adverts-evo-app-tls-78s5d-3403441770-1727866623 200 OK in 4 milliseconds
Not sure why the finalizer logs don't show :(
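For anyone following along, here's the shape of the flow I'd expect: the TXT-record cleanup is gated on the Challenge's finalizer, and the finalizer should only be removed once cleanup has succeeded. The sketch below is a heavily simplified illustration of that pattern, not the actual code in acmechallenges/sync.go; the Challenge type, the helper functions and the finalizer string are all placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Placeholder finalizer name and types, for illustration only.
const challengeFinalizer = "example.com/challenge-cleanup"

type Challenge struct {
	Name              string
	Finalizers        []string
	DeletionTimestamp *time.Time // set by the API server once deletion is requested
}

func contains(list []string, s string) bool {
	for _, x := range list {
		if x == s {
			return true
		}
	}
	return false
}

func remove(list []string, s string) []string {
	out := make([]string, 0, len(list))
	for _, x := range list {
		if x != s {
			out = append(out, x)
		}
	}
	return out
}

// syncChallenge shows the shape of a finalizer-gated reconcile: while the
// object is live, ensure the finalizer is set before presenting the TXT
// record; once deletion is requested, clean the record up and only then drop
// the finalizer so the API server can finish deleting the object.
func syncChallenge(ch *Challenge, present, cleanup func(*Challenge) error) error {
	if ch.DeletionTimestamp != nil {
		if err := cleanup(ch); err != nil {
			// Keep the finalizer so the deletion is retried; dropping it here
			// would orphan the TXT record.
			return err
		}
		ch.Finalizers = remove(ch.Finalizers, challengeFinalizer)
		return nil
	}
	if !contains(ch.Finalizers, challengeFinalizer) {
		ch.Finalizers = append(ch.Finalizers, challengeFinalizer)
		return nil // persist the finalizer before touching external DNS
	}
	return present(ch)
}

func main() {
	// Simulate a cleanup failure during deletion: the error is surfaced and
	// the finalizer is kept, so the work is retried rather than skipped.
	now := time.Now()
	ch := &Challenge{Name: "demo", Finalizers: []string{challengeFinalizer}, DeletionTimestamp: &now}
	err := syncChallenge(ch,
		func(*Challenge) error { return nil },
		func(*Challenge) error { return errors.New("DNS API unreachable") },
	)
	fmt.Println("cleanup error:", err, "finalizers left:", ch.Finalizers)
}
```

If the controller ever dropped the finalizer (or never added it) without running the cleanup branch, the DELETE would still succeed, which would match what the logs show: a deleted Challenge but an orphaned TXT record.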
Hey, is there any more information you need on this? We're still seeing quite a lot of challenge records left around after certificate issuance.
Happy to collect anything that'd be useful to help debug.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
/remove-lifecycle stale
This is still occurring as of v1.6.
I've been looking at the code and noticed a few problems and potential cleanups:
- [x] Missing unit-tests
- [x] https://github.com/cert-manager/cert-manager/pull/5121
- [ ] https://github.com/cert-manager/cert-manager/pull/5126
- [ ] Challenge finalizer is always removed, regardless of whether solver.cleanup succeeds
- [ ] Challenge finalizer is assumed to be the only (first) finalizer, which breaks if external controllers add their own finalizers (see the sketch below)
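To make the last two points concrete, here is a minimal sketch (with a made-up finalizer name; this is not cert-manager's code) of the fragile pattern versus a removal that only touches our own finalizer, by name, and is meant to be called only after cleanup has actually succeeded:

```go
package main

import "fmt"

// Placeholder finalizer name, for illustration only.
const ourFinalizer = "example.com/challenge-cleanup"

// Fragile pattern: assumes our finalizer is the first (and only) entry, and
// drops it whether or not cleanup succeeded.
func removeFinalizerFragile(finalizers []string) []string {
	return finalizers[1:]
}

// Safer pattern: remove only our own finalizer, by name, leaving finalizers
// added by other controllers in place. Callers should invoke this only after
// cleanup has returned without error.
func removeFinalizerByName(finalizers []string, name string) []string {
	out := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	finalizers := []string{"other.io/protection", ourFinalizer}
	// The fragile version drops whatever happens to be first: here it removes
	// the other controller's finalizer and leaves ours behind.
	fmt.Println(removeFinalizerFragile(finalizers))
	// The by-name version removes only ours: [other.io/protection]
	fmt.Println(removeFinalizerByName(finalizers, ourFinalizer))
}
```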
@wallrj Hi, are there any plans to continue the open PR and progress towards a fix for challenge records not always being cleaned up?
I ran into the same issue with DigitalOcean DNS, which now contains a lot of TXT records from the DNS challenges.
Same issue here, also with DigitalOcean (I didn't try other DNS services)! It's a bit annoying.
Are there any updates on this? We're experiencing the same behavior in 1.13.3 with the azureDNS solver, but only with delegated domains; the regular subdomains in the same DNS zone are cleaned up as normal.
Any update here?
The DigitalOcean TXT records keep piling up.
After several renewals the TXT records get too numerous, the response exceeds the maximum size, and Let's Encrypt refuses to parse it: https://community.letsencrypt.org/t/max-response-size-for-dns-01/122700/6
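If it helps anyone gauge how bad the pile-up is, counting the records published at the _acme-challenge name shows how close the response is to that size limit. A quick sketch (the domain below is a placeholder):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Placeholder domain; substitute the zone that cert-manager manages.
	name := "_acme-challenge.example.com"

	// LookupTXT returns every TXT record currently published at the name.
	// A large count here is the accumulation of stale dns-01 challenge
	// tokens that eventually pushes the DNS response over the size limit.
	records, err := net.LookupTXT(name)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Printf("%d TXT records at %s\n", len(records), name)
}
```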
Is there a solution to the TXT record clean-up issue?
Any simple workaround for this? We have hundreds of these records in our DNS zone.
This is a problem: the behaviour here leads to rate-limiting issues, as DNS automations like cert-manager and external-dns have to perform more and more queries to page through all of the records.