Challenge Records Not Always Cleaned Up

Evesy opened this issue 3 years ago • 26 comments

Describe the bug: After successfully completing dns-01 challenges, cert-manager is not always cleaning up the TXT records it created

Expected behaviour: All DNS records related to challenges should be deleted once the challenge completes.

Steps to reproduce the bug: TBC.

I currently cannot consistently reproduce the issue.

Anything else we need to know?:

Environment details:

  • Kubernetes version: v1.17.14-gke.1600
  • Cloud-provider/provisioner: GKE
  • cert-manager version: 1.1.0
  • Install method: Custom helm chart

The issue only seems to affect challenge records provisioned in Google Cloud DNS; we don't see the same thing for Cloudflare DNS (though about 95% of our challenges are via Cloud DNS).

I can see in the GCP logging for one example the API requests to create the record, but no requests to later delete the record.

/kind bug

Evesy avatar Feb 08 '21 16:02 Evesy

Hi! It looks like a bug during the "finalizer" stage; when the issue happens, would you be able to share the cert-manager-controller logs?

/triage needs-information

maelvls avatar Feb 09 '21 08:02 maelvls

Absolutely. I've increased the logging around cert-manager and will grab a copy of the logs the next time it happens

Evesy avatar Feb 09 '21 18:02 Evesy

Hey @maelvls -- Logs are here. Best I could do was a CSV, as they were exported from Kibana. Cheers

Evesy avatar Feb 11 '21 13:02 Evesy

Almost 30 minutes into investigating the logs, I realized that I was looking at the entries in reverse-chronological order 😅

I was then surprised by the absence of a line that would say "finalizer" (something like controller/challenges/finalizer). The removal of the TXT records happens in acmechallenges/sync.go, so maybe the Challenge object never gets deleted?

The challenge itself seems to be properly deleted (I mean, metadata.deletionTimestamp becomes non-null):

sync.go:101] controller/orders msg="Order has already been completed, cleaning up any owned Challenge resources" resource_kind="Order" resource_name="sauron-adverts-evo-app-tls-78s5d-3403441770" "resource_namespace"="sauron-adverts-evo-app" "resource_version"="v1"
round_trippers.go:443] DELETE https://10.192.0.1:443/apis/acme.cert-manager.io/v1/namespaces/sauron-adverts-evo-app/challenges/sauron-adverts-evo-app-tls-78s5d-3403441770-1727866623 200 OK in 4 milliseconds

Not sure why the finalizer logs don't show :(
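For anyone following along, the flow that is supposed to happen here is the usual finalizer dance: once metadata.deletionTimestamp is set, clean up the TXT record first, and only then remove the finalizer so the API server can actually delete the object. Below is a simplified, self-contained sketch of that pattern; it is not the actual acmechallenges/sync.go code, and the struct and finalizer string are illustrative only.

```go
// A simplified sketch of the expected finalizer pattern; NOT cert-manager's
// real code. The point: remove the finalizer only after cleanup succeeds.
package main

import (
	"fmt"
	"time"
)

// Illustrative finalizer name, not necessarily the exact string cert-manager uses.
const challengeFinalizer = "finalizer.acme.cert-manager.io"

// Challenge is a toy stand-in for the ACME Challenge resource.
type Challenge struct {
	Name              string
	DeletionTimestamp *time.Time
	Finalizers        []string
}

// cleanUpTXTRecord stands in for the DNS-01 solver's cleanup step, the
// part that should delete the _acme-challenge TXT record.
func cleanUpTXTRecord(ch *Challenge) error {
	fmt.Printf("deleting TXT record for %s\n", ch.Name)
	return nil // pretend the provider API call succeeded
}

// syncDeleted handles a Challenge whose deletionTimestamp is set. The key
// property: the finalizer is removed only after cleanup succeeds, so a
// failed cleanup is retried instead of orphaning the TXT record.
func syncDeleted(ch *Challenge) error {
	if ch.DeletionTimestamp == nil {
		return nil // not being deleted, nothing to do
	}
	if err := cleanUpTXTRecord(ch); err != nil {
		return fmt.Errorf("cleanup failed, keeping finalizer to retry: %w", err)
	}
	// Remove only our own finalizer, leaving any others in place.
	kept := ch.Finalizers[:0]
	for _, f := range ch.Finalizers {
		if f != challengeFinalizer {
			kept = append(kept, f)
		}
	}
	ch.Finalizers = kept
	return nil
}

func main() {
	now := time.Now()
	ch := &Challenge{
		Name:              "example-tls-78s5d-0",
		DeletionTimestamp: &now,
		Finalizers:        []string{challengeFinalizer},
	}
	if err := syncDeleted(ch); err != nil {
		fmt.Println(err)
	}
	fmt.Println("remaining finalizers:", ch.Finalizers)
}
```

If cleanup returned an error in this sketch, the finalizer would stay in place and the controller would retry on the next sync; removing the finalizer unconditionally is exactly the kind of behaviour that would leave TXT records behind.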

maelvls avatar Feb 23 '21 15:02 maelvls

Hey, is there any more information you need on this? We're still seeing quite a lot of challenge records left around after certificate issuance.

Happy to collect anything that'd be useful to help debug

Evesy avatar Aug 09 '21 12:08 Evesy

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Nov 07 '21 12:11 jetstack-bot

/remove-lifecycle stale

Evesy avatar Nov 08 '21 11:11 Evesy

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Feb 06 '22 12:02 jetstack-bot

/remove-lifecycle stale

This is still occurring as of 1.6

Evesy avatar Feb 08 '22 14:02 Evesy

I've been looking at the code and noticed a few problems and potential cleanups:

  • [x] Missing unit-tests
  • [x] https://github.com/cert-manager/cert-manager/pull/5121
  • [ ] https://github.com/cert-manager/cert-manager/pull/5126
    • [ ] Challenge Finalizer is always removed, regardless of whether solver.cleanup succeeds
    • [ ] Challenge Finalizer is assumed to be the only (first) finalizer (breaks if external controllers add their own finalizers; see the sketch below)
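To make the second item concrete, here is a small runnable sketch contrasting removal-by-position with removal-by-name when another controller has added its own finalizer first. The helper names are hypothetical and this is not cert-manager's actual code.

```go
// Runnable illustration of the second problem: dropping the first finalizer
// by position removes the wrong one when another controller has prepended
// its own. Helper names and the finalizer string are illustrative.
package main

import "fmt"

const ourFinalizer = "finalizer.acme.cert-manager.io"

// removeByPosition mimics the buggy behaviour: it assumes our finalizer
// is the first (only) element of the slice.
func removeByPosition(finalizers []string) []string {
	if len(finalizers) == 0 {
		return finalizers
	}
	return finalizers[1:]
}

// removeByName is the safe variant: it filters out only our own
// finalizer and leaves everything else untouched.
func removeByName(finalizers []string) []string {
	out := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != ourFinalizer {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	// Another controller added its finalizer before ours.
	finalizers := []string{"example.com/backup-hook", ourFinalizer}

	// Buggy: the foreign finalizer is dropped and ours survives,
	// so the wrong guard is removed from the object.
	fmt.Println(removeByPosition(finalizers)) // [finalizer.acme.cert-manager.io]

	// Correct: only our finalizer is removed.
	fmt.Println(removeByName(finalizers)) // [example.com/backup-hook]
}
```

The first item is the same class of problem: if solver.cleanup fails but the finalizer is removed anyway, nothing remains to trigger a retry and the TXT record is orphaned.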

wallrj avatar May 10 '22 14:05 wallrj

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Aug 12 '22 16:08 jetstack-bot

/remove-lifecycle stale

Evesy avatar Aug 12 '22 21:08 Evesy

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Nov 10 '22 21:11 jetstack-bot

/remove-lifecycle stale

Evesy avatar Nov 11 '22 10:11 Evesy

@wallrj Hi, are there any plans to continue with the open PR to progress towards a fix for challenge records not always being cleaned up?

Evesy avatar Jan 03 '23 15:01 Evesy

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar May 15 '23 11:05 jetstack-bot

/remove-lifecycle stale

I ran into the same issue with the DigitalOcean DNS service; our zone now contains a lot of leftover TXT records from DNS challenges.

mecseid avatar May 15 '23 11:05 mecseid

Same issue here, also with DigitalOcean (didn't try other DNS services)! It's a bit annoying.

maaft avatar Jul 21 '23 07:07 maaft

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Oct 19 '23 07:10 jetstack-bot

/remove-lifecycle stale

Evesy avatar Oct 20 '23 12:10 Evesy

Are there any updates on this? We're experiencing the same behavior in 1.13.3 with the azureDNS solver, but only with delegated domains. The regular subdomains in the same DNS zone are cleaned up as normal.

mycarrysun avatar Dec 19 '23 19:12 mycarrysun

Any update here?

D3CK3R avatar Jan 19 '24 08:01 D3CK3R

The DigitalOcean TXT records keep piling up.

After several renewals the TXT record set grows too large, exceeding the maximum DNS response size, and Let's Encrypt refuses to parse it: https://community.letsencrypt.org/t/max-response-size-for-dns-01/122700/6
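You can see how far the pile-up has gone with a quick lookup. A minimal Go check (the domain below is a placeholder; each DNS-01 value is a 43-character token, so a large count quickly pushes the RRset past what fits in a response):

```go
// Count how many TXT values are currently published on a challenge name.
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	name := "_acme-challenge.example.com" // placeholder; use your real domain
	values, err := net.LookupTXT(name)
	if err != nil {
		log.Fatalf("lookup %s: %v", name, err)
	}
	fmt.Printf("%s currently has %d TXT value(s)\n", name, len(values))
}
```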

Is there a solution to the TXT record cleanup issue?

smeng9 avatar Jan 24 '24 08:01 smeng9

Any simple workaround for this? We have hundreds of records in our DNS zone.

D3CK3R avatar Jan 24 '24 10:01 D3CK3R

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. /lifecycle stale

cert-manager-bot avatar May 03 '24 16:05 cert-manager-bot

/remove-lifecycle stale

mycarrysun avatar May 03 '24 16:05 mycarrysun