cert-manager
Challenge Records Not Always Cleaned Up
Describe the bug: After successfully completing dns-01 challenges, cert-manager does not always clean up the TXT records it created.
Expected behaviour: All DNS records created for challenges should be deleted once the challenges complete.
Steps to reproduce the bug: TBC. I currently cannot reproduce the issue consistently.
Anything else we need to know?:
Environment details:
- Kubernetes version: v1.17.14-gke.1600
- Cloud-provider/provisioner: GKE
- cert-manager version: 1.1.0
- Install method: Custom helm chart
The issue only seems to affect challenge records provisioned in Google Cloud DNS; we don't see the same behaviour with Cloudflare DNS (though about 95% of our challenges go through Cloud DNS).
For one example, I can see the API requests to create the record in the GCP logging, but no requests to later delete it.
/kind bug
Hi! It looks like a bug during the "finalizer" stage; when the issue happens, would you be able to share the cert-manager-controller logs?
/triage needs-information
Absolutely. I've increased the logging around cert-manager and will grab a copy of the logs the next time it happens.
Hey @maelvls -- Logs are here. The best I could do was a CSV, as they were exported from Kibana. Cheers!
Almost 30 minutes into investigating the logs, I realized that I was looking at the entries in reverse chronological order 😅
I was then surprised by the absence of any line mentioning "finalizer" (something like controller/challenges/finalizer). The removal of the TXT records happens in acmechallenges/sync.go, so maybe the Challenge object never gets deleted?
The challenge itself does seem to be properly deleted (I mean, metadata.deletionTimestamp becomes non-null):
sync.go:101] controller/orders msg="Order has already been completed, cleaning up any owned Challenge resources" resource_kind="Order" resource_name="sauron-adverts-evo-app-tls-78s5d-3403441770" "resource_namespace"="sauron-adverts-evo-app" "resource_version"="v1"
round_trippers.go:443] DELETE https://10.192.0.1:443/apis/acme.cert-manager.io/v1/namespaces/sauron-adverts-evo-app/challenges/sauron-adverts-evo-app-tls-78s5d-3403441770-1727866623 200 OK in 4 milliseconds
Not sure why the finalizer logs don't show :(
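For anyone following along, here's the shape of the flow I'd expect: the TXT-record cleanup is gated on the Challenge's finalizer, and the finalizer should only be removed once cleanup has succeeded. The sketch below is a heavily simplified illustration of that pattern, not the actual code in acmechallenges/sync.go; the Challenge type, the helper functions and the finalizer string are all placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Placeholder finalizer name and types, for illustration only.
const challengeFinalizer = "example.com/challenge-cleanup"

type Challenge struct {
	Name              string
	Finalizers        []string
	DeletionTimestamp *time.Time // set by the API server once deletion is requested
}

func contains(list []string, s string) bool {
	for _, x := range list {
		if x == s {
			return true
		}
	}
	return false
}

func remove(list []string, s string) []string {
	out := make([]string, 0, len(list))
	for _, x := range list {
		if x != s {
			out = append(out, x)
		}
	}
	return out
}

// syncChallenge shows the shape of a finalizer-gated reconcile: while the
// object is live, ensure the finalizer is set before presenting the TXT
// record; once deletion is requested, clean the record up and only then drop
// the finalizer so the API server can finish deleting the object.
func syncChallenge(ch *Challenge, present, cleanup func(*Challenge) error) error {
	if ch.DeletionTimestamp != nil {
		if err := cleanup(ch); err != nil {
			// Keep the finalizer so the deletion is retried; dropping it here
			// would orphan the TXT record.
			return err
		}
		ch.Finalizers = remove(ch.Finalizers, challengeFinalizer)
		return nil
	}
	if !contains(ch.Finalizers, challengeFinalizer) {
		ch.Finalizers = append(ch.Finalizers, challengeFinalizer)
		return nil // persist the finalizer before touching external DNS
	}
	return present(ch)
}

func main() {
	// Simulate a cleanup failure during deletion: the error is surfaced and
	// the finalizer is kept, so the work is retried rather than skipped.
	now := time.Now()
	ch := &Challenge{Name: "demo", Finalizers: []string{challengeFinalizer}, DeletionTimestamp: &now}
	err := syncChallenge(ch,
		func(*Challenge) error { return nil },
		func(*Challenge) error { return errors.New("DNS API unreachable") },
	)
	fmt.Println("cleanup error:", err, "finalizers left:", ch.Finalizers)
}
```

If the controller ever dropped the finalizer (or never added it) without running the cleanup branch, the DELETE would still succeed, which would match what the logs show: a deleted Challenge but an orphaned TXT record.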
Hey, is there any more information you need on this? We're still seeing quite a lot of challenge records left around after certificate issuance.
Happy to collect anything that'd be useful to help debug.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
/remove-lifecycle stale
This is still occurring as of v1.6.
I've been looking at the code and noticed a few problems and potential cleanups:
- [x] Missing unit-tests
- [x] https://github.com/cert-manager/cert-manager/pull/5121
- [ ] https://github.com/cert-manager/cert-manager/pull/5126
- [ ] Challenge finalizer is always removed, regardless of whether solver.cleanup succeeds
- [ ] Challenge finalizer is assumed to be the only (first) finalizer, which breaks if external controllers add their own finalizers (see the sketch below)
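To make the last two points concrete, here is a minimal sketch (with a made-up finalizer name; this is not cert-manager's code) of the fragile pattern versus a removal that only touches our own finalizer, by name, and is meant to be called only after cleanup has actually succeeded:

```go
package main

import "fmt"

// Placeholder finalizer name, for illustration only.
const ourFinalizer = "example.com/challenge-cleanup"

// Fragile pattern: assumes our finalizer is the first (and only) entry, and
// drops it whether or not cleanup succeeded.
func removeFinalizerFragile(finalizers []string) []string {
	return finalizers[1:]
}

// Safer pattern: remove only our own finalizer, by name, leaving finalizers
// added by other controllers in place. Callers should invoke this only after
// cleanup has returned without error.
func removeFinalizerByName(finalizers []string, name string) []string {
	out := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	finalizers := []string{"other.io/protection", ourFinalizer}
	// The fragile version drops whatever happens to be first: here it removes
	// the other controller's finalizer and leaves ours behind.
	fmt.Println(removeFinalizerFragile(finalizers))
	// The by-name version removes only ours: [other.io/protection]
	fmt.Println(removeFinalizerByName(finalizers, ourFinalizer))
}
```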
@wallrj Hi, are there any plans to continue the open PR and progress towards a fix for challenge records not always being cleaned up?
I ran into the same issue with DigitalOcean DNS, which now contains a lot of TXT records from the DNS challenges.
Same issue here, also with DigitalOcean (I didn't try other DNS services)! It's a bit annoying.
Are there any updates on this? We're experiencing the same behavior in 1.13.3 with the azureDNS solver, but only with delegated domains; the regular subdomains in the same DNS zone are cleaned up as normal.
Any update here?
The DigitalOcean TXT records keep piling up.
After several renewals the TXT records get too numerous, the response exceeds the maximum size, and Let's Encrypt refuses to parse it: https://community.letsencrypt.org/t/max-response-size-for-dns-01/122700/6
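If it helps anyone gauge how bad the pile-up is, counting the records published at the _acme-challenge name shows how close the response is to that size limit. A quick sketch (the domain below is a placeholder):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Placeholder domain; substitute the zone that cert-manager manages.
	name := "_acme-challenge.example.com"

	// LookupTXT returns every TXT record currently published at the name.
	// A large count here is the accumulation of stale dns-01 challenge
	// tokens that eventually pushes the DNS response over the size limit.
	records, err := net.LookupTXT(name)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Printf("%d TXT records at %s\n", len(records), name)
}
```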
Is there a solution to the TXT record clean-up issue?
Any simple workaround for this? We have hundreds of these records in our DNS zone.
This is a problem: the behaviour here leads to rate-limiting issues, as DNS automations like cert-manager and external-dns have to perform more and more queries to page through all of the records.