
Error creating new authz :: too many currently pending authorizations

Open drigz opened this issue 7 years ago • 5 comments

Using kubernetes-letsencrypt v1.7 with Cloud DNS and GKE, we've observed a "too many currently pending authorizations" error. This is surprising, since the limit is 300 pending authorizations and we only have ~10 certificates on the domain. kubernetes-letsencrypt was previously working fine, but when a new team member tried to bring up their own cluster, they ran into this issue.

On the Let's Encrypt forums, schoen said:

So I think the likeliest interpretation is [...] it sometimes requests an authorization and then doesn't use it (either requesting an authorization when not requesting a certificate, or requesting an authorization and then crashing or exiting before the corresponding certificate can be requested). This could, for example, be a renewal-related bug if one part of the code says "this certificate should be renewed now" but another part of the code says "this certificate is not yet due for renewal".

and

Maybe this does lead to some useful guidance for client developers: if you get an authz for one requested domain but fail to get it for another, make sure you proactively destroy the first authz before giving up. (If your error was based on repeated failed attempts to get a certificate for a mixture of names you do and don't control, that might be the underlying problem here.)

Is that possible? If we see it again, what can we do to get more debug information?

org.shredzone.acme4j.exception.AcmeRateLimitExceededException: Error creating new authz :: too many currently pending authorizations
        at org.shredzone.acme4j.connector.DefaultConnection.createAcmeException(DefaultConnection.java:394)
        at org.shredzone.acme4j.connector.DefaultConnection.accept(DefaultConnection.java:199)
        at org.shredzone.acme4j.Registration.authorizeDomain(Registration.java:189)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.getAuthorization(CertificateRequestHandler.kt:90)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.authorizeDomain(CertificateRequestHandler.kt:68)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.access$authorizeDomain(CertificateRequestHandler.kt:27)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:41)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:27)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.Collections$2.tryAdvance(Collections.java:4717)
        at java.util.Collections$2.forEachRemaining(Collections.java:4725)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
        at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
        at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
        at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.requestCertificate(CertificateRequestHandler.kt:41)
        at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.handleCertificateRequest(ServiceManager.kt:64)
        at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.access$handleCertificateRequest(ServiceManager.kt:20)
        at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager$reconcileService$1.run(ServiceManager.kt:45)
        at java.lang.Thread.run(Thread.java:745)

drigz avatar Aug 21 '17 10:08 drigz

I've looked through the kubernetes-letsencrypt logs and noticed two things.

One: the CloudDnsResponder threw an exception early on:

Exception in thread "Thread-2" java.lang.UnsupportedOperationException: Empty collection can't be reduced.
    at in.tazj.k8s.letsencrypt.acme.CloudDnsResponder.findMatchingZone(CloudDnsResponder.kt:123)
    at in.tazj.k8s.letsencrypt.acme.CloudDnsResponder.updateCloudDnsRecord(CloudDnsResponder.kt:55)
    at in.tazj.k8s.letsencrypt.acme.CloudDnsResponder.addChallengeRecord(CloudDnsResponder.kt:26)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.prepareDnsChallenge(CertificateRequestHandler.kt:176)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.authorizeDomain(CertificateRequestHandler.kt:77)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.access$authorizeDomain(CertificateRequestHandler.kt:27)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:41)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:27)
    [SNIP: java.util.stream.*]
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.requestCertificate(CertificateRequestHandler.kt:41)
    at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.handleCertificateRequest(ServiceManager.kt:64)
    at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.access$handleCertificateRequest(ServiceManager.kt:20)
    at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager$reconcileService$1.run(ServiceManager.kt:45)
    at java.lang.Thread.run(Thread.java:745)

This appears to be because our Cloud DNS configuration pointed at the wrong zone, so findMatchingZone found no matching managed zone and the responder failed.
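
For context, Kotlin's reduce throws exactly this UnsupportedOperationException when the collection is empty, so a zone lookup that filters down to zero matches and then reduces will crash rather than report "no matching zone". Here is a hypothetical illustration of that pattern (not the actual CloudDnsResponder code; the Zone class and names are made up), where a null-returning selection avoids the exception:

// Hypothetical illustration, not the real CloudDnsResponder code.
// Find the managed zone whose DNS name is a suffix of the record to create.
data class Zone(val dnsName: String)

fun findMatchingZone(zones: List<Zone>, recordName: String): Zone? =
    zones
        .filter { recordName.endsWith(it.dnsName) }
        // .reduce { a, b -> if (a.dnsName.length > b.dnsName.length) a else b }
        //   throws "Empty collection can't be reduced." when nothing matches
        .maxByOrNull { it.dnsName.length }   // returns null instead of throwing

fun main() {
    val zones = listOf(Zone("example.com."), Zone("corp.example.com."))
    println(findMatchingZone(zones, "_acme-challenge.svc.corp.example.com."))  // corp.example.com.
    println(findMatchingZone(zones, "_acme-challenge.other.org."))             // null, no crash
}

A null result could then be turned into a clear "no matching Cloud DNS zone" error for the misconfigured domain instead of an uncaught exception.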

Two: this error occurs 300 times before the rate limit error takes its place. Hitting 300 takes only about an hour because the operation is retried very frequently, and the retries then continue, producing a rate limit error roughly every 45 seconds.

Two things that could help with this (see the sketch after this list):

  • The pending authz should be deactivated if the CloudDnsResponder crashes, so it doesn't count towards the "pending authorizations" limit.
  • Exponential backoff should be used in case of failures, rather than retrying at the current high rate.
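
A rough sketch of both ideas together, assuming acme4j's Authorization.deactivate() is available in the version in use (the function name and the prepareChallenge callback are made up for illustration, not the project's actual API):

import org.shredzone.acme4j.Authorization
import org.shredzone.acme4j.Registration

// Sketch only: request an authz, try to prepare the DNS challenge, and if that
// fails, deactivate the pending authz so it doesn't accumulate against the
// "pending authorizations" rate limit; retry with exponential backoff.
fun authorizeWithCleanup(
    registration: Registration,
    domain: String,
    prepareChallenge: (Authorization) -> Unit,  // e.g. create the DNS record
    maxAttempts: Int = 5
) {
    var delayMs = 1_000L
    repeat(maxAttempts) { attempt ->
        val authorization = registration.authorizeDomain(domain)
        try {
            prepareChallenge(authorization)
            return  // success: keep the authz and continue with the challenge
        } catch (e: Exception) {
            // Best-effort cleanup so the authz doesn't stay pending.
            runCatching { authorization.deactivate() }
            if (attempt == maxAttempts - 1) throw e
            Thread.sleep(delayMs)
            delayMs = minOf(delayMs * 2, 60_000L)  // back off, capped at one minute
        }
    }
}

Deactivating on failure also matches schoen's suggestion above to proactively destroy an authz before giving up.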

drigz avatar Aug 21 '17 10:08 drigz

Thanks for reporting this, I'll look into handling this more gracefully!

tazjin avatar Aug 21 '17 11:08 tazjin

Thanks! FYI, as a workaround, we deleted the letsencrypt-keypair secret. This makes kubernetes-letsencrypt register a new ACME account, which starts with a fresh pending-authorizations quota.

kubectl --namespace kube-system delete secret letsencrypt-keypair

drigz avatar Aug 21 '17 12:08 drigz

Note: LE just enabled pending authorization recycling, which might help avoid this issue:

https://community.letsencrypt.org/t/automatic-recycling-of-pending-authorizations/41321

drigz avatar Aug 31 '17 18:08 drigz

Interesting! I started working on the issues you reported yesterday - but time is currently a scarce resource :-)

tazjin avatar Sep 01 '17 09:09 tazjin